Are aarch64 atomics really this sensitive? (A: No)
Mon 22 Apr 2024 12:00 CEST
I noticed a bug in Guile 3.0.9’s aarch64 atomics handling, and found a couple of apparent solutions (1, 2), but one of them is weird enough for me to write this post.
(ETA: Nonstory. The problem was that re-running the mov instruction isn’t idempotent, because by then ldaxr has overwritten x0! Hat tip to Andy Wingo for figuring out what the issue was. I’ve updated the rest of the article, and I’ll leave it here for posterity.)
Long story short, the problem was with the equivalent of C’s atomic_exchange. Here’s the code that Guile’s JIT was generating:
1:
mov x16, x0
ldaxr x0, [x1]
stlxr w17, x16, [x1]
cbnz w17, 1b
This code appears to occasionally lose writes (!). ETA: This code definitely loses
writes when interference means it has to go around the loop.
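For reference, here is a minimal C11 sketch of the operation the JIT sequence is meant to implement. The helper name swap_in is hypothetical and is not taken from Guile; only atomic_exchange itself comes from the C standard library.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper (not Guile code): atomically store `desired`
   into *cell and return whatever value was there before. */
uint64_t swap_in(_Atomic uint64_t *cell, uint64_t desired) {
    return atomic_exchange(cell, desired);
}
```

The important property is that the new value always ends up in memory, no matter how many times the underlying load/store-exclusive loop has to retry.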
The first patch I wrote boringly replaced the lot with a single
swpal x0, x0, [x1]
which is fine if you have an Armv8.1 device to hand, but not if your machine lacks the Large System Extensions (LSE). So I tried, on a hunch, the second patch, which just changed the target of the cbnz, producing code like this:
mov x16, x0
1:
ldaxr x0, [x1]
stlxr w17, x16, [x1]
cbnz w17, 1b
… and the issue disappeared! What! This shouldn’t have made a difference! Should it?
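ETA: And fair enough, too! If the branch targets the mov instruction, then on a retry the mov copies the value that ldaxr just loaded into x0, not the value we meant to store. The stlxr then writes back what was already in memory, meaning that the whole operation simply becomes a no-op assignment.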
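To make the difference concrete, here is a small single-threaded C sketch of the two loops. It is a toy model and not real hardware behaviour: the Model struct, store_exclusive, and the failures_left counter are all assumptions made for illustration, with the store-exclusive simply failing a fixed number of times to stand in for interference.

```c
#include <stdint.h>

/* Toy model (an assumption, not real hardware): one memory word plus a
   counter saying how many store-exclusive attempts should still fail. */
typedef struct {
    uint64_t mem;      /* the word at [x1] */
    int failures_left; /* simulated interference */
} Model;

/* Stands in for stlxr: returns nonzero on failure, 0 on success. */
uint32_t store_exclusive(Model *m, uint64_t value) {
    if (m->failures_left > 0) {
        m->failures_left--;
        return 1;
    }
    m->mem = value;
    return 0;
}

/* Branch targets the mov: on a retry, x16 is reloaded from the x0
   that the load has already clobbered. */
uint64_t exchange_buggy(Model *m, uint64_t x0) {
    uint64_t x16;
    uint32_t w17;
    do {
        x16 = x0;                      /* mov   x16, x0       */
        x0 = m->mem;                   /* ldaxr x0, [x1]      */
        w17 = store_exclusive(m, x16); /* stlxr w17, x16, [x1] */
    } while (w17 != 0);                /* cbnz  w17, 1b       */
    return x0;
}

/* Branch targets the load: x16 keeps the value we meant to store. */
uint64_t exchange_fixed(Model *m, uint64_t x0) {
    uint64_t x16 = x0;                 /* mov   x16, x0       */
    uint32_t w17;
    do {
        x0 = m->mem;                   /* ldaxr x0, [x1]      */
        w17 = store_exclusive(m, x16); /* stlxr w17, x16, [x1] */
    } while (w17 != 0);                /* cbnz  w17, 1b       */
    return x0;
}
```

Starting from a Model with mem = 1 and failures_left = 1, calling exchange_buggy with 42 leaves mem at 1, while exchange_fixed leaves it at 42. Both return the old value 1, which is part of why the lost write is easy to miss.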
Are aarch64 atomics really this sensitive? Is there only One True Instruction Sequence that should be used to implement atomic_exchange? Why does making this seemingly-insignificant change produce such a noticeable effect? ETA: Nothing to see here :-)