Are aarch64 atomics really this sensitive? (A: No)

I noticed a bug in Guile 3.0.9’s aarch64 atomics handling, and found a couple of apparent solutions (1, 2), but one of them is weird enough for me to write this post.

(ETA: Nonstory. The problem was that the retry loop isn’t idempotent: by the time the branch re-executes the mov, ldaxr has already clobbered x0. Hat tip to Andy Wingo for figuring out what the issue was. I’ve updated the rest of the article, and I’ll leave it here for posterity.)

Long story short, the problem was with the equivalent of C’s atomic_exchange. Here’s the code that Guile’s JIT was generating:

1:
    mov     x16, x0         // x16 := value to store (taken from x0)
    ldaxr   x0, [x1]        // load-acquire exclusive: x0 := old *x1
    stlxr   w17, x16, [x1]  // store-release exclusive: *x1 := x16; w17 = 0 on success
    cbnz    w17, 1b         // on failure, retry from the mov

This code appears to occasionally lose writes (!). ETA: This code definitely loses writes when interference means it has to go around the loop.
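
For reference, the operation being implemented here is C11’s atomic_exchange: atomically store a new value and hand back the old one. A minimal sketch (the function name and types are just illustrative, not Guile’s actual code):

    #include <stdatomic.h>

    /* What the JIT sequence above is supposed to do: atomically store `v`
       into *p and return whatever was there before.  Illustrative sketch. */
    static long exchange(_Atomic long *p, long v)
    {
        return atomic_exchange(p, v);   /* seq_cst by default */
    }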

The first patch I wrote boringly replaced the lot with a single

    swpal   x0, x0, [x1]    // atomic swap: store x0 to *x1, old value returned in x0

which is fine if you have an ARMv8.1 device to hand, but no good on machines without the Large System Extensions (LSE). So I tried, on a hunch, the second patch, which just changed the target of the cbnz, producing code like this:
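
(Since the JIT emits code at runtime, it could in principle pick between the two sequences on the fly; on aarch64 Linux the kernel advertises LSE through the HWCAP_ATOMICS auxv bit. A Linux-only sketch, not what either patch actually does:)

    #include <sys/auxv.h>    /* getauxval, AT_HWCAP */
    #include <asm/hwcap.h>   /* HWCAP_ATOMICS (aarch64 Linux) */

    /* Nonzero if the CPU implements the LSE atomics (swpal and friends),
       so a JIT could emit the single-instruction form and otherwise fall
       back to an ldaxr/stlxr loop. */
    static int have_lse_atomics(void)
    {
        return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
    }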

    mov     x16, x0         // x16 := value to store; now outside the loop
1:
    ldaxr   x0, [x1]        // x0 := old *x1
    stlxr   w17, x16, [x1]  // attempt *x1 := x16; w17 = 0 on success
    cbnz    w17, 1b         // on failure, retry from the ldaxr; x16 stays intact

… and the issue disappeared! What! This shouldn’t have made a difference! Should it? ETA: And fair enough, too! If the branch targets the mov instruction, the retry copies the value that ldaxr just loaded from memory into x16, so the store simply writes back what was already there: the whole operation degenerates into a no-op assignment and the write is lost.
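
The corrected sequence has the same shape as the portable way of writing an exchange as a compare-and-swap retry loop: the value to store is fixed before the loop and never touched by the retry. A rough C analogue (a sketch, not Guile’s code):

    #include <stdatomic.h>

    /* C analogue of the fixed sequence: `desired` plays the role of x16 and
       is set once, before the loop; the retry only refreshes `expected`
       (the role of the x0 that ldaxr reloads).  The original bug corresponds
       to re-deriving `desired` from `expected` inside the loop. */
    static long exchange_loop(_Atomic long *p, long desired)
    {
        long expected = atomic_load_explicit(p, memory_order_relaxed);
        while (!atomic_compare_exchange_weak_explicit(
                   p, &expected, desired,
                   memory_order_seq_cst, memory_order_relaxed)) {
            /* `expected` now holds the current *p; `desired` is unchanged,
               so the retry still stores the new value. */
        }
        return expected;
    }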

Are aarch64 atomics really this sensitive? Is there only One True Instruction Sequence that should be used to implement atomic_exchange? Why does making this seemingly-insignificant change produce such a noticeable effect? ETA: Nothing to see here :-)