Are aarch64 atomics really this sensitive? (A: No)

I noticed a bug in Guile 3.0.9’s aarch64 atomics handling, and found a couple of apparent solutions (1, 2), but one of them is weird enough for me to write this post.

(ETA: Nonstory. The problem was that the mov instruction isn’t idempotent! Hat tip to Andy Wingo for figuring out what the issue was. I’ve updated the rest of the article, and I’ll leave it here for posterity.)

Long story short, the problem was with the equivalent of C’s atomic_exchange. Here’s the code that Guile’s JIT was generating:

    1:  mov     x16, x0
        ldaxr   x0, [x1]
        stlxr   w17, x16, [x1]
        cbnz    w17, 1b

This code appears to occasionally lose writes (!). ETA: This code definitely loses writes when interference means it has to go around the loop.

The first patch I wrote boringly replaced the lot with a single

    swpal   x0, x0, [x1]

which is fine if you have an ARMv8.1 machine with Large System Extensions to hand, but not otherwise. So I tried, on a hunch, the second patch, which just changed the target of the cbnz, producing code like this:

        mov     x16, x0
    1:  ldaxr   x0, [x1]
        stlxr   w17, x16, [x1]
        cbnz    w17, 1b

… and the issue disappeared! What! This shouldn’t have made a difference! Should it? ETA: And fair enough, too! If the branch targets the mov instruction, the value of x0 that ldaxr set is used, meaning that the whole operation simply becomes a no-op assignment.
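
The failure mode is easy to model in Python. Below is a toy sketch (my own, not real hardware semantics): `Cell` stands in for the memory word, with a `store_ex` that can be told to fail once, simulating interference from another core; the two functions mirror the two branch targets.

```python
class Cell:
    """Toy memory word with load-exclusive/store-exclusive; one store
    failure can be forced, simulating interference from another core."""
    def __init__(self, value):
        self.value = value
        self.fail_next_store = False

    def load_ex(self):
        return self.value

    def store_ex(self, v):
        if self.fail_next_store:
            self.fail_next_store = False
            return False
        self.value = v
        return True

def exchange_branch_to_mov(cell, new):      # the buggy sequence
    x0 = new
    while True:                             # 1:
        x16 = x0                            #    mov   x16, x0
        x0 = cell.load_ex()                 #    ldaxr x0, [x1]
        if cell.store_ex(x16):              #    stlxr w17, x16, [x1]
            return x0                       #    cbnz  w17, 1b

def exchange_branch_to_ldaxr(cell, new):    # the fixed sequence
    x0 = new
    x16 = x0                                #    mov   x16, x0
    while True:                             # 1:
        x0 = cell.load_ex()                 #    ldaxr x0, [x1]
        if cell.store_ex(x16):              #    stlxr w17, x16, [x1]
            return x0                       #    cbnz  w17, 1b
```

With no interference, both behave as atomic_exchange. Force one store failure, though, and the first version re-runs the mov with x0 holding the *loaded* value, so it stores the old value back and silently drops the write.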

Are aarch64 atomics really this sensitive? Is there only One True Instruction Sequence that should be used to implement atomic_exchange? Why does making this seemingly-insignificant change produce such a noticeable effect? ETA: Nothing to see here :-)

More pitfalls regarding JavaScript's non-monadic promises

As is well-known, JavaScript’s Promise is not a monad. It will happily treat Promise<Promise<T>> as if it was Promise<T>:

> [123, await Promise.resolve(123), await Promise.resolve(Promise.resolve(123))]
[ 123, 123, 123 ]

This can bite you in unexpected ways. Imagine you have a CSP-like Channel<T> class for sending Ts back and forth. Channel<T> might have a method like this:

async pop(): Promise<T | undefined> { ... }

There’s an obvious problem here: what if undefined is a member of T? So you make sure to note, in the comment attached to Channel<T>, that T is not allowed to include undefined.

But the less obvious problem is that T is not allowed to contain Promise<undefined> either, even though in other contexts a promise of undefined cannot be confused with undefined:

> typeof undefined
'undefined'
> typeof Promise.resolve(undefined)
'object'

To see why this is a problem, instantiate T with Promise<undefined>, and look at the type of pop():

Promise<Promise<undefined> | undefined>

Because JavaScript collapses promises-of-promises to just promises, this is equivalent to just

Promise<undefined>

and you’ve lost the ability to tell whether pop() yielded a T or an undefined.

TypeScript does not warn you about this, incidentally. (Ask me how I know.)


Instead of accepting this loss of structure and adding another caveat to Channel<T> to work around JavaScript’s broken design—“T must not include either undefined or Promise<undefined> or Promise<Promise<undefined>> etc.”—I decided to change the signature of pop():

async pop(): Promise<Maybe<T>> { ... }

type Maybe<T> = Just<T> | undefined;
type Just<T> = { item: T };

Now both Channel<undefined> and Channel<Promise<undefined>> are sensible and work as expected. No more exceptions regarding what Ts a Channel may carry.

When T is Promise<undefined>, in particular, we see that the type of pop() is

Promise<{ item: Promise<undefined> } | undefined>

Because the Promises aren’t immediately nested, JavaScript won’t erase our structure.

(Ironically, we’ve introduced a monad (Maybe<T>) to fix the bad behaviour of something that should have been a monad…)

Python3 is removing crypt.crypt and not replacing it with anything ¯\_(ツ)_/¯

Python 3.13 will, for inscrutable reasons, remove the crypt module from the standard library. The excuses given in PEP 0594 boil down to “here are some good reasons why new code shouldn’t use this module.” What about existing code? Ah well.

So anyway, for those of us who need some way of generating $6$ SHAcrypt SHA-512 shadow-password-database entries from Python, stick the following module into your codebase (you can also download it), and replace code like

crypt.crypt(password, salt=crypt.METHOD_SHA512)
crypt.crypt(password, '$6$salt$...')
crypt.crypt(password, '$6$salt$...') == '$6$salt$...'

with

shacrypt512.shacrypt(password.encode('utf-8'), b'salt')
shacrypt512.password_ok(password.encode('utf-8'), '$6$salt$...')


Without further ado, here’s the module:

# SHAcrypt using SHA-512, after
# Copyright © 2024 Tony Garnock-Jones.
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.

import hashlib
import secrets

alphabet = \
    [ord(c) for c in './0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz']
permutation = [
    [0, 21, 42], [22, 43, 1], [44, 2, 23], [3, 24, 45],
    [25, 46, 4], [47, 5, 26], [6, 27, 48], [28, 49, 7],
    [50, 8, 29], [9, 30, 51], [31, 52, 10], [53, 11, 32],
    [12, 33, 54], [34, 55, 13], [56, 14, 35], [15, 36, 57],
    [37, 58, 16], [59, 17, 38], [18, 39, 60], [40, 61, 19],
    [62, 20, 41], [-1, -1, 63],
]

def encode(bs64):
    result = bytearray(4 * len(permutation))
    i = 0
    for group in permutation:
        g = lambda j: bs64[j] if j != -1 else 0
        bits = g(group[0]) << 16 | g(group[1]) << 8 | g(group[2])
        result[i] = alphabet[bits & 63]
        result[i+1] = alphabet[(bits >> 6) & 63]
        result[i+2] = alphabet[(bits >> 12) & 63]
        result[i+3] = alphabet[(bits >> 18) & 63]
        i = i + 4
    return bytes(result).decode('ascii')[:-2]

def repeats_of(n, bs): return bs * (n // len(bs)) + bs[:n % len(bs)]
def digest(bs): return hashlib.sha512(bs).digest()

def shacrypt(password, salt = None, rounds = 5000):
    if salt is None: salt = encode(secrets.token_bytes(64))[:16].encode('ascii')
    salt = salt[:16]

    B = digest(password + salt + password)
    Ainput = password + salt + repeats_of(len(password), B)
    v = len(password)
    while v > 0:
        Ainput = Ainput + (B if v & 1 else password)
        v = v >> 1
    A = digest(Ainput)

    DP = digest(password * len(password))
    P = repeats_of(len(password), DP)
    DS = digest(salt * (16+A[0]))
    S = repeats_of(len(salt), DS)

    C = A
    for round in range(rounds):
        Cinput = b''
        Cinput = Cinput + (P if round & 1 else C)
        if round % 3: Cinput = Cinput + S
        if round % 7: Cinput = Cinput + P
        Cinput = Cinput + (C if round & 1 else P)
        C = digest(Cinput)

    if rounds == 5000:
        return '$6$' + salt.decode('ascii') + '$' + encode(C)
    else:
        return '$6$rounds=' + str(rounds) + '$' + salt.decode('ascii') + '$' + encode(C)


def extract_salt_and_rounds(i): # i must be '$6$...'
    pieces = i.split('$')
    if pieces[1] != '6': raise TypeError('shacrypt512 only supports $6$ hashes')
    if pieces[2].startswith('rounds='):
        rounds = int(pieces[2][7:])
        if rounds < 1000: rounds = 1000
        if rounds > 999999999: rounds = 999999999
        return (pieces[3].encode('ascii'), rounds)
    else:
        return (pieces[2].encode('ascii'), 5000)

def password_ok(input_password, existing_crypted_password):
    (salt, rounds) = extract_salt_and_rounds(existing_crypted_password)
    return existing_crypted_password == shacrypt(input_password, salt, rounds)

if __name__ == '__main__':
    _test_password = 'Hello world!'.encode('ascii')
    _test_salt = 'saltstring'.encode('ascii')
    _test_rounds = 5000
    _test_crypted_password = '$6$saltstring$svn8UoSVapNtMuq1ukKS4tPQd8iKwSMHWjl/O817G3uBnIFNjnQJuesI68u4OTLiBFdcbYEdFCoEOfaS35inz1'
    assert shacrypt(_test_password, _test_salt, _test_rounds) == _test_crypted_password
    assert password_ok(_test_password, _test_crypted_password)

    _test_password = 'Hello world!'.encode('ascii')
    _test_salt = 'saltstringsaltstring'.encode('ascii')
    _test_rounds = 10000
    _test_crypted_password = '$6$rounds=10000$saltstringsaltst$OW1/O6BYHV6BcXZu8QVeXbDWra3Oeqh0sbHbbMCVNSnCM/UrjmM0Dp8vOuZeHBy/YTBmSK6H9qs/y3RnOaw5v.'
    assert shacrypt(_test_password, _test_salt, _test_rounds) == _test_crypted_password
    assert password_ok(_test_password, _test_crypted_password)

    import sys
    salt = None if len(sys.argv) < 2 else sys.argv[1].encode('ascii')
    print(shacrypt(sys.stdin.readline().strip().encode('utf-8'), salt))

Unmangling paths set by direnv on Windows 11

direnv works fine on Windows 11, but if an .envrc tries to set the PATH, the result will be a path in Windows format, not Unix format.1

Instead of adding eval $(direnv hook bash) to your .bashrc, try the following snippet:

export _unmangle_direnv_names='PATH'
_unmangle_direnv_paths() {
    for k in $_unmangle_direnv_names; do
        eval "$k=\"\$(/usr/bin/cygpath -p \"\$$k\")\""
    done
}
eval "$(direnv hook bash | sed -e 's@export bash)@export bash)\
_unmangle_direnv_paths@')"

This modifies the output of direnv hook bash slightly, adding code to fix path-like variables after direnv sets the environment up.2

The variable names to unmangle are drawn from a new variable, _unmangle_direnv_names, initially set to PATH, which should contain a space-separated list of variable names.

If, in a particular .envrc, you need path-unmangling for an additional variable, you can add that variable’s name to _unmangle_direnv_names. For example,

_unmangle_direnv_names="$_unmangle_direnv_names XPATH"
export PATH="$PATH:some_addition"
export XPATH="$PATH:some_addition"

will unmangle both PATH and XPATH.

  1. See direnv issues 253 (“PATH gets mangled when using direnv from git-bash on Windows”) and 796 (“Incorrect path format is exported on Windows 10 with mintty / git bash, breaking the PATH and command resolution”). 

  2. While experimenting, I discovered direnv export json! Very nice. It’s great to see more and more tools using structured data for their inputs and outputs. 

On the harm caused by missing basic (basic!) functionality in Signal, WhatsApp, Android and iOS

Trusting Signal and/or WhatsApp and/or Android (google) and/or iOS (apple) with your precious photos, videos, and chats is a huge mistake.

All the photos, videos and chat history on my mother’s phone are now completely inaccessible to us, despite having recent backups and all the necessary keys and passphrases.


Can I transfer Signal backups directly from android to iphone? No.

Can I make a backup file and transfer that wirelessly without uploading it to the cloud? No.

Can I transfer it using a USB cable? Also no.

If I transfer it using the cloud, can I then restore from it on the iphone? … No.

It is not possible to transfer Signal chat history and media from an android phone to an iphone.


What about Whatsapp?

Can I transfer Whatsapp backups directly from android to iphone? No.

Can I make a backup file and transfer that? No.

Can I back up to google drive and use that? No.

Even if I could get the backup file, would I be able to import it on the iphone? Also no.

It is not possible to transfer WhatsApp chat history and media from an android phone to an iphone.

The entire ecosystem is sick

Signal is to blame: they do not make it possible to import backup data on iphone. They do not offer tools for working with backup data.

WhatsApp is to blame: they do not make it possible to retrieve or work with android backup data. They do not make it possible to migrate data from android to ios without fully wiping and resetting the iphone (and it didn’t work even when we did fully wipe it).

Apple is to blame: there is no way to transfer files from android without going through the cloud. Even using a usb cable doesn’t work.

Google is to blame: there is no way to transfer files to iphone without going through the cloud. There is no way to access a whatsapp backup blob in google drive containing my own data.

We all are to blame: we have accepted and continue to make excuses for an industry that acts in such a user-hostile way.

Simpler Preserves Binary Syntax

I’ve just updated the Preserves spec to version 0.990.0. I feel like a 1.0-rc is approaching!

The main change since spec version 0.7.1 has been to simplify the binary syntax for Preserves Values:

  • Both “short” and “medium” SignedInteger representations (starting with tags 0x9x/0xAx) were removed. They weren’t pulling their weight. Every SignedInteger now has tag 0xB0.

  • Float and Double are now encoded with tag 0x87 and a length, rather than with fixed tags 0x82 and 0x83, opening the door to other IEEE754 formats in future.

Here’s the 0.990.0 syntax in “reference card” format, where we write «V» for the binary encoding of some value V:

                      «#f» = [0x80]
                      «#t» = [0x81]

                    «@W V» = [0x85] ++ «W» ++ «V»
                     «#!V» = [0x86] ++ «V»

  «V» if V ∈ Float         = [0x87, 0x04] ++ binary32(V)
  «V» if V ∈ Double        = [0x87, 0x08] ++ binary64(V)

  «V» if V ∈ SignedInteger = [0xB0] ++ varint(|intbytes(V)|) ++ intbytes(V)
  «V» if V ∈ String        = [0xB1] ++ varint(|utf8(V)|) ++ utf8(V)
  «V» if V ∈ ByteString    = [0xB2] ++ varint(|V|) ++ V
  «V» if V ∈ Symbol        = [0xB3] ++ varint(|utf8(V)|) ++ utf8(V)

           «<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
             «[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
            «#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
     «{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]

                 varint(v) = [v] if v < 128
                             [(v & 0x7F) + 128] ++ varint(v >> 7) if v ≥ 128
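
The varint and String rules above transcribe directly into Python. Here is a minimal sketch (encode_string is my name for the «V» if V ∈ String rule, not a name from the spec):

```python
def varint(v: int) -> bytes:
    # Base 128, least-significant group first; the high bit of each
    # byte means "more bytes follow", exactly as in the rule above.
    if v < 128:
        return bytes([v])
    return bytes([(v & 0x7F) | 0x80]) + varint(v >> 7)

def encode_string(s: str) -> bytes:
    # «V» if V ∈ String: tag 0xB1, then the varint of the UTF-8
    # length, then the UTF-8 bytes themselves.
    u = s.encode('utf-8')
    return bytes([0xB1]) + varint(len(u)) + u
```

So encode_string('abc') is b'\xb1\x03abc', and varint(300) is b'\xac\x02'.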

Joining Markdown tables

I’ve been working with Markdown tables a lot recently. The lovely markdown-mode for Emacs makes this easy and pleasant. However, I find myself treating the tables a little like CSV, and started wanting to run joins on tables often enough that I hacked together my first little Emacs package, markdown-join.

Here’s a demo:

The code is a single file, markdown-join.el. I’ve submitted it to MELPA.

Virtualizing uxn

uxn is a delightful virtual-machine specification (with lots of implementations) supporting 16-bit operation.

At present, it is a “real mode” machine, and there is no support for running virtual uxn machines under an uxn hypervisor.

It’d be neat to have a virtualizable uxn system: you could write a full operating system, with memory protection and multi-tasking, entirely in uxntal, and existing ROMs wouldn’t even know the difference! Or you could write a debugger that made good use of the native uxn core where possible. You could serialize a program, suspending it to be resumed later, or transfer it across the network to a different host, all without having to modify the base uxn emulator.

So: Can we make uxn virtualizable, so that an uxn hypervisor can manage one or more child virtual machines, each running a program presented with a seamless illusion that it’s the only software on the machine? Let’s find out!

What is virtualizability?

In 1974, Gerald J. Popek and Robert P. Goldberg wrote a seminal paper laying out a simple and robust approach to determining whether a particular machine architecture is virtualizable. Quoting from the paper’s Wikipedia page:

There are three properties of interest when analyzing the environment created by a VMM [a Virtual Machine Monitor, a.k.a. a Hypervisor]:

Equivalence / Fidelity. A program running under the VMM should exhibit a behavior essentially identical to that demonstrated when running on an equivalent machine directly.

Resource control / Safety. The VMM must be in complete control of the virtualized resources.

Efficiency / Performance. A statistically dominant fraction of machine instructions must be executed without VMM intervention.

Is real-mode uxn virtualizable?

Uxn as specified doesn’t have a supervisor-mode/user-mode split, so we will have to adapt Popek and Goldberg’s development.

Sometimes the uxn machine takes a trap, but its behavior is defined to transition to a metamachine, not to some supervisor mode. In case of errors, the metamachine will often reactivate the machine at some error handler vector, but again, it’s in normal operation mode, not some supervisor mode. This means that there are no privileged instructions in uxn.

There are no base-and-bounds registers, and the range of pointers exactly covers accessible RAM (64k!), so no out-of-bounds RAM accesses are possible. However, it is possible to underflow or overflow the stack, which falls within the spirit of Popek and Goldberg’s memorytrap. When this happens, the metamachine takes over. BRK and DIV are the only instructions that trap or fault other than by memorytrap.

There is no way to change the amount of available memory, and no mode changes possible, so there are no control sensitive instructions.

While there is no memory virtualization in uxn, consideration of behavior sensitive instructions requires us to posit an address translation facility. But we can easily make sure this doesn’t fall foul of Popek and Goldberg’s first (relocation) criterion of behavior sensitivity; and as there are no modes, their second criterion fails also. So there are no behavior sensitive instructions.

Thus, because all of the (nonexistent) control- and/or behavior-sensitive instructions fall in the (empty) set of privileged instructions, we can rely on Popek and Goldberg’s theorem 1, and say that construction of a VMM for uxn must be feasible. In fact, it’s almost there as it stands.

Reaching out further, to recursive virtualizability, is in my opinion the simplest path to virtualization. Execution of uxn is deterministic, given the devices in play today! Thus, by Popek and Goldberg’s theorem 2, uxn must be able to be recursively virtualizable.

Virtualized uxn

Let’s explore one possible approach to actually virtualizing uxn.

At the foundation of a virtualization tree lies a real, physical machine, with real RAM and real I/O devices. That physical base machine has a physical RAM pointer. Every virtual machine in the tree implicitly accesses the physical RAM.

Each virtual machine also has private registers, and private memory. Recall we’re considering recursive virtualization, where the program a hypervisor supervises may in turn be a hypervisor.

  • Registers
    • 32-bit base, relative to its parent’s memory allocation
    • 32-bit bound (length of accessible memory)
    • 16-bit PC
    • a trapCode register
  • Memory, all of which MUST be disjoint from the allowed base-and-bound region of RAM
    • 256 bytes of device memory
    • two 255-byte stacks, plus two 1-byte stack pointers
    • two 32-byte device mask bitsets and one 16-entry × 16-bits device version vector
    • a 16-byte trap description area

The base and bound are new. Machines do not have access to them; they are only able to access the registers of “child” VMs. An exception is made to allow a VM to read its own bound register. It does this via the system device “expansion” port, using an operation getBound, described below.

The outermost VM has a base of 0 and a bound of however large the physical RAM is.

All memory-accessing operations (including loading an instruction) take the base and bound into account. Accesses falling outside physical RAM or outside the base-and-bound cause a memory fault.
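
A minimal sketch of that check in Python (the names and the fault type are mine, not part of the spec, and a nested VM’s accesses would compose translations through each ancestor in turn):

```python
class MemoryFault(Exception):
    pass

class VM:
    def __init__(self, base, bound):
        self.base = base    # 32-bit base, relative to the parent's allocation
        self.bound = bound  # length of accessible memory

def translate(vm, addr, physical_ram_size):
    """Map a VM-relative address to a physical address, faulting on
    accesses outside the base-and-bound or outside physical RAM."""
    if addr >= vm.bound:
        raise MemoryFault('address 0x%x outside bound 0x%x' % (addr, vm.bound))
    phys = vm.base + addr
    if phys >= physical_ram_size:
        raise MemoryFault('physical address 0x%x outside RAM' % phys)
    return phys
```

The outermost VM, with base 0 and bound equal to the physical RAM size, translates every address to itself.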

The device mask bitsets still control device effects, but instead of calling a DEI/DEO handler routine as a subroutine of the VM’s evaluator, the evaluator traps, yielding control to the parent VM, which performs the requested action.

Every time a VM stops execution, its trapCode and trap description area are updated to describe the cause of the stop.

System device expansion operations

Adding the following operations is sufficient to allow writing a hypervisor in uxntal:

name      operation  fields
getBound  #10        &boundHi $2  &boundLo $2
vmExec    #11        &vmAddr $2


getBound: the handler retrieves the 32-bit active bound on memory access, overwriting the four bytes following the operation code.


vmExec: the handler expects a VM Control Block (VMCB) to exist at vmAddr:

field name       offset  size (bytes)  description
parentLink          0        4         scratch space for control transfer
baseHi              4        2         high short of base register
baseLo              6        2         low short of base register
boundHi             8        2         high short of bound register
boundLo            10        2         low short of bound register
pc                 12        2         program counter
trapCode           14        2         trap code
trapDescription    16       16         trap detail record
deiMask            32       32         device input mask bitset
deoMask            64       32         device output mask bitset
devVers            96       32         device version vector
(reserved)        128      128
workStack         256      255         work stack base
workStackPtr      511        1         work stack pointer
returnStack       512      255         return stack base
returnStackPtr    767        1         return stack pointer
devMemory         768      256         device memory

The operation performs a context switch to the described VM. In the following, next denotes the VMCB pointed to by vmAddr, and curr the running VM’s VMCB.

The absolute base absBase is computed from the running VM’s base register and the base register in next. If absBase, or absBase plus next’s bound, lies outside curr’s base-and-bound region, curr takes a memory fault. If any part of the next VMCB overlaps the region addressed by absBase and next’s bound, curr takes a memory fault.1

If these checks all pass, curr’s VMCB address is placed in parentLink (it will not be visible to any VM; see below) and execution resumes in next.
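
Those checks can be sketched in a few lines of Python (the function and fault names are mine; 1024 is the VMCB size implied by the table above, devMemory ending at offset 768 + 256):

```python
VMCB_SIZE = 1024  # devMemory ends at offset 768 + 256

class MemoryFault(Exception):
    pass

def vm_exec_checks(curr_base, curr_bound, next_base, next_bound, vmcb_addr):
    """Validate a vmExec request. next_base and vmcb_addr are relative
    to curr; curr_base is absolute. Returns the child's absolute base."""
    abs_base = curr_base + next_base
    # next's window must lie wholly inside curr's window:
    if next_base >= curr_bound or next_base + next_bound > curr_bound:
        raise MemoryFault('child window escapes parent window')
    # the VMCB must not overlap the region the child can address:
    abs_vmcb = curr_base + vmcb_addr
    if abs_vmcb < abs_base + next_bound and abs_base < abs_vmcb + VMCB_SIZE:
        raise MemoryFault('VMCB overlaps child memory')
    return abs_base
```

Note how the second check forces the VMCB to live outside the child’s window, which is what makes a child’s addressable memory strictly smaller than its parent’s.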

Traps and faults

When a running VM takes a trap or a fault, it transfers control to its parent VM. The parent VM’s address is retrieved from the parentLink field of the running VMCB (curr). If it is 0, there is no parent, and this is the outermost VM; the pre-virtualization behavior applies. Otherwise, curr’s trapCode and trapDescription are set appropriately, its parentLink field is cleared (ensuring that the VM cannot learn anything about its absolute location), and the parent VM is resumed.

Modified instructions

The effect of BRK is altered: it takes a trap with trapCode of BRK and no description.

The effects of DEI and DEO are also altered. Instead of invoking a dei/deo function inside the evaluator, the machine takes a trap with trapCode set to IO. The suspended VM’s PC is left pointing after the DEI/DEO instruction; the arguments to DEI/DEO are included in the trapDescription.

Memory and stack faults

When an out-of-bounds access occurs, the stack underflows, or the stack overflows, the machine takes a fault with an appropriate trapCode, and with trapDescription including details of the fault. The machine’s PC and stack are left alone so that if the machine is immediately restarted, it will precisely repeat the fault. This could be used to e.g. provide a (small) virtual memory, or an (unbounded) virtual stack.


I have implemented emulator support for virtualization, as sketched above, here. I’ve also written a simple hypervisor called xun (like Xen, see?). To see the code, check out the virtuxn3 branch of the git repository.

The implementation is done in two stages:

  • the first stage (diffs from virtuxn3-base to virtuxn3-novirt) splits out “MMU” and IO “processors” into a “system-on-chip”, uxnsoc.[ch], updating the various core implementations to use the slightly-different interface to memory and device IO.

  • the second stage (diffs from virtuxn3-novirt to virtuxn3) is simpler, providing an alternative compatible “system-on-chip” implementation supporting the new system extension commands and providing the necessary machinery to swap in and out of nested VMs as required.

The hypervisor, xun, is a little program that is currently hardcoded to load a particular ROM, bin/xuninner.rom. It cannot yet recursively run itself running another program, and lacks configurability. It also doesn’t do anything about device multiplexing, like a real multi-tasking operating system would have to do. But the basics are there, enough to show it working: it can, for example, run piano.rom without problems.

Differences between direct and virtualized execution

Because the devices take virtual addresses, rather than physical addresses, a supervisor has to copy down media such as sprites and audio samples before asking the hardware to use them. This leads to noticeable differences in behavior: running piano.rom, sounding a note, and then scribbling on the waveform causes the playing note to change its timbre. When running the same program under xun, however, the note stays the same; the edit to the timbre doesn’t take effect until the next note is played.

One change that could ameliorate this would be to make the various addresses used in device memory into 32-bit physical addresses (16 bit page / 16-bit offset). That way, translated addresses from nested VMs could be used without having to copy them. (A second benefit would be the ability for ROMs to use media outside the bottom 64k without having to copy them first.)

Next steps

It’d be nice to adapt the devices to support 32-bit addresses for sprite and audio sample data etc. It can be done backward-compatibly; I need to write down the approach I have in mind. There’s also a bit of implementation work remaining in xun to allow full recursive virtualization; it’s, in some sense, a Trivial Matter of Programming, but the details are actually a little more fiddly than they have to be and perhaps some changes to the device interfaces could make things easier for hypervisors without complicating life for simple real-mode ROMs. There’s nascent support in baseline uxn for self-describing devices: I’d like to see this fleshed out a bit so that nested VMs could reliably discover the features made available to them by their hypervisor. Ultimately, it’d be amazing to build something a little System-7 like using virtualized uxn for running programs. The nested VM interface could even be widened to include a “fuel counter” of instructions before a “fuel exhausted” trap, allowing preemptive multitasking.

The end

All in all, uxn virtualizes very nicely. Writing a hypervisor entirely in uxntal works well with minimal extensions to the VM specification. I’m looking forward to seeing how things develop!

  1. A corollary of this is that a nested VM’s addressable memory must always be smaller (at least by the size of one VMCB) than its parent’s addressable memory. 

File distribution over DNS: (ab)using DNS as a CDN

This is the story of a one-afternoon hack that turned into a one-weekend hack.

I woke up on Saturday with a silly idea: what would it be like to use content-defined chunking (CDC) and serve the chunks over DNS? DNS caching could make scaleout and incremental updates quite efficient, at least in theory. Strong hashing gives robust download integrity. DNS replication gives you a kind of high availability, even!

After a coffee, I figured I may as well try it out.

TL;DR. It works, more or less, so long as your resolver properly upgrades to DNS-over-TCP when it gets a truncated UDP response. The immutable, strongly-named chunks are served in TXT records (!) and are cached by resolvers for a long time. This lowers load on the authoritative DNS server. Each stored “file” gets a user-friendly name for its root chunk in the form of a CNAME with a much shorter TTL.

You can try it out with:

docker run -i --rm leastfixedpoint/nscdn \
    nscdn get SEKIENAKASHITA.DEMO.NSCDN.ORG. > SekienAkashita.jpg

which downloads this image using nothing but DNS.

A demo server is running on the domain and the code is available at

How it works

The Ronomon variant of FastCDC, due to Joran Greef, is a JavaScript-friendly, 32-bit content-defined chunking algorithm that splits big files into chunks with a distribution of sizes having an average, minimum and maximum length.1

For this project I chose an 8k minimum size, a 16k average, and a 48k upper limit, because of the limitations involved in serving large amounts of data via DNS.

The core idea is to use the CDC algorithm to slice up a data file, and then construct a broad, shallow Merkle tree from it, serving each leaf and inner node of the tree as a separate DNS TXT record associated with a domain label including the Base32 encoding of the BLAKE2b hash of the data.

The example file from above, SekienAkashita.jpg, is split up into five chunks and one inner node that lists the chunks:


For a larger file—say, the Linux kernel—the index node itself would be longer than permitted for a chunk, so it would be split up itself and another level would be added to the tree to index the chunks of the lower-level node. This recursion can be repeated as required. I chose a 64-bit file size limit.

Storing a tree in DNS

A CNAME record from the human-usable name of each file (SEKIENAKASHITA.DEMO.NSCDN.ORG) points to the DNS record of the root node of the file’s Merkle tree.

Each tree node is stored as raw binary (!) in a DNS TXT record. Each record’s DNS label is formed from the type code of the node and the base32 encoding of the BLAKE2b hash of the binary content.

The content of leaf nodes (type 1) is just the binary data associated with the chunk. Each inner node (type 2) is a sequence of 64-byte “pointer” records containing a binary form of child nodes’ DNS labels, along with a length for each child. This allows random-access retrieval of subranges of each file.
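
As an illustration of the naming scheme, a node’s label might be derived as below. This is a guess at the shape, not the exact nscdn format: the digest length, separator, and letter case here are my assumptions.

```python
import base64
import hashlib

def node_label(node_type: int, content: bytes) -> str:
    """Derive a DNS label from a node's type code plus the Base32
    encoding of the BLAKE2b hash of its binary content (format details
    assumed, not taken from nscdn)."""
    digest = hashlib.blake2b(content, digest_size=32).digest()
    b32 = base64.b32encode(digest).decode('ascii').rstrip('=')
    return '%d-%s' % (node_type, b32)
```

A 32-byte digest yields a 52-character Base32 string, comfortably under DNS’s 63-octet label limit.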

Weirdly, this is all completely standards-compliant use of DNS. The only place I’m pushing the limits is the large-ish TXT records, effectively mandating use of DNS’s fallback TCP mode.

Serving the data: stick it in SQLite, tiny server, job done

Any ordinary DNS server can be used to serve the records from a domain under one’s control.

However, I decided to hack together my own little one that could serve records straight out of a SQLite database. I used it as an excuse to experiment with Golang again for the first time in more than a decade.2

I cobbled together a program called nscdnd which uses

Similarly, a little tool called nscdn allows insertion (nscdn add) and deletion (nscdn del) of files from a database, plus retrieval and reassembly of a file from a given domain name (nscdn get).

The bulk of the interesting code is in

Okay, but is this a good idea?

¯\_(ツ)_/¯ It works surprisingly well in the limited testing I’ve done. I doubt it’s the most efficient way to transfer files, but it’s not wildly unreasonable. The idea of getting the chunks cached by caching resolvers between clients and the authoritative server seems to work well: when I re-download something, it only hits the authoritative server for the short-lived CNAME and the root-node TXT record. The other nodes in the tree seem to be cached somewhat locally to my client.

Try it out yourself!

You can retrieve files from the demo server, as previously mentioned:

docker run -i --rm leastfixedpoint/nscdn \
    nscdn get SEKIENAKASHITA.DEMO.NSCDN.ORG. > SekienAkashita.jpg

You can also run an nscdnd instance for a domain you control. All the following examples use docker, but you can just check out the repository and build it yourself too.

To run a server:

docker run -d -p 53:53 -p 53:53/udp \
    -v `pwd`/store.sqlite3:/data/store.sqlite3 \
    --env \
    --name nscdnd \

and add files to the store:

docker run -i --rm \
    -v `pwd`/store.sqlite3:/data/store.sqlite3 \
    leastfixedpoint/nscdn \
    nscdn add /data/store.sqlite3 SOMEFILENAME  < SomeFilename.bin

Then add an NS record pointing to it: 86400 IN NS

and retrieve your files:

docker run -i --rm leastfixedpoint/nscdn \
    nscdn get

You can also run dig +short -t txt to get information about the running server.

For the server, it looks like this:

$ dig +short -t txt
"server=nscdnd" "version=v0.3.1" "SPDX-License-Identifier=AGPL-3.0-or-later" "source="

Finally, you can use dig to retrieve CNAME and TXT records without using the nscdn tool:



(Dig presents the TXT records using the slightly peculiar decimal escaping syntax from RFC 1035.)

Future directions

Compression of individual chunks. At the moment, chunks are served uncompressed. It’d be a friendly thing to do to compress the contents of each TXT record.

Experimenting with chunk sizes. Is the distribution of chunk sizes with the current parameters reasonable? Are smaller chunks required, operationally, given we’re kind of pushing DNS to its limits here? Could larger chunk sizes work?

How does it perform? How could performance be improved?

Garbage-collection of chunks in a store. At present, running nscdn del just removes the CNAME link to a root chunk. It doesn’t traverse the graph to remove unreferenced chunks from the store.

Incremental downloads, partial downloads, recovery/repair of files.

Statistics on sharing of chunks in a store. Say you used a store to distribute multiple releases of a piece of software. It’d be interesting to know how much sharing of chunks exists among the different release files.

Download and assemble files in the browser? So DNS-over-HTTPS is a thing. What happens if we use, for example, the Cloudflare DoH API to retrieve our chunks?

JSON.stringify(await (await fetch(
    { headers: { "accept": "application/dns-json" } })).json(), null, 2)
{
  "Status": 0,
  "TC": false, "RD": true, "RA": true, "AD": false, "CD": false,
  "Question": [{
      "name": "",
      "type": 16
  }],
  "Answer": [{
      "name": "",
      "type": 16,
      "TTL": 284,
      "data": "\"server=nscdnd\"\"version=v0.3.1\"\"SPDX-License-Identifier=AGPL-3.0-or-later\"\"source=\""
  }]
}
Hmm. Promising. Unfortunately, it doesn’t seem to like the TXT records having binary data in them (bug? certainly TXT records are allowed to hold binary data, see here and then here), so it might not Just Work. Perhaps a little redesign to use Base64 in the TXT record bodies is required; or perhaps the application/dns-message binary-format API would work.

Encryption of stored files. Perhaps something simple like age.

Insertion of data across the network. While Dynamic DNS is a thing, it’s not quite suitable for this purpose. Perhaps some way of inserting or deleting files other than the current ssh-to-the-server-and-run-a-command would be nice.

Content-defined chunking of index nodes. For this quick prototype, I split non-leaf nodes in a tree into 512-pointer chunks, but it’d be nicer to split them using some variation on the CDC algorithm already used to split the raw binary data in the file. That way, incremental updates would be more resilient to insertions and deletions.

Think more about Named Data Networking. Remember Named Data Networking (NDN)? It’s an alternative Internet architecture, initially kicked off by Van Jacobson and colleagues. (It used to be known as CCN, Content-Centric Networking.) This DNS hack is a cheap-and-nasty system that has quite a bit in common with the design ideas of NDN, though obviously it’s desperately unsophisticated by comparison.

The End

Anyway, I had a lot of fun hacking this together and relearning Go, though I was a little embarrassed when I found myself spending a lot of time at the beginning browsing for a domain to buy to show it off…

  1. The distribution of chunk sizes is pretty strange, though. I don’t know in what sense the “average” actually is an average; the minimum and maximum cut-offs, at least, are enforced by the algorithm. For more detail, see the FastCDC paper.

  2. So I have, uh, criticisms of Go. But it’s nothing that hasn’t been said before. Let’s just say that the experience reminded me of programming JavaScript in the bad old days before TC39 really got going. Overall, the experience was “ok, I guess”.