NaCl/libsodium binding for Pharo and Squeak

I’ve just written Pharo/Squeak bindings to libsodium, which is a portable shared-library version of the NaCl cryptography library. A good description of the motivation of the library is this PDF.

Installing the software

To use the bindings, you will need to install the Monticello package Crypto-Nacl. The bindings depend on the FFI, so that must be installed first.

From within Squeak:

(Installer repository: '')
   install: 'FFI-Pools';
   install: 'FFI-Kernel';
   install: 'FFI-Tests'.

(Installer repository: '')
   install: 'Crypto-Nacl'.

Most importantly, you will need a version of libsodium for your Smalltalk VM. Because most Squeak/Pharo VMs are 32-bit, you will need to get hold of a 32-bit libsodium. I’ve prebuilt some:

  • OS X, 32-bit VMs: libsodium.dylib.gz
    • decompress and put this in either
      • (note: no extension!) for the Squeak VM, or
      • (note: no extension!) for the Pharo VM.
  • Linux, 32-bit VMs:
    • decompress and place the file in the same directory as vm-display-X11 and friends.

Compiling your own libsodium (optional)

Compiling libsodium to work with Squeak/Pharo can be tricky:

  • On OS X, configure libsodium with ./configure CFLAGS=-m32 to build a 32-bit version.

  • On 32-bit Linux, ordinary ./configure works just fine, but I haven’t yet managed to get things working on 64-bit Linux. If anyone tries 64-bit and manages to get it to work, please let me know!

  • I haven’t tried it on Windows at all. Please let me know if you try this, and how it goes, either success or failure.

Running the tests

On the Smalltalk side, once you’ve loaded the .mcz, open a Test Runner and select the Crypto-Nacl tests. With Crypto-Nacl-tonyg.4, there should be 12 tests, and they should all pass if the shared library can be found in the right place.

You can also try it out in a Workspace: a printIt of Nacl sodiumVersionString will yield '0.3' or '0.4.1', depending on which version of libsodium you have.


How to Build Racket on Windows 7

Here are the steps I followed to build Racket from a git checkout on a fresh Windows 7 installation.

  • Installed Visual Studio Express 2008 (not 2010 or 2012). This is hard to find on Microsoft’s website; this stackoverflow question linked to the Visual Studio Express 2008 ISO directly.

  • Used Virtual Clone Drive from SlySoft to mount the ISO in order to install Visual Studio, since Microsoft thoughtfully omitted ISO-mounting functionality from the core operating system.

  • Installed MASM32 to get a working assembler, since Microsoft thoughtfully omitted an assembler from their core compiler suite. (I am informed that later and/or non-Express editions of Visual Studio do actually include an assembler.)

  • Added the following directories to the system path:

    • C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin, for cl.exe etc.
    • C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\IDE, for VCExpress.exe
    • C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\Tools, for vsvars32.bat
    • C:\masm32\bin, for ml.exe

    They do have to appear in that order, in particular with the MASM directory last, since it includes a link.exe that will otherwise conflict with the Visual Studio linker.

  • Installed Github for Windows and checked out Racket.

  • Opened a cmd.exe shell. NOTE: not a PowerShell instance. Somehow environment variables are propagated differently in PowerShell than in cmd.exe! Use only cmd.exe to build Racket.

  • Ran vsvars32.bat in that shell. This step is important, as otherwise the start of the xform stage in the build will fail to find stdio.h.

  • Navigated to my Racket checkout within that shell, and from there to src/worksp. Ran build.bat.

Following these steps will in principle lead to a fresh Racket.exe living in the root directory of your Racket checkout. There are many, many things that could go wrong however. Good luck!

What I do all day

Some people tell computers what to do. When they do, they have to talk about what the computer is to do each time it hears something new. They can’t easily talk about how the computer can learn which other computers are in the area. The computers can’t see each other, so usually computers find each other by talking to each other and listening to see if anyone talks back. Exactly how this should work is hard to explain to the computer.

My work is to try to find a way to let computers see each other. I want to let them see which other computers are near without the need for them to talk to each other about who is around and who is not all the time. That way we can tell computers how to do their job without needing to explain, over and over, the hard stuff about how they can learn which other computers are near.

We will be able to tell computers not only what to do when they hear something new, but also what to do when they see their friends come and go. It will be quicker and easier to tell computers how to do their jobs right, and harder to tell them wrong things to do.

I used this to make sure I only used allowed words. See also!

Crude benchmarks of NaCl and scrypt in the browser

As I just wrote, I’ve ported libraries for cryptography (js-nacl) and password-based key derivation (js-scrypt) to JavaScript.

Some browsers are faster at running these cryptographic routines than others. The results below are from casual (nay, unscientific!) speed measurements in the browsers I had handy on my machine.

The setup:

  • Chrome 26.0.1410.43
  • Safari 5.1.8 (6534.58.2)
  • Aurora 21.0a2 (2013-03-30)
  • Firefox 19.0.2
  • MacBook Air late 2010 (3,1), 1.6 GHz Core 2 Duo, 4 GB RAM, OS X 10.6.8

I had to exclude Firefox from the nacl tests, since it lacks window.crypto.getRandomValues.

Hashing strings/bytes with SHA-512

Here we see Safari has the edge. Aurora is oddly slow.

Hash operations (per sec)

Computing random nonces

This is a thin wrapper over window.crypto.getRandomValues. Safari wins hands-down here. I wonder how good the generated randomness is?

Random nonce generation (per sec)

Authenticated encryption using a shared key

These are Salsa20/Poly1305 authenticated encryptions using a precomputed shared key. Broadly speaking, boxing was quicker than unboxing. The browsers perform roughly equally here.

Secret-key operations (per sec)

Computing a shared key from public/private keys

These are operations whose runtime is dominated by the computation of a Curve25519 operation. In three of the four cases, the operation is used to compute a Diffie-Hellman shared key from a public key and a secret key; in the remaining case (crypto_box_keypair_from_seed) it is used to compute a public key from a secret key. Chrome is significantly faster than the other browsers here.

Shared-key computations (per sec)

scrypt Key Derivation Function

Here, Safari is the only browser that underperforms significantly. The other three all compute an scrypt-derived key in 2–4 seconds, using defaults suggested by the scrypt paper as being suitable for interactive login.

scrypt() calls per second
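
For readers who want a rough point of comparison outside the browser, here is a minimal sketch using Python's standard hashlib.scrypt rather than js-scrypt. It assumes the paper's interactive-login parameters are N = 2^14, r = 8, p = 1; the helper name derive_key, the salt and the password are all made up for illustration, and the timing it prints says nothing about the browser results above.

```python
import hashlib
import os
import time

def derive_key(password, salt, n=2**14, r=8, p=1, dklen=64):
    """Derive a key with scrypt.

    The defaults for n, r and p are assumed to match the scrypt paper's
    suggestion for interactive logins; they are not taken from js-scrypt.
    """
    return hashlib.scrypt(password, salt=salt, n=n, r=r, p=p,
                          maxmem=64 * 1024 * 1024, dklen=dklen)

salt = os.urandom(16)  # hypothetical per-user salt
start = time.perf_counter()
key = derive_key(b"correct horse battery staple", salt)
elapsed = time.perf_counter() - start
print(f"derived {len(key)} bytes in {elapsed:.3f} s")
```

The same password, salt and parameters always give the same key, so the salt and parameters have to be stored alongside whatever the key protects.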


scrypt is slow. Precompute Diffie-Hellman shared keys where you can.

NaCl and scrypt in the Browser (and node.js)

I’ve produced Emscripten-compiled variants of both NaCl, a cryptographic library, and scrypt, a password-based key derivation function.

  • js-nacl (documentation) includes support both for the browser and for node.js.

  • js-scrypt (documentation) supports just the browser, since there are plenty of existing, faster alternatives for scrypt for node.js.

I’m looking forward to exploring some of the possible applications of combining the two libraries!

One important missing piece is certificates; for this, dusting off SPKI might prove interesting.

Chord-style DHTs don't fairly balance their load without a fight


Figure 1: The ring structure of a Chord-style DHT, with nodes placed around the ring.

I’ve been experimenting with Distributed Hash Tables (DHTs) recently. A common base for DHT designs is Chord [1], where nodes place themselves randomly onto a ring-shaped keyspace; figure 1 shows an example.

One of the motivations of a DHT is to fairly distribute keyspace among participating nodes. If the distribution is unbalanced, some nodes will be doing an unfairly large amount of work, and will be storing an unfairly large amount of data on behalf of their peers.

This post investigates exactly how unfair it can get (TL;DR: very), and examines some options for improving the fairness of the span distribution (TL;DR: virtual nodes kind of work).

Background: Chord-style DHTs

Chord-style DHTs map keys to values. Keys are drawn from a k-bit key space. It’s common to imagine the whole key space, between 0 and 2^k − 1, as a ring, as illustrated in figure 1.

Each node in the DHT takes responsibility for some contiguous span of the key space. A newly-created node chooses a random number between 0 and 2^k − 1, and uses it as its node ID. When a node joins the DHT, one of the first things it does is find its predecessor node. Once it learns its predecessor’s node ID, it takes responsibility for keys in the span (pred, self] between its predecessor’s ID and its own ID [6].
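
The (pred, self] rule can be sketched in a few lines of Python. This is only an illustration of the span arithmetic with a global, sorted view of the ring, which no real node has; the function name responsible_node and the tiny 8-bit ring are invented for the example.

```python
import bisect

def responsible_node(node_ids, key):
    """Return the ID of the node responsible for `key`.

    A node owns the span (pred, self], so the owner of a key is the first
    node ID >= key, wrapping around to the smallest ID past the top of the
    ring. `node_ids` must be sorted in ascending order.
    """
    i = bisect.bisect_left(node_ids, key)
    return node_ids[i % len(node_ids)]

# Hypothetical 8-bit ring (k = 8) with four nodes.
ring = sorted([20, 90, 150, 230])
print(responsible_node(ring, 90))   # 90: a key equal to a node ID belongs to that node
print(responsible_node(ring, 91))   # 150: the key falls in (90, 150]
print(responsible_node(ring, 240))  # 20: wraps around past the top of the ring
```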

If the keys actually in use in a particular DHT are uniformly distributed at random, then the load on a node is directly proportional to the fraction of the ring’s keyspace it takes responsibility for. I’ll assume this below without further comment.

Below, I’ll use n to indicate the number of nodes in a particular DHT ring.

Expected size of a node’s region of responsibility

My intuitions for statistics are weak, so I wrote a short and ugly program to experimentally explore the distribution of responsibility interval sizes. My initial assumption (naive, I know) was that the sizes of intervals between adjacent nodes would be normally distributed.


Figure 2: Distribution of distances between adjacent nodes, from a simulated 10,000-node DHT. The mean interval length, 1/10,000 = 10^-4, is marked with a vertical line. The median (not marked) is 0.69×10^-4: less than the mean. The blue curve is the exponential distribution with β=10^-4.

I was wrong! The actual distribution, confirmed by experiment, is exponential¹ (figure 2).

In an exponential distribution, the majority of intervals (about 63%, or 1 − 1/e) are shorter than the expected length of 1/n. A few nodes are given a very unfair share of the workload, with an interval much longer than 1/n.

How bad can it get?

The unfairness can be as bad as a factor of O(log n). For example, in DHTs with 10^4 nodes, nodes have to be prepared to store between ten and fifteen times as many key/value pairs as the mean.
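
A simulation along these lines is easy to reproduce. The following is a fresh sketch in Python, not the short and ugly program mentioned earlier; it works on a unit-length ring with floating-point positions instead of a k-bit keyspace, and the seed and function name are arbitrary.

```python
import random
import statistics

def interval_lengths(n, rng):
    """Place n nodes uniformly at random on a unit ring; return the n gaps."""
    points = sorted(rng.random() for _ in range(n))
    gaps = [b - a for a, b in zip(points, points[1:])]
    gaps.append(1.0 - points[-1] + points[0])  # gap that wraps past the top
    return gaps

rng = random.Random(2013)  # fixed seed so runs are repeatable
n = 10_000
gaps = interval_lengths(n, rng)
mean = 1.0 / n

# Theory for an exponential distribution: median/mean = ln 2 (about 0.69),
# share of gaps below the mean = 1 - 1/e (about 0.63), and the largest of
# n gaps is on the order of ln n (about 9.2 here) times the mean.
print("median / mean:   ", statistics.median(gaps) / mean)
print("share below mean:", sum(g < mean for g in gaps) / n)
print("max / mean:      ", max(gaps) / mean)
```

The largest gap varies a lot from run to run, which is why a node has to be provisioned for noticeably more than the typical maximum.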

Improving uniformity, while keeping random node ID allocation

How can we avoid this unfairness?

One idea is to introduce many virtual nodes per physical node: to let a physical node take multiple spots on the ring, and hence multiple shorter keyspace intervals of responsibility. It turns out that this works, but at a cost.

If each physical node takes k points on the ring, we end up with kn intervals, each of expected length 1/(kn). The lengths of these shorter intervals are exponentially distributed. Each node takes k of them, so to figure out how much responsibility, and hence how much load, each node will have, we need the distribution describing the sum of the interval lengths.

An Erlang distribution gives us exactly what we want. From Wikipedia:

[The Erlang distribution] is the distribution of the sum of k independent exponential variables with mean μ.

We’ve already (figure 2) seen what happens when k=1. The following table (figure 3) shows the effect of increasing k: subdividing the ring into shorter intervals, and then taking several of them together to be the range of keys each node is responsible for.

[Plots for k = 2, 3, 4, 5, 10, 15, 20, 25, 50, 75, 100 and 125.]
Figure 3. Effects of increasing k. The green areas show results from simulation; the blue curve overlaid on each plot is the Erlang distribution with μ=1/(kn). The mean interval length, 1/10,000, is marked with a vertical line.

We see that the distribution of load across nodes gets fairer as k increases, becoming closer and closer to a normal distribution with on average 1/n of the ring allocated to each node.

The distribution gets narrower and narrower as k gets large. Once k is large enough, we can plausibly use a normal approximation to the Erlang distribution. This lets us estimate the standard deviation for large k to be 1/sqrt(k) of the expected allocation size.

That is, every time we double k, the distribution only gets sqrt(2) times tighter. Considering that the DHT’s ring maintenance protocol involves work proportional to k, it’s clear that beyond a certain point a high virtual-to-physical node ratio becomes prohibitive in terms of ring maintenance costs.
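
The 1/sqrt(k) estimate can be checked empirically. The sketch below is my own illustration on a unit-length ring, with a smaller n than the figures use so it runs quickly; load_per_node and the seed are invented names and values.

```python
import random
import statistics

def load_per_node(n, k, rng):
    """Share of a unit ring owned by each of n physical nodes with k points each."""
    # Tag every ring point with the physical node that placed it.
    points = sorted((rng.random(), owner) for owner in range(n) for _ in range(k))
    load = [0.0] * n
    prev = points[-1][0] - 1.0  # predecessor of the first point, wrapped around
    for position, owner in points:
        load[owner] += position - prev  # each point owns the span (pred, self]
        prev = position
    return load

rng = random.Random(42)
n = 1000
for k in (1, 4, 16, 64):
    load = load_per_node(n, k, rng)
    relative_sd = statistics.pstdev(load) * n  # spread as a fraction of the mean, 1/n
    print(f"k={k:3d}  relative sd={relative_sd:.3f}  1/sqrt(k)={k ** -0.5:.3f}")
```

Quadrupling k should roughly halve the relative spread, while the number of ring points each physical node has to maintain grows fourfold.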

Furthermore, lookups in Chord-style DHTs take O(log n) hops on average. Increasing the number of nodes in the ring by a factor of k makes lookup take O(log n + log k) hops.

Where do these distributions come from in the first place?

Imagine walking around the ring, clockwise from key 0. Because we choose node IDs uniformly at random, as we walk at a constant rate we have a constant probability of stumbling across a node with each step we take. This makes meeting a node on our walk a Poisson process. Wikipedia tells us that the exponential distribution “describes the time between events in a Poisson process,” and that summing multiple independent exponentially-distributed random variables gives us an Erlang distribution, so here we are.


The distribution of responsibility for segments of the ring among the nodes in a randomly-allocated Chord-style DHT is unfair, with some nodes receiving too large an interval of keyspace.

Interval lengths in such a ring follow an exponential distribution. Increasing the number of virtual nodes per physical node leads to load allocation following an Erlang distribution, which improves the situation, but only at a cost of increased ring-maintenance overhead and more hops in key lookups.

OK, so it turns out other people have already thought all this through

I should have Googled properly for this stuff before I started thinking about it.

A couple of easily-findable papers [2,3] mention the exponential distribution of interval sizes. Some (e.g. [3]) even mention the gamma distribution, which is the generalization of the Erlang distribution to real k. There are some nice measurements of actual DHT load closely following an exponential distribution in [4], which is a paper on improving load balancing in Chord-style DHTs. Karger and Ruhl [5] report on a scheme for maintaining a bunch of virtual nodes per physical node, but avoiding the ring-maintenance overhead of doing so by only activating one at a time.

Cassandra and its virtual nodes

Cassandra, a NoSQL database using a DHT for scalability, defaults to 256 virtual nodes (vnodes) per physical node in recent releases. The available information (e.g. this and this) suggests this is for quicker recovery on node failure, and not primarily for better load-balancing properties. From this document I gather that this is because Cassandra never used to randomly allocate node IDs: you used to choose them up-front. New releases do choose node IDs randomly though. Here, they report some experience with the new design:

However, variations of as much as 7% have been reported on small clusters when using the num_tokens default of 256.

This isn’t so surprising, given that our Erlang distribution tells us that choosing k=256 should yield a standard deviation of roughly 1/16 of the expected interval size.


[1] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, “Chord: A scalable peer-to-peer lookup service for internet applications,” in ACM SIGCOMM, 2001. PDF available online.

[2] H. Niedermayer, S. Rieche, K. Wehrle, and G. Carle, “On the Distribution of Nodes in Distributed Hash Tables,” in Proc. Workshop Peer-to-Peer-Systems and Applications (KiVS), 2005. PDF available online.

[3] N. Xiaowen, L. Xianliang, Z. Xu, T. Hui, and L. Lin, “Load distributions of some classic DHTs,” J. Systems Engineering and Electronics 20(2), pp. 400–404, 2009. PDF available online.

[4] S. Rieche, L. Petrak, and K. Wehrle. A Thermal-Dissipation-based Approach for Balancing Data Load in Distributed Hash Tables. In Proc. of IEEE Conference on Local Computer Networks. (LCN 2004), Tampa, FL, USA, November 2004. PDF available online.

[5] D. Karger and M. Ruhl, “Simple efficient load balancing algorithms for peer-to-peer systems,” Proc. Symp. Parallelism in Algorithms and Architectures (SPAA), 2004. PDF available online.

[6] B. Mejías Candia, “Beernet: A Relaxed Approach to the Design of Scalable Systems with Self-Managing Behaviour and Transactional Robust Storage,” PhD thesis, École Polytechnique de Louvain, 2010. PDF available online.

  1. Actually, it’s a geometric distribution because it’s discrete, but the size of the keyspace is so large that working with a continuous approximation makes sense.

A calling convention for ARM that supports proper tail-calls efficiently

Because proper tail calls are necessary for object-oriented languages, we can’t quite use the standard calling conventions unmodified when compiling OO languages efficiently to ARM architectures.

Here’s one approach to a non-standard, efficient, tail-call-supporting calling convention that I’ve been exploring recently.

The big change from the standard is that we do not move the stack pointer down over outbound arguments when we make a call.

Instead, the callee moves the stack pointer as they see fit. The reason for this is so that the callee can tail-call someone else without having to do any hairy adjusting of the frame, and so that the original caller doesn’t have to know anything about what’s left to clean up when they receive control: all the clean-up has already been completed.

This bears stating again: just after return from a subroutine, all clean-up has already been completed.

In the official standard, the stack space used to communicate arguments to a callee is owned by the caller. In this modified convention, that space is owned by the callee as soon as control is transferred.

Other aspects of the convention are similar to the AAPCS standard:

  • keep the stack Full Descending, just like the standard.
  • ensure it is 8-byte aligned at all times, just like (a slight restriction of) the standard.
  • make outbound arguments leftmost-low in memory, that is, “pushed from right to left”. This makes the convention compatible with naive C struct overlaying of memory.
  • furthermore, ensure argument 0 in memory is also 8-byte aligned.

Details of the stack layout

Consider compiling a single subroutine, either a leaf or a non-leaf routine. We need to allocate stack space to incoming arguments, to saved temporaries, to outbound arguments, and to padding so we maintain the correct stack alignment. Let

  • Ni = inward-arg-count, the number of arguments the routine expects
  • No = most-tail-args, the largest number of outbound tail-call arguments the routine produces
  • Nt = inward-temp-count, the number of temps the routine requires
  • Na = outward-arg-count, the number of arguments supplied in a particular call the routine makes to some other routine

Upon entry to the routine, where Ni=5, No=7, Nt=3, Na=3, we have the following stack layout. Recall that stacks are full-descending.

(low)                                                               (high)
    | outbound  |   |   temps   |   |shuffle|      inbound      |
    | 0 | 1 | 2 |---| 0 | 1 | 2 |---| - | - | 0 | 1 | 2 | 3 | 4 |---|
                    ^                                               ^
                  sp for non-leaf                                sp for leaf

I’ve marked two interesting locations in the stack: the position of the stack pointer for leaf routines, and the position of the stack pointer for non-leaf routines, which need some space of their own to store their internal state at times when they delegate to another routine. Leaf routines simply leave the stack pointer in place as they start execution; non-leaf routines adjust the stack pointer themselves as control arrives from their caller.

Note that the first four arguments are transferred in registers, but that stack slots still need to be reserved for them. Note also the padding after the outbound arguments, the temps, and the inbound/shuffle-space.

The shuffle-space is used to move values around during preparation for a tail call whenever the routine needs to supply more arguments to the tail-called routine than it received in turn from its caller.

The extra shuffle slots are only required if there’s no room in the inbound slots plus padding. For example, if Ni=5 and No=6, then since we expect the inbound arguments to have one slot of padding, that slot can be used as shuffle space.

Addressing calculations

Leaf procedures do not move the stack pointer on entry. Nonleaf procedures do move the stack pointer on entry. This means we have different addressing calculations depending on whether we’re a leaf or nonleaf procedure.

  • Pad8(x) = x rounded up to the nearest multiple of 8.
  • sp_delta = Pad8(No * 4) + Pad8(Nt * 4), the distance SP might move on entry and exit.

Leaf procedures, where the stack pointer does not move on entry to the routine:

inward(n) = rn, if n < 4
          | sp - Pad8(Ni * 4) + (n * 4)
temp(n) = sp - sp_delta + (n * 4)
outward(n) (tail calls only) = rn, if n < 4
                             | sp - Pad8(Na * 4) + (n * 4)

Nonleaf procedures, where the stack pointer moves down by sp_delta bytes on entry to the routine:

inward(n) = rn, if n < 4
          | sp + sp_delta - Pad8(Ni * 4) + (n * 4)
temp(n) = sp + (n * 4)
outward(n) (non-tail calls) = rn, if n < 4
                            | sp - Pad8(Na * 4) + (n * 4)
outward(n) (tail calls) = rn, if n < 4
                        | sp + sp_delta - Pad8(Na * 4) + (n * 4)
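
To sanity-check the formulas against the stack diagram, here is a small Python sketch that evaluates them. The function names and the string format of the results are my own; register numbers and byte offsets follow the rules above, with offsets taken relative to the stack pointer as it stands inside the routine.

```python
def pad8(x):
    """Round x up to the nearest multiple of 8."""
    return (x + 7) & ~7

def sp_delta(no, nt):
    """Bytes a non-leaf routine moves SP down on entry."""
    return pad8(no * 4) + pad8(nt * 4)

def inward(n, ni, no, nt, leaf):
    """Location of inbound argument n."""
    if n < 4:
        return f"r{n}"
    base = 0 if leaf else sp_delta(no, nt)
    return f"[sp, #{base - pad8(ni * 4) + n * 4}]"

def temp(n, no, nt, leaf):
    """Location of temporary n."""
    offset = n * 4 - (sp_delta(no, nt) if leaf else 0)
    return f"[sp, #{offset}]"

def outward(n, na, no, nt, leaf, tail):
    """Location of outbound argument n of a call passing na arguments."""
    if leaf and not tail:
        raise ValueError("a leaf routine makes no non-tail calls")
    if n < 4:
        return f"r{n}"
    # Only tail calls from a non-leaf routine address relative to the entry-time SP.
    base = sp_delta(no, nt) if (tail and not leaf) else 0
    return f"[sp, #{base - pad8(na * 4) + n * 4}]"

# The example routine from the diagram: Ni=5, No=7, Nt=3.
print(inward(4, ni=5, no=7, nt=3, leaf=True))    # [sp, #-8]
print(inward(4, ni=5, no=7, nt=3, leaf=False))   # [sp, #40]
print(temp(2, no=7, nt=3, leaf=True))            # [sp, #-40]
print(outward(6, na=7, no=7, nt=3, leaf=False, tail=True))  # [sp, #40]
```

In the non-leaf case, tail-call argument 6 and inbound argument 4 land on the same slot, which is the overlap the shuffle space exists to manage.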


This convention doesn’t easily support varargs. One option would be to sacrifice simple C struct overlaying of the inbound argument stack area, flipping arguments so they are pushed from left to right instead of from right to left. That way, the first argument is always at a known location.

Another option would be to use an argument count at the end of the argument list in the varargs case. This requires both the caller and callee to be aware that a varargs-specific convention is being used.

Of course, varargs may not even be required: instead, a vector could be passed in as a normal argument. Whether this makes sense or not depends on the language being compiled.

Successors to "Enterprise Integration Patterns"?

“Enterprise Integration Patterns”, by Gregor Hohpe and Bobby Woolf, has been a classic go-to volume for a lot of people working with distributed systems over the years.

It was published back in 2003, though, before things like AMQP, ZeroMQ, WebSockets and Twitter.

Is there anything that could be considered an update on the book? Something that covers modern integration scenarios. Perhaps something that touches not only on the newer messaging technologies but also on NoSQL, improvements to the browser environment, and so on.

What are people reading to get a common vocabulary for all this stuff and to get their heads around how the pieces fit together?

Mac OS X gripes

I’ve been using a Mac as my personal computer since late 2003. It’s been fine for all of that time, but recent releases of the software are starting to make me want to go back to Debian or Ubuntu, warts and all.

  • Every time I restart my machine, it forgets my trackpad settings.

  • About one time in ten when I wake my machine from sleep, it shows the beachball forever and never comes back.

Combine that with the lack of a compiler shipped with the machine by default, and Debian is starting to look downright attractive again.

'bad_vertex' errors while developing and testing RabbitMQ plugins

Today I have been doing maintenance work on an old RabbitMQ plugin of mine. Part of this work was updating its Makefiles to work with the latest RabbitMQ build system.

The problem and symptom

After getting it to compile, and trying to run it, I started seeing errors like this:

Error: {'EXIT',

Not just from rabbitmq-plugins, but also a similar error when starting the RabbitMQ server itself.

The reason turned out to be simple: I had symbolically linked the rabbitmq-mochiweb and mochiweb-wrapper directories into my plugins directory, as per the manual, but what the manual didn’t say was that this works for all plugins except the -wrapper plugins (and the Erlang “client” plugin, rabbitmq-erlang-client a.k.a. amqp_client.ez).

The solution

Symlink all the plugins except the wrapper plugins and amqp_client.ez.

The wrapper plugin *.ez files and amqp_client.ez need to be present in the plugins directory itself. So instead of the instructions given, try the following steps:

$ mkdir -p rabbitmq-server/plugins
$ cd rabbitmq-server/plugins
$ ln -s ../../rabbitmq-mochiweb
$ cp rabbitmq-mochiweb/dist/mochiweb*ez .
$ cp rabbitmq-mochiweb/dist/webmachine*ez .
$ ../scripts/rabbitmq-plugins enable rabbitmq_mochiweb

A working configuration for me has the following contents of the plugins directory:

total 816
-rw-r--r--   1 tonyg  staff  260123 Sep 17 18:36 mochiweb-2.3.1-rmq0.0.0-gitd541e9a.ez
lrwxr-xr-x   1 tonyg  staff      23 Sep 17 17:59 rabbitmq-mochiweb -> ../../rabbitmq-mochiweb
-rw-r--r--   1 tonyg  staff  149142 Sep 17 18:37 webmachine-1.9.1-rmq0.0.0-git52e62bc.ez