Minimart and Network Calculus at (fourth RacketCon)

Here’s me giving a talk about my research on Minimart and Network Calculus at (fourth RacketCon) on the 20th of September, 2014.

How to Run the RabbitMQ Tests

RabbitMQ includes a suite of functional tests, as well as unit tests for each of its components. It’s not immediately obvious how to run the functional test suite, but fortunately, it’s straightforward.

You will need:

  • a Java JDK
  • Ant
  • Erlang
  • git
  • and I guess a Unix machine of some description; I don’t imagine running these tests on Windows will work well

Running the complete test suite takes about six minutes on my eight-core, 32GB i7 Linux box. Most of this time is spent sleeping, to ensure particular interleavings required by particular test cases.

Clone the repositories
$ git clone git://
$ cd rabbitmq-public-umbrella
$ git clone git://
$ git clone git://
$ git clone git://
$ git clone git://
Build and start the server in one window
$ cd rabbitmq-server
$ make run
Run the tests in another window

The tests are written in Java and come along with the RabbitMQ Java client source code.

Be sure to wait until the server has fully initialised itself before starting ant.

$ cd rabbitmq-java-client
$ ant test-suite

Things should tick along nicely for a few minutes at this point. From time to time, you’ll see the server print a new banner to its console: the tests restart the server occasionally as part of their normal operation.

Check the results

The tests leave their output in rabbitmq-java-client/build/TEST-* files. Each test suite (FunctionalTests, ClientTests, ServerTests, HATests) produces both a plain-text and an XML file summarising the results of the run.
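Those XML files follow the usual Ant/JUnit report shape, so tallying a run can be scripted. A sketch (the `<testsuite>` attribute names here are the standard Ant/JUnit ones, assumed rather than verified against RabbitMQ's exact output):

```python
import xml.etree.ElementTree as ET

def summarize(xml_text):
    """Tally results from one Ant/JUnit-style XML report (TEST-*.xml)."""
    suite = ET.fromstring(xml_text)
    return {key: int(suite.get(key, 0)) for key in ("tests", "failures", "errors")}

# Tiny synthetic report in the usual Ant <testsuite> shape (illustrative only):
sample = '<testsuite name="FunctionalTests" tests="3" failures="1" errors="0"/>'
print(summarize(sample))  # {'tests': 3, 'failures': 1, 'errors': 0}
```

Pointing the same function at each real `TEST-*.xml` file and summing the dictionaries gives a quick pass/fail overview of the whole run.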

How best to do PGP/GPG key management with Git?

I have parts of my home directory stored in git. In particular, I have my .gnupg/pubring.gpg and .gnupg/trustdb.gpg files in git. This is fine so long as absolutely no branching ever takes place.

My question to you, O lazyweb, is: does a tool exist for merging pubring.gpg etc. in a sensible fashion?

If the answer is “no”, then does a tool exist for exploding the contents of pubring.gpg etc into a directory full of tiny text files which are amenable to standard 3-way-merge?

New benchmarks of NaCl and scrypt in the browser

Back in March, I (crudely) measured the performance of js-nacl and js-scrypt running in the Browser.

Since then, emscripten has improved its code generation significantly, so I’ve rerun my tests. I’ve also thrown in node.js for comparison.¹

The takeaway is: asm.js makes a big difference to NaCl and scrypt performance - on Firefox, that is. Firefox is between 2× and 8× faster.² Other browsers have benefited from the improvements, but not as much.

The setup:

  • Firefox/23.0
  • Chrome/28.0.1500.95
  • Safari/534.59.8
  • node/v0.10.15
  • Macbook Air late 2010 (3,1), 1.6 GHz Core 2 Duo, 4 GB RAM, OS X 10.6.8

I’m running emscripten at git revision b1eaf55e (sources), of 8 August 2013.

(The benchmarks I ran in March were run with rev 4e09482e (sources) of 16 Jan 2013.)

What has changed since the last measurements?

Emscripten’s support for asm.js code generation is much better, and I am also now able to turn on -O2 optimization without things breaking.

On the minus side, the previous builds included a severe memory leak (!) because by default Emscripten includes a stub malloc() and free() implementation that never releases any allocated memory. The current builds include dlmalloc(), and so no longer leak memory, but run ever so slightly slower by comparison to using the stub allocator.

Safari seems to have severe problems with the current builds. I’m unsure where the bug lies (probably emscripten?), but many calls to crypto_box/crypto_box_open and scrypt() yield incorrect results. There are missing entries in the charts below because of this. (No sense in measuring something that isn’t correct.)

Since the previous round, Firefox has gained support for window.crypto.getRandomValues. Hooray!

Hashing (short) strings/bytes with SHA-512

Firefox handily dominates here, making the others look roughly equivalent.

Hash operations (per sec)

(Approximate speedups since January: Chrome = 1.2×; Firefox = 8.3×; Safari = 1.4×; node = 0.95×. Chart.)

Computing a shared key from public/private keys

These are operations whose runtime is dominated by the computation of a Curve25519 operation. In three of the four cases, the operation is used to compute a Diffie-Hellman shared key from a public key and a secret key; in the remaining case (crypto_box_keypair_from_seed) it is used to compute a public key from a secret key. Firefox again dominates, but Chrome is not far behind.

This is one of the areas where Safari yields incorrect results, leading to a missing data point. I’m not yet sure what the cause is.

Shared-key computations (per sec)

(Approximate speedups since January: Chrome = 2.5×; Firefox = 4.8×; Safari = 2×; node = 1×. Chart.)

Computing random nonces

This is a thin wrapper over window.crypto.getRandomValues, or the node.js equivalent, and so has not benefited from the emscripten improvements. I’m including it just to give a feel for how fast randomness-generation is.

Safari wins hands-down here. I wonder how good the generated randomness is?

Random nonce generation (per sec)

(Approximate speedups since January: Chrome = 0.96×; Firefox = 1×; Safari = 1×; node = 1×. Chart.)

Authenticated encryption using a shared key

These are Salsa20/Poly1305 authenticated encryptions using a precomputed shared key. Broadly speaking, boxing was quicker than unboxing. Firefox again dominates.

This is another of the areas where Safari yields incorrect results. I’m not yet sure why.

Secret-key operations (per sec)

(Approximate speedups since January: Chrome = 1.5×; Firefox = 2.3×; Safari = 1.3×; node = 1.2×. Chart.)

Producing and validating signatures

These operations compute an elliptic-curve operation, but use the result to produce a digital signature instead of an authenticated/encrypted box. Signature generation is much faster than signature validation here. As for the other elliptic-curve-heavy operations, Firefox is fastest, but Chrome is not far behind.

Signature operations (per sec)

(Approximate speedups since January: Chrome = 1.8×; Firefox = 4.7×; Safari = 3.8×; node = 0.82×. Chart.)

scrypt Key Derivation Function

Here, Safari not only underperforms significantly but computes incorrect results. As above, I’m not sure why.

Firefox is about twice as fast as previously at producing scrypt()-derived keys. Both Firefox and Chrome are usably fast.

scrypt() calls per second

(Approximate speedups since January: Firefox improved by around 2×; the others were roughly unchanged. Chart.)


scrypt is still slow. Safari has problems with this code, or vice versa. Precompute Diffie-Hellman shared keys where you can. Emscripten + asm.js = very cool!

  1. Of course, it’s a bit silly to include node.js here, since it can simply link against libsodium and get native-speed, optimized routines. Perhaps a Firefox extension could include a native XPCOM component offering a similar speed boost. 

  2. The benefit is not quite as much as I claimed it was based on eyeballing the numbers. 

NaCl/libsodium binding for Pharo and Squeak

I’ve just written Pharo/Squeak bindings to libsodium, which is a portable shared-library version of the NaCl cryptography library. A good description of the motivation of the library is this PDF.

Installing the software

To use the bindings, you will need to install the Monticello package Crypto-Nacl. The bindings depend on the FFI, so that must be installed as well.

From within Squeak:

(Installer repository: '')
   install: 'FFI-Pools';
   install: 'FFI-Kernel';
   install: 'FFI-Tests'.

(Installer repository: '')
   install: 'Crypto-Nacl'.

Most importantly, you will need a version of libsodium for your Smalltalk VM. Because most Squeak/Pharo VMs are 32-bit, you will need to get hold of a 32-bit libsodium. I’ve prebuilt some:

  • OS X, 32-bit VMs: libsodium.dylib.gz
    • decompress and put this in either
      • (note: no extension!) for the Squeak VM, or
      • (note: no extension!) for the Pharo VM.
  • Linux, 32-bit VMs:
    • decompress and place the file in the same directory as vm-display-X11 and friends.

Compiling your own libsodium (optional)

Compiling libsodium to work with Squeak/Pharo can be tricky:

  • On OS X, configure libsodium with ./configure CFLAGS=-m32 to build a 32-bit version.

  • On 32-bit linux, ordinary ./configure works just fine, but I haven’t yet managed to get things working on a 64-bit linux. If anyone tries 64-bit and manages to get it to work, please let me know!

  • I haven’t tried it on Windows at all. Please let me know if you try this, and how it goes, either success or failure.

Running the tests

On the Smalltalk side, once you’ve loaded the .mcz, open a Test Runner and select the Crypto-Nacl tests. With Crypto-Nacl-tonyg.4, there should be 12 tests, and they should all pass if the shared library can be found in the right place.

You can also try it out in a Workspace: a printIt of Nacl sodiumVersionString will yield '0.3' or '0.4.1', depending on which version of libsodium you have.


How to Build Racket on Windows 7

Here are the steps I followed to build Racket from a git checkout on a fresh Windows 7 installation.

  • Installed Visual Studio Express 2008 (not 2010 or 2012). This is hard to find on Microsoft’s website; this stackoverflow question linked to the Visual Studio Express 2008 ISO directly.

  • Used Virtual Clone Drive from SlySoft to mount the ISO in order to install Visual Studio, since Microsoft thoughtfully omitted ISO-mounting functionality from the core operating system.

  • Installed MASM32 to get a working assembler, since Microsoft thoughtfully omitted an assembler from their core compiler suite. (I am informed that later and/or non-Express editions of Visual Studio do actually include an assembler.)

  • Added the following directories to the system path:

    • C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin, for cl.exe etc.
    • C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\IDE, for VCExpress.exe
    • C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\Tools, for vsvars32.bat
    • C:\masm32\bin, for ml.exe

    They do have to appear in that order, in particular with the MASM directory last, since it includes a link.exe that will otherwise conflict with the Visual Studio linker.

  • Installed Github for Windows and checked out Racket.

  • Opened a cmd.exe shell. NOTE: not a PowerShell instance. Environment variables are propagated differently in PowerShell than in cmd.exe, so only use cmd.exe to build Racket.

  • Ran vsvars32.bat in that shell. This step is important, as otherwise the start of the xform stage in the build will fail to find stdio.h.

  • Navigated to my Racket checkout within that shell, and from there to src/worksp. Ran build.bat.

Following these steps will in principle lead to a fresh Racket.exe living in the root directory of your Racket checkout. There are many, many things that could go wrong along the way, however. Good luck!

What I do all day

Some people tell computers what to do. When they do, they have to talk about what the computer is to do each time it hears something new. They can’t easily talk about how the computer can learn which other computers are in the area. The computers can’t see each other, so usually computers find each other by talking to each other and listening to see if anyone talks back. Exactly how this should work is hard to explain to the computer.

My work is to try to find a way to let computers see each other. I want to let them see which other computers are near without the need for them to talk to each other about who is around and who is not all the time. That way we can tell computers how to do their job without needing to explain, over and over, the hard stuff about how they can learn which other computers are near.

We will be able to tell computers not only what to do when they hear something new, but also what to do when they see their friends come and go. It will be quicker and easier to tell computers how to do their jobs right, and harder to tell them wrong things to do.

I used this to make sure I only used allowed words. See also!

Crude benchmarks of NaCl and scrypt in the browser

As I just wrote, I’ve ported libraries for cryptography (js-nacl) and password-based key derivation (js-scrypt) to Javascript.

Some browsers are faster at running these cryptographic routines than others. The results below are from casual (nay, unscientific!) speed measurements in the browsers I had handy on my machine.
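The measurement style was simply “how many calls complete in a fixed wall-clock window”. A minimal sketch of that kind of crude loop, here in Python with SHA-512 standing in for the JavaScript routines (illustrative of the method only, not the actual harness):

```python
import hashlib
import time

def ops_per_sec(fn, seconds=0.5):
    """Crude throughput measurement: call fn repeatedly for ~seconds."""
    deadline = time.perf_counter() + seconds
    count = 0
    while time.perf_counter() < deadline:
        fn()
        count += 1
    return count / seconds

msg = b"hello, benchmarking world!"
print(round(ops_per_sec(lambda: hashlib.sha512(msg).digest())))
```

Such a loop measures the steady-state rate only roughly: it includes loop overhead and ignores warm-up effects (JIT compilation matters a lot in the browser case), which is part of why I call these measurements casual.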

The setup:

  • Chrome 26.0.1410.43
  • Safari 5.1.8 (6534.58.2)
  • Aurora 21.0a2 (2013-03-30)
  • Firefox 19.0.2
  • Macbook Air late 2010 (3,1), 1.6 GHz Core 2 Duo, 4 GB RAM, OS X 10.6.8

I had to exclude Firefox from the nacl tests, since it lacks window.crypto.getRandomValues.

Hashing strings/bytes with SHA-512

Here we see Safari has the edge. Aurora is oddly slow.

Hash operations (per sec)

Computing random nonces

This is a thin wrapper over window.crypto.getRandomValues. Safari wins hands-down here. I wonder how good the generated randomness is?

Random nonce generation (per sec)

Authenticated encryption using a shared key

These are Salsa20/Poly1305 authenticated encryptions using a precomputed shared key. Broadly speaking, boxing was quicker than unboxing. The browsers perform roughly equally here.

Secret-key operations (per sec)

Computing a shared key from public/private keys

These are operations whose runtime is dominated by the computation of a Curve25519 operation. In three of the four cases, the operation is used to compute a Diffie-Hellman shared key from a public key and a secret key; in the remaining case (crypto_box_keypair_from_seed) it is used to compute a public key from a secret key. Chrome is significantly faster than the other browsers here.

Shared-key computations (per sec)

scrypt Key Derivation Function

Here, Safari is the only browser that underperforms significantly. The other three all compute an scrypt-derived key in 2–4 seconds, using defaults suggested by the scrypt paper as being suitable for interactive login.

scrypt() calls per second
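For reference, the interactive-login parameters suggested by the scrypt paper are N=2¹⁴, r=8, p=1. A rough equivalent of one such derivation, using Python’s standard-library hashlib.scrypt rather than js-scrypt (the passphrase, salt, and output length below are illustrative; maxmem is raised because these parameters need about 16 MiB of working memory):

```python
import hashlib

# Interactive-login parameters suggested by the scrypt paper:
# N = 2^14, r = 8, p = 1 (roughly 16 MiB of working memory).
key = hashlib.scrypt(
    b"correct horse battery staple",  # illustrative passphrase
    salt=b"example-salt",             # in real use: random and per-user
    n=2**14, r=8, p=1,
    maxmem=2**26,                     # allow the ~16 MiB these parameters need
    dklen=32,
)
print(key.hex())
```

The same (password, salt, N, r, p) always yields the same key, which is the point of a key derivation function: only the salt needs to be stored alongside the derived key.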


scrypt is slow. Precompute Diffie-Hellman shared keys where you can.

NaCl and scrypt in the Browser (and node.js)

I’ve produced Emscripten-compiled variants of both NaCl, a cryptographic library, and scrypt, a password-based key derivation function.

  • js-nacl (documentation) includes support both for the browser and for node.js.

  • js-scrypt (documentation) supports just the browser, since there are plenty of existing, faster alternatives for scrypt for node.js.

I’m looking forward to exploring some of the possible applications of combining the two libraries!

One important missing piece is certificates; for this, dusting off SPKI might prove interesting.

Chord-style DHTs don't fairly balance their load without a fight

Chord-style DHT ring

Figure 1: The ring structure of a Chord-style DHT, with nodes placed around the ring.

I’ve been experimenting with Distributed Hash Tables (DHTs) recently. A common base for DHT designs is Chord [1], where nodes place themselves randomly onto a ring-shaped keyspace; figure 1 shows an example.

One of the motivations of a DHT is to fairly distribute keyspace among participating nodes. If the distribution is unbalanced, some nodes will be doing an unfairly large amount of work, and will be storing an unfairly large amount of data on behalf of their peers.

This post investigates exactly how unfair it can get (TL;DR: very), and examines some options for improving the fairness of the span distribution (TL;DR: virtual nodes kind of work).

Background: Chord-style DHTs

Chord-style DHTs map keys to values. Keys are drawn from a k-bit key space. It’s common to imagine the whole key space, between 0 and 2^k-1, as a ring, as illustrated in figure 1.

Each node in the DHT takes responsibility for some contiguous span of the key space. A newly-created node chooses a random number between 0 and 2^k-1, and uses it as its node ID. When a node joins the DHT, one of the first things it does is find its predecessor node. Once it learns its predecessor’s node ID, it takes responsibility for keys in the span (pred, self] between its predecessor’s ID and its own ID [6].
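In code, the (pred, self] rule is just a successor search in a sorted list of node IDs. A toy sketch (hashing names onto the ring here is only a stand-in for the random ID choice; names and key-space size are made up):

```python
import bisect
import hashlib

K = 2**32  # toy key space: k = 32 bits

def node_id(name):
    # Stand-in for choosing a random ID: hash the node's name onto the ring.
    return int.from_bytes(hashlib.sha256(name.encode()).digest(), "big") % K

def responsible_node(key, ids):
    """A node owns (pred, self]: the first node ID >= key, wrapping past 0."""
    i = bisect.bisect_left(ids, key % K)
    return ids[i % len(ids)]

ids = sorted(node_id(n) for n in ("alpha", "beta", "gamma", "delta"))
print(responsible_node(ids[0], ids) == ids[0])       # a node owns its own ID
print(responsible_node(ids[-1] + 1, ids) == ids[0])  # keys past the last node wrap
```

A real Chord node of course doesn’t hold the full sorted ID list; it routes towards the successor via its finger table, but the ownership rule it implements is the one above.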

If the keys actually in use in a particular DHT are uniformly distributed at random, then the load on a node is directly proportional to the fraction of the ring’s keyspace it takes responsibility for. I’ll assume this below without further comment.

Below, I’ll use n to indicate the number of nodes in a particular DHT ring.

Expected size of a node’s region of responsibility

My intuitions for statistics are weak, so I wrote a short and ugly program to experimentally explore the distribution of responsibility interval sizes. My initial assumption (naive, I know) was that the sizes of intervals between adjacent nodes would be normally distributed.

Responsibility interval distribution

Figure 2: Distribution of distances between adjacent nodes, from a simulated 10,000-node DHT. The mean interval length, 1/10,000 = 10^-4, is marked with a vertical line. The median (not marked) is 0.69×10^-4: less than the mean. The blue curve is the exponential distribution with β=10^-4.

I was wrong! The actual distribution, confirmed by experiment, is exponential¹ (figure 2).

In an exponential distribution, the majority of intervals are shorter than the expected length of 1/n. A few nodes are given a very unfair share of the workload, with an interval much longer than 1/n.
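Here’s the kind of experiment I mean (a reconstruction in Python, not my original ugly program): scatter n IDs uniformly on a unit ring and measure the gaps between neighbours. The mean gap is exactly 1/n, while the median lands near ln 2/n ≈ 0.69/n, matching figure 2:

```python
import random

def interval_lengths(n, seed=42):
    """Place n node IDs uniformly at random on a unit ring and return
    the lengths of the arcs between adjacent nodes."""
    rng = random.Random(seed)
    ids = sorted(rng.random() for _ in range(n))
    # Arc from each node to its successor; the last one wraps around.
    return [(ids[(i + 1) % n] - ids[i]) % 1.0 for i in range(n)]

n = 10_000
lengths = sorted(interval_lengths(n))
mean = sum(lengths) / n   # exactly 1/n: the arcs tile the whole ring
median = lengths[n // 2]
print(median / mean)      # ≈ ln 2 ≈ 0.69, as in figure 2
```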

How bad can it get?

The unfairness can be as bad as a factor of O(log n). For example, in DHTs with 10^4 nodes, nodes have to be prepared to store between ten and fifteen times as many key/value pairs as the mean.
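The O(log n) factor comes from the expected maximum of n exponential gaps being roughly H_n/n ≈ ln(n)/n, and ln 10^4 ≈ 9.2. A quick (hypothetical) check of the largest interval relative to the mean, across a few simulated rings:

```python
import random

def max_over_mean(n, seed):
    """Largest arc between adjacent ring nodes, as a multiple of the mean arc."""
    rng = random.Random(seed)
    ids = sorted(rng.random() for _ in range(n))
    lengths = [(ids[(i + 1) % n] - ids[i]) % 1.0 for i in range(n)]
    return max(lengths) * n  # the mean length is exactly 1/n

ratios = [max_over_mean(10_000, seed) for seed in range(20)]
print(min(ratios), max(ratios))  # clusters around ln(10_000) ≈ 9.2, with a long upper tail
```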

Improving uniformity, while keeping random node ID allocation

How can we avoid this unfairness?

One idea is to introduce many virtual nodes per physical node: to let a physical node take multiple spots on the ring, and hence multiple shorter keyspace intervals of responsibility. It turns out that this works, but at a cost.

If each physical node takes k points on the ring, we end up with kn intervals, each of expected length 1/kn. The lengths of these shorter intervals are exponentially distributed. Each node takes k of them, so to figure out how much responsibility, and hence how much load, each node will have, we need the distribution describing the sum of the interval lengths.

An Erlang distribution gives us exactly what we want. From Wikipedia:

[The Erlang distribution] is the distribution of the sum of k independent exponential variables with mean μ.

We’ve already (figure 2) seen what happens when k=1. The following table (figure 3) shows the effect of increasing k: subdividing the ring into shorter intervals, and then taking several of them together to be the range of keys each node is responsible for.

Figure 3. Effects of increasing k, for k = 2, 3, 4, 5; 10, 15, 20, 25; and 50, 75, 100, 125. The green areas show results from simulation; the blue curve overlaid on each plot is the Erlang distribution with μ=1/kn. The mean interval length, 1/10,000, is marked with a vertical line.

We see that the distribution of load across nodes gets fairer as k increases, becoming closer and closer to a normal distribution with on average 1/n of the ring allocated to each node.

The distribution gets narrower and narrower as k gets large. Once k is large enough, we can plausibly use a normal approximation to the Erlang distribution. This lets us estimate the standard deviation for large k to be 1/sqrt(k) of the expected allocation size.
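This is easy to check by sampling the Erlang distribution directly: draw each node’s share as a sum of k exponential interval lengths (a sketch; it samples interval lengths rather than simulating an actual ring):

```python
import math
import random
import statistics

def node_loads(n, k, seed=1):
    """Each physical node's keyspace share with k virtual nodes: the sum
    of k interval lengths, each Exponential with mean 1/(k*n)."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(k * n) for _ in range(k)) for _ in range(n)]

n = 10_000
for k in (1, 16, 256):
    loads = node_loads(n, k)
    spread = statistics.pstdev(loads) / statistics.fmean(loads)
    print(k, round(spread, 3), round(1 / math.sqrt(k), 3))  # measured vs. predicted 1/sqrt(k)
```

The measured relative spread tracks the predicted 1/sqrt(k) closely: quadrupling k only halves the spread, which is the diminishing return described above.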

That is, every time we double k, the distribution only gets sqrt(2) times tighter. Considering that the DHT’s ring maintenance protocol involves work proportional to k, it’s clear that beyond a certain point a high virtual-to-physical node ratio becomes prohibitive in terms of ring maintenance costs.

Furthermore, lookups in Chord-style DHTs take O(log n) hops on average. Increasing the number of nodes in the ring by a factor of k makes lookup take O(log n + log k) hops.

Where do these distributions come from in the first place?

Imagine walking around the ring, clockwise from key 0. Because we choose node IDs uniformly at random, then as we walk at a constant rate, we have a constant probability of stumbling across a node each step we take. This makes meeting a node on our walk a Poisson process. Wikipedia tells us that the exponential distribution “describes the time between events in a Poisson process,” and that summing multiple independent exponentially-distributed random variables gives us an Erlang distribution, so here we are.


The distribution of responsibility for segments of the ring among the nodes in a randomly-allocated Chord-style DHT is unfair, with some nodes receiving too large an interval of keyspace.

Interval lengths in such a ring follow an exponential distribution. Increasing the number of virtual nodes per physical node leads to load allocation following an Erlang distribution, which improves the situation, but only at a cost of increased ring-maintenance overhead and more hops in key lookups.

OK, so it turns out other people have already thought all this through

I should have Googled properly for this stuff before I started thinking about it.

A couple of easily-findable papers [2,3] mention the exponential distribution of interval sizes. Some (e.g. [3]) even mention the gamma distribution, which is the generalization of the Erlang distribution to real k. There are some nice measurements of actual DHT load closely following an exponential distribution in [4], which is a paper on improving load balancing in Chord-style DHTs. Karger and Ruhl [5] report on a scheme for maintaining a bunch of virtual nodes per physical node, but avoiding the ring-maintenance overhead of doing so by only activating one at a time.

Cassandra and its virtual nodes

Cassandra, a NoSQL database using a DHT for scalability, defaults to 256 virtual nodes (vnodes) per physical node in recent releases. The available information (e.g. this and this) suggests this is for quicker recovery on node failure, and not primarily for better load-balancing properties. From this document I gather that this is because Cassandra never used to allocate node IDs randomly: you used to choose them up-front. New releases do choose node IDs randomly, though. Here, they report some experience with the new design:

However, variations of as much as 7% have been reported on small clusters when using the num_tokens default of 256.

This isn’t so surprising, given that our Erlang distribution tells us that choosing k=256 should yield a standard deviation of roughly 1/16 of the expected interval size.


[1] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, “Chord: A scalable peer-to-peer lookup service for internet applications,” in ACM SIGCOMM, 2001. PDF available online.

[2] H. Niedermayer, S. Rieche, K. Wehrle, and G. Carle, “On the Distribution of Nodes in Distributed Hash Tables,” in Proc. Workshop Peer-to-Peer-Systems and Applications (KiVS), 2005. PDF available online.

[3] N. Xiaowen, L. Xianliang, Z. Xu, T. Hui, and L. Lin, “Load distributions of some classic DHTs,” J. Systems Engineering and Electronics 20(2), pp. 400–404, 2009. PDF available online.

[4] S. Rieche, L. Petrak, and K. Wehrle, “A Thermal-Dissipation-based Approach for Balancing Data Load in Distributed Hash Tables,” in Proc. IEEE Conference on Local Computer Networks (LCN 2004), Tampa, FL, USA, November 2004. PDF available online.

[5] D. Karger and M. Ruhl, “Simple efficient load balancing algorithms for peer-to-peer systems,” Proc. Symp. Parallelism in Algorithms and Architectures (SPAA), 2004. PDF available online.

[6] B. Mejías Candia, “Beernet: A Relaxed Approach to the Design of Scalable Systems with Self-Managing Behaviour and Transactional Robust Storage,” PhD thesis, École Polytechnique de Louvain, 2010. PDF available online.

  1. Actually, it’s a geometric distribution because it’s discrete, but the size of the keyspace is so large that working with a continuous approximation makes sense.