Chord-style DHTs don't fairly balance their load without a fight

Figure 1: The ring structure of a Chord-style DHT, with nodes placed around the ring.

I’ve been experimenting with Distributed Hash Tables (DHTs) recently. A common base for DHT designs is Chord [1], where nodes place themselves randomly onto a ring-shaped keyspace; figure 1 shows an example.

One of the motivations of a DHT is to fairly distribute keyspace among participating nodes. If the distribution is unbalanced, some nodes will be doing an unfairly large amount of work, and will be storing an unfairly large amount of data on behalf of their peers.

This post investigates exactly how unfair it can get (TL;DR: very), and examines some options for improving the fairness of the span distribution (TL;DR: virtual nodes kind of work).

Background: Chord-style DHTs

Chord-style DHTs map keys to values. Keys are drawn from a k-bit key space. It’s common to imagine the whole key space, between 0 and 2^k - 1, as a ring, as illustrated in figure 1.

Each node in the DHT takes responsibility for some contiguous span of the key space. A newly-created node chooses a random number between 0 and 2^k - 1, and uses it as its node ID. When a node joins the DHT, one of the first things it does is find its predecessor node. Once it learns its predecessor’s node ID, it takes responsibility for keys in the span (pred, self] between its predecessor’s ID and its own ID [6].

If the keys actually in use in a particular DHT are uniformly distributed at random, then the load on a node is directly proportional to the fraction of the ring’s keyspace it takes responsibility for. I’ll assume this below without further comment.

Below, I’ll use n to indicate the number of nodes in a particular DHT ring.

Expected size of a node’s region of responsibility

My intuitions for statistics are weak, so I wrote a short and ugly program to experimentally explore the distribution of responsibility interval sizes. My initial assumption (naive, I know) was that the sizes of intervals between adjacent nodes would be normally distributed.
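
A minimal Python sketch along the following lines (not the original program; the parameters are illustrative) is enough to rerun the experiment:

import random

# Drop n node IDs uniformly at random onto a unit-circumference ring and measure
# the length of keyspace each node ends up responsible for.
n = 10_000
ids = sorted(random.random() for _ in range(n))
intervals = [b - a for a, b in zip(ids, ids[1:])]
intervals.append(1.0 - ids[-1] + ids[0])      # the interval that wraps past zero

intervals.sort()
print(f"mean={sum(intervals)/n:.2e}  median={intervals[n//2]:.2e}  max={intervals[-1]:.2e}")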

Figure 2: Distribution of distances between adjacent nodes, from a simulated 10,000-node DHT. The mean interval length, 1/10,000 = 10^-4, is marked with a vertical line. The median (not marked) is 0.69×10^-4: less than the mean. The blue curve is the exponential distribution with β=10^-4.

I was wrong! The actual distribution, confirmed by experiment, is exponential¹ (figure 2).

In an exponential distribution, the majority of intervals are shorter than the expected length of 1/n. A few nodes are given a very unfair share of the workload, with an interval much longer than 1/n.

How bad can it get?

The unfairness can be as bad as a factor of O(log n). For example, in DHTs with 10^4 nodes, nodes have to be prepared to store between ten and fifteen times as many key/value pairs as the mean.
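
A quick, equally illustrative way to get a feel for that worst case is to measure the largest interval across a handful of simulated rings:

import random

# How big does the largest interval get, relative to the mean interval 1/n?
n, trials = 10_000, 20
worst = []
for _ in range(trials):
    ids = sorted(random.random() for _ in range(n))
    gaps = [b - a for a, b in zip(ids, ids[1:])] + [1.0 - ids[-1] + ids[0]]
    worst.append(max(gaps) * n)               # largest interval, in multiples of 1/n
print(f"largest interval over {trials} rings: {min(worst):.1f}x to {max(worst):.1f}x the mean")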

Improving uniformity, while keeping random node ID allocation

How can we avoid this unfairness?

One idea is to introduce many virtual nodes per physical node: to let a physical node take multiple spots on the ring, and hence multiple shorter keyspace intervals of responsibility. It turns out that this works, but at a cost.

If each physical node takes k points on the ring, we end up with kn intervals, each of expected length 1/kn. The lengths of these shorter intervals are exponentially distributed. Each node takes k of them, so to figure out how much responsibility, and hence how much load, each node will have, we need the distribution describing the sum of the interval lengths.
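
Before reaching for theory, a rough simulation of per-node load under this scheme (again just a sketch, with illustrative parameters) looks like this:

import random

# Each of n physical nodes places k virtual points on the ring; a node's load is
# the total length of the k intervals its points are responsible for.
n, k = 10_000, 4
points = sorted((random.random(), i % n) for i in range(k * n))   # (position, owner)

load = [0.0] * n
prev = points[-1][0] - 1.0                    # wrap around: predecessor of the first point
for pos, owner in points:
    load[owner] += pos - prev
    prev = pos

load.sort()
print(f"min={load[0]*n:.2f}  median={load[n//2]*n:.2f}  max={load[-1]*n:.2f}  (multiples of 1/n)")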

An Erlang distribution gives us exactly what we want. From Wikipedia:

[The Erlang distribution] is the distribution of the sum of k independent exponential variables with mean μ.

We’ve already (figure 2) seen what happens when k=1. The following table (figure 3) shows the effect of increasing k: subdividing the ring into shorter intervals, and then taking several of them together to be the range of keys each node is responsible for.

[Figure 3 images: a grid of simulated distributions for k = 2, 3, 4, 5; k = 10, 15, 20, 25; and k = 50, 75, 100, 125.]
Figure 3. Effects of increasing k. The green areas show results from simulation; the blue curve overlaid on each plot is the Erlang distribution with μ=1/kn. The mean interval length, 1/10,000, is marked with a vertical line.

We see that the distribution of load across nodes gets fairer as k increases, becoming closer and closer to a normal distribution with on average 1/n of the ring allocated to each node.

The distribution gets narrower and narrower as k gets large. Once k is large enough, we can plausibly use a normal approximation to the Erlang distribution. Since the sum of k independent exponentials, each with mean μ = 1/kn, has mean k·μ = 1/n and standard deviation sqrt(k)·μ = 1/(sqrt(k)·n), the standard deviation is 1/sqrt(k) of the expected allocation size.

That is, every time we double k, the distribution only gets sqrt(2) times tighter. Considering that the DHT’s ring maintenance protocol involves work proportional to k, it’s clear that beyond a certain point a high virtual-to-physical node ratio becomes prohibitive in terms of ring maintenance costs.

Furthermore, lookups in Chord-style DHTs take O(log n) hops on average. Increasing the number of points on the ring by a factor of k makes lookups take O(log kn) = O(log n + log k) hops.

Where do these distributions come from in the first place?

Imagine walking around the ring, clockwise from key 0. Because we choose node IDs uniformly at random, then as we walk at a constant rate, we have a constant probability of stumbling across a node each step we take. This makes meeting a node on our walk a Poisson process. Wikipedia tells us that the exponential distribution “describes the time between events in a Poisson process,” and that summing multiple independent exponentially-distributed random variables gives us an Erlang distribution, so here we are.

Conclusion

The distribution of responsibility for segments of the ring among the nodes in a randomly-allocated Chord-style DHT is unfair, with some nodes receiving too large an interval of keyspace.

Interval lengths in such a ring follow an exponential distribution. Increasing the number of virtual nodes per physical node leads to load allocation following an Erlang distribution, which improves the situation, but only at a cost of increased ring-maintenance overhead and more hops in key lookups.


OK, so it turns out other people have already thought all this through

I should have Googled properly for this stuff before I started thinking about it.

A couple of easily-findable papers [2,3] mention the exponential distribution of interval sizes. Some (e.g. [3]) even mention the gamma distribution, which is the generalization of the Erlang distribution to real k. There are some nice measurements of actual DHT load closely following an exponential distribution in [4], which is a paper on improving load balancing in Chord-style DHTs. Karger and Ruhl [5] report on a scheme for maintaining a bunch of virtual nodes per physical node, but avoiding the ring-maintenance overhead of doing so by only activating one at a time.

Cassandra and its virtual nodes

Cassandra, a NoSQL database using a DHT for scalability, defaults to 256 virtual nodes (vnodes) per physical node in recent releases. The available information (e.g. this and this) suggests this is for quicker recovery on node failure, and not primarily for better load-balancing properties. From this document I gather that this is because Cassandra historically didn’t allocate node IDs randomly: you used to choose them up-front. New releases do choose node IDs randomly, though. Here, they report some experience with the new design:

However, variations of as much as 7% have been reported on small clusters when using the num_tokens default of 256.

This isn’t so surprising, given that our Erlang distribution tells us that choosing k=256 should yield a standard deviation of roughly 1/16 (about 6%) of the expected interval size.


References

[1] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, “Chord: A scalable peer-to-peer lookup service for internet applications,” in ACM SIGCOMM, 2001. PDF available online.

[2] H. Niedermayer, S. Rieche, K. Wehrle, and G. Carle, “On the Distribution of Nodes in Distributed Hash Tables,” in Proc. Workshop Peer-to-Peer-Systems and Applications (KiVS), 2005. PDF available online.

[3] N. Xiaowen, L. Xianliang, Z. Xu, T. Hui, and L. Lin, “Load distributions of some classic DHTs,” J. Systems Engineering and Electronics 20(2), pp. 400–404, 2009. PDF available online.

[4] S. Rieche, L. Petrak, and K. Wehrle, “A Thermal-Dissipation-based Approach for Balancing Data Load in Distributed Hash Tables,” in Proc. IEEE Conference on Local Computer Networks (LCN 2004), Tampa, FL, USA, November 2004. PDF available online.

[5] D. Karger and M. Ruhl, “Simple efficient load balancing algorithms for peer-to-peer systems,” Proc. Symp. Parallelism in Algorithms and Architectures (SPAA), 2004. PDF available online.

[6] B. Mejías Candia, “Beernet: A Relaxed Approach to the Design of Scalable Systems with Self-Managing Behaviour and Transactional Robust Storage,” PhD thesis, École Polytechnique de Louvain, 2010. PDF available online.

  1. Actually, it’s a geometric distribution because it’s discrete, but the size of the keyspace is so large that working with a continuous approximation makes sense. 

A calling convention for ARM that supports proper tail-calls efficiently

Because proper tail calls are necessary for object-oriented languages, we can’t quite use the standard calling conventions unmodified when compiling OO languages efficiently to ARM architectures.

Here’s one approach to a non-standard, efficient, tail-call-supporting calling convention that I’ve been exploring recently.

The big change from the standard is that we do not move the stack pointer down over outbound arguments when we make a call.

Instead, the callee moves the stack pointer as they see fit. The reason for this is so that the callee can tail-call someone else without having to do any hairy adjusting of the frame, and so that the original caller doesn’t have to know anything about what’s left to clean up when they receive control: all the clean-up has already been completed.

This bears stating again: just after return from a subroutine, all clean-up has already been completed.

In the official standard, the stack space used to communicate arguments to a callee is owned by the caller. In this modified convention, that space is owned by the callee as soon as control is transferred.

Other aspects of the convention are similar to the AAPCS standard:

  • keep the stack Full Descending, just like the standard.
  • ensure it is 8-byte aligned at all times, just like (a slight restriction of) the standard.
  • make outbound arguments leftmost-low in memory, that is, “pushed from right to left”. This makes the convention compatible with naive C struct overlaying of memory.
  • furthermore, ensure argument 0 in memory is also 8-byte aligned.

Details of the stack layout

Consider compiling a single subroutine, either a leaf or a non-leaf routine. We need to allocate stack space to incoming arguments, to saved temporaries, to outbound arguments, and to padding so we maintain the correct stack alignment. Let

  • Ni = inward-arg-count, the number of arguments the routine expects
  • No = most-tail-args, the largest number of outbound tail-call arguments the routine produces
  • Nt = inward-temp-count, the number of temps the routine requires
  • Na = outward-arg-count, the number of arguments supplied in a particular call the routine makes to some other routine

Upon entry to the routine, where Ni=5, No=7, Nt=3, Na=3, we have the following stack layout. Recall that stacks are full-descending.

(low)                                                               (high)
    | outbound  |   |   temps   |   |shuffle|      inbound      |
    | 0 | 1 | 2 |---| 0 | 1 | 2 |---| - | - | 0 | 1 | 2 | 3 | 4 |---|
                    ^                                               ^
                  sp for non-leaf                                sp for leaf

I’ve marked two interesting locations in the stack: the position of the stack pointer for leaf routines, and the position of the stack pointer for non-leaf routines, which need some space of their own to store their internal state at times when they delegate to another routine. Leaf routines simply leave the stack pointer in place as they start execution; non-leaf routines adjust the stack pointer themselves as control arrives from their caller.

Note that the first four arguments are transferred in registers, but that stack slots still need to be reserved for them. Note also the padding after the outbound arguments, the temps, and the inbound/shuffle-space.

The shuffle-space is used to move values around during preparation for a tail call whenever the routine needs to supply more arguments to the tail-called routine than it received in turn from its caller.

The extra shuffle slots are only required if there’s no room in the inbound slots plus padding. For example, if Ni=5 and No=6, then since we expect the inbound arguments to have one slot of padding, that slot can be used as shuffle space.

Addressing calculations

Leaf procedures do not move the stack pointer on entry; non-leaf procedures do. This means the addressing calculations differ depending on whether we are compiling a leaf or a non-leaf procedure.

  • Pad8(x) = x rounded up to the nearest multiple of 8.
  • sp_delta = Pad8(No * 4) + Pad8(Nt * 4), the distance SP might move on entry and exit.

Leaf procedures, where the stack pointer does not move on entry to the routine:

inward(n) = rn, if n < 4
          | sp - Pad8(Ni * 4) + (n * 4)
temp(n) = sp - sp_delta + (n * 4)
outward(n) (tail calls only) = rn, if n < 4
                             | sp - Pad8(Na * 4) + (n * 4)

Non-leaf procedures, where the stack pointer moves down by sp_delta bytes on entry to the routine:

inward(n) = rn, if n < 4
          | sp + sp_delta - Pad8(Ni * 4) + (n * 4)
temp(n) = sp + (n * 4)
outward(n) (non-tail calls) = rn, if n < 4
                            | sp - Pad8(Na * 4) + (n * 4)
outward(n) (tail calls) = rn, if n < 4
                        | sp + sp_delta - Pad8(Na * 4) + (n * 4)
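
To make the arithmetic concrete, here is a small Python sketch (purely illustrative, not part of any toolchain) that evaluates the formulas above for the worked example Ni=5, No=7, Nt=3, Na=3:

# Offsets are byte offsets from the routine's current stack pointer; the first
# four arguments travel in registers r0-r3.
def pad8(x):
    return (x + 7) & ~7                       # round up to a multiple of 8

def layout(ni, no, nt, na, leaf):
    sp_delta = pad8(no * 4) + pad8(nt * 4)
    base = 0 if leaf else sp_delta            # a non-leaf has already moved sp down by sp_delta
    slot = lambda n, off: f"r{n}" if n < 4 else f"[sp, #{off}]"
    return {
        "sp_delta": sp_delta,
        "inward":   [slot(n, base - pad8(ni * 4) + n * 4) for n in range(ni)],
        "temps":    [f"[sp, #{base - sp_delta + n * 4}]" for n in range(nt)],
        "out_tail": [slot(n, base - pad8(na * 4) + n * 4) for n in range(na)],
        "out_call": [slot(n, -pad8(na * 4) + n * 4) for n in range(na)],   # non-tail calls (non-leaf only)
    }

print(layout(5, 7, 3, 3, leaf=True))
print(layout(5, 7, 3, 3, leaf=False))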

Variations

This convention doesn’t easily support varargs. One option would be to sacrifice simple C struct overlaying of the inbound argument stack area, flipping arguments so they are pushed from left to right instead of from right to left. That way, the first argument is always at a known location.

Another option would be to use an argument count at the end of the argument list in the varargs case. This requires both the caller and callee to be aware that a varargs-specific convention is being used.

Of course, varargs may not even be required: instead, a vector could be passed in as a normal argument. Whether this makes sense or not depends on the language being compiled.

Successors to "Enterprise Integration Patterns"?

“Enterprise Integration Patterns”, by Gregor Hohpe and Bobby Woolf, has been a classic go-to volume for a lot of people working with distributed systems over the years.

It was published back in 2003, though, before things like AMQP, ZeroMQ, WebSockets and Twitter.

Is there anything that could be considered an update on the book? Something that covers modern integration scenarios. Perhaps something that touches not only on the newer messaging technologies but also on NoSQL, improvements to the browser environment, and so on.

What are people reading to get a common vocabulary for all this stuff and to get their heads around how the pieces fit together?

Mac OS X gripes

I’ve been using a Mac as my personal computer since late 2003. It’s been fine for all of that time, but recent releases of the software are starting to make me want to go back to Debian or Ubuntu, warts and all.

  • Every time I restart my machine, it forgets my trackpad settings.

  • About one time in ten, when I wake my machine from sleep, it shows the beachball forever and never comes back.

Combine that with the lack of a compiler shipped with the machine by default, and Debian is starting to look downright attractive again.

'bad_vertex' errors while developing and testing RabbitMQ plugins

Today I have been doing maintenance work on an old RabbitMQ plugin of mine. Part of this work was updating its Makefiles to work with the latest RabbitMQ build system.

The problem and symptom

After getting it to compile, and trying to run it, I started seeing errors like this:

Error: {'EXIT',
          {{badmatch,
               {error,
                   {edge,{bad_vertex,mochiweb},rabbitmq_mochiweb,mochiweb}}},
           [{rabbit_plugins,dependencies,3,
                [{file,"src/rabbit_plugins.erl"},{line,100}]},
            {rabbit_plugins_main,format_plugins,4,
                [{file,"src/rabbit_plugins_main.erl"},{line,184}]},
            {rabbit_plugins_main,start,0,
                [{file,"src/rabbit_plugins_main.erl"},{line,70}]},
            {init,start_it,1,[]},
            {init,start_em,1,[]}]}}

The error appeared not just when running rabbitmq-plugins, but also, in a similar form, when starting the RabbitMQ server itself.

The reason turned out to be simple: I had symbolically linked the rabbitmq-mochiweb and mochiweb-wrapper directories into my plugins directory, as per the manual, but what the manual didn’t say was that this works for all plugins except the -wrapper plugins (and the Erlang “client” plugin, rabbitmq-erlang-client a.k.a. amqp_client.ez).

The solution

Symlink all the plugins except the wrapper plugins and amqp_client.ez.

The wrapper plugin *.ez files and amqp_client.ez need to be present in the plugins directory itself. So instead of the instructions given, try the following steps:

mkdir -p rabbitmq-server/plugins
cd rabbitmq-server/plugins
ln -s ../../rabbitmq-mochiweb
cp rabbitmq-mochiweb/dist/mochiweb*ez .
cp rabbitmq-mochiweb/dist/webmachine*ez .
../scripts/rabbitmq-plugins enable rabbitmq_mochiweb

A working configuration for me has the following contents of the plugins directory:

total 816
-rw-r--r--   1 tonyg  staff  260123 Sep 17 18:36 mochiweb-2.3.1-rmq0.0.0-gitd541e9a.ez
lrwxr-xr-x   1 tonyg  staff      23 Sep 17 17:59 rabbitmq-mochiweb -> ../../rabbitmq-mochiweb
-rw-r--r--   1 tonyg  staff  149142 Sep 17 18:37 webmachine-1.9.1-rmq0.0.0-git52e62bc.ez

Co-optional arguments?

Racket and many other languages have the concept of optional arguments at the receiver side:

(define (a-procedure mandatory-arg #:foo [foo 123])
  ;; Use mandatory-arg and foo as normal
  ...)

(a-procedure mandatory-value #:foo 234) ;; override foo
(a-procedure mandatory-value) ;; leave foo alone

Here, the procedure is offering its caller the option of supplying a value for foo, and, if the caller does not do so, it uses 123 as the value for foo itself. The caller is permitted not to care (or even to know!) about the #:foo option.

Almost no language that I know of has the symmetric case: an optional argument at the caller side (and because the feature doesn’t exist, I’m having to invent syntax to show what I mean):

(define (run-callback callback)
  (callback mandatory-value [#:foo 234]))

(run-callback (lambda (mandatory-arg) ...)) ;; ignores the extra value
(run-callback (lambda (mandatory-arg #:foo [foo 123]) ...)) ;; uses it

The intent here is that the procedure is permitted not to care (or even to know) about the additional value its caller supplies.

This would be useful for callbacks and other situations where the code invoking a procedure has potentially-useful information that the specific (and statically-unknown) procedure being called may or may not care about.
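
For comparison, here is a rough Python sketch of the same idea (hypothetical names; Python's inspect module stands in for what the language would ideally do on your behalf): the caller examines the callback's signature and supplies the extra keyword only if the callback can accept it.

import inspect

def run_callback(callback):
    extra = {"foo": 234}                      # information only the caller has
    params = inspect.signature(callback).parameters
    takes_kwargs = any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())
    kwargs = {k: v for k, v in extra.items() if takes_kwargs or k in params}
    return callback("mandatory-value", **kwargs)

print(run_callback(lambda mandatory_arg: mandatory_arg))                    # ignores the extra value
print(run_callback(lambda mandatory_arg, foo=123: (mandatory_arg, foo)))    # uses it

The difference, of course, is that here the caller has to do the plumbing explicitly; the point of co-optional arguments is that the language would take care of it.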

I said above that almost no languages have this feature: it turns out that Squeak and Pharo have something similar in their BlockClosure>>#cull: methods. The following code works when given a closure accepting zero, one or two arguments:

someBlock cull: 1 cull: 2 cull: 3

(Update: @msimoni points out that Common Lisp’s allow-other-keys is similar, too. It still requires the callee to declare something extra, though.)

The Software Crisis, 2012 edition

This morning’s breakfast experiment involves rough estimation of the changes in cost and capacity of computing power over the last few years, and imagining where that will leave us in ten years.

In summary: hardware is already superpowered. The gap between hardware and software is already huge. It’s only going to get worse. What is to be done about the parlous state of software?

SD Cards

Rough googling suggests a 512MB card cost about $60 in 2005. That’s around $68 in 2012 dollars. A commodity 32GB card costs about $25 today. Assuming exponential decline in price, or equivalently exponential increase in capacity for constant cost, we see that we can expect capacity per dollar to roughly double year on year.

CPU

More rough googling suggests CPU capacity (measured in GFLOPS) is increasing at roughly a factor of 1.6 per year. GPUs are improving more quickly, approximately doubling in speed each year.

DRAM

Wikipedia informs me that in 1971, DRAM cost 5 cents per bit, and in 1999 it cost 20 microcents per bit. Again assuming exponential scaling, that gives an increase in capacity per dollar of roughly a factor of 1.56 year on year.

Total

Component    2012    2017     2022
Storage      1       ~32      ~1024
Compute      1       ~10.5    ~110
RAM          1       ~9.2     ~85
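
The table can be reproduced from the growth estimates above with a few lines of arithmetic (a sketch; the factors are the rough estimates quoted earlier):

# Reproduce the projection table from the rough annual growth factors.
dram_factor = (5 / 20e-6) ** (1 / 28)         # 5 cents (1971) vs 20 microcents (1999) per bit, 28 years
rates = {"Storage": 2.0, "Compute": 1.6, "RAM": dram_factor}
for component, rate in rates.items():
    print(f"{component:8}  2017: ~{rate ** 5:6.1f}x   2022: ~{rate ** 10:7.1f}x")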

In five years’ time, expect to be working with machines that are ten times as fast, that have ten times as much RAM, and that have thirty-two times as much secondary storage as today’s machines.

In ten years’ time, expect machines one hundred times as fast, with one hundred times the amount of RAM, and one thousand times the amount of storage.

I’m not even sure how to measure progress in software. But my impression is that it isn’t keeping up its end of the bargain. Perhaps we’re seeing linear improvement, at best.

I think a big part of the problem is that our ambition hasn’t increased to match our capacity. We haven’t kept our expectations of software in line with the capabilities of our new hardware.

RabbitHub Status, May 2012

RabbitHub, an implementation of PubSubHubbub I built as part of my work on RabbitMQ, hasn’t been maintained for a while now. It lags the current state of the art in two respects:

  1. it will need straightforward updates to work with the current versions of RabbitMQ Server, and
  2. it implements PSHB version 0.1 rather than the current 0.3.

Fixing the former would take about two days of expert work, and I don’t know how much work fixing the latter would be. Both are perfectly reasonable things to consider, though.

RabbitHub is unique, as far as I know, in one respect: it uses PSHB as a generic webhook-based messaging protocol, rather than the narrower Atom distribution protocol its designers had in mind. There are other PSHB implementations, notably the reference implementation and Superfeedr, but as far as I know they are all specialized to Atom transport. A list of hub implementations can be found at http://code.google.com/p/pubsubhubbub/wiki/Hubs.

Fame, of a sort

Per Chris Cunnington’s recent survey, it looks like two of my SqueakSource projects have more than 1000 downloads and are therefore in the top 500 projects on the site! ;-)

And I’ll somewhat cheekily claim a small portion of the credit for Max Leske’s GitFS, with 8339 downloads, since at its core it was originally based on my Git for Squeak project, though it has definitely moved far beyond the humble beginning I gave it.

Network server programming with OCaml

Some years ago, I spent a few hours figuring out how to write a multithreaded TCP/IP server in SML/NJ. Here, I’ve performed the same exercise with OCaml 3.12.1.

The code for this example is on github.

Downloading, building, and running

The following example comprises a _tags file, for controlling ocamlbuild, and the .ml file itself; the complete sources are in the github repository linked above.

Running the following command compiles the project:

ocamlbuild test.native

The ocamlbuild output is a native executable. The executable is placed in the _build directory, and a symbolic link to the executable is placed in the working directory. To run the program:

./test.native

The build control file

The _tags file contains

true: use_unix
true: thread

which instructs the build system to include the Unix POSIX module, which provides a BSD-sockets API, and to configure the OCaml runtime to support the Thread lightweight-threads module. For more information about ocamlbuild, see here.

The example source code

Turning to test.ml now, we first bring the contents of a few modules into scope:

open Unix
open Printf
open Thread

The Unix module, mentioned above, provides a POSIX BSD-sockets API; Printf is for formatted printing; and Thread is for multithreading. We’ll be using a single thread per connection. Other models are possible.

OCaml programs end up being written upside down, in a sense, because function definitions need to precede their use (unless mutually-recursive definitions are used). For this reason, the next chunk is conn_main, the function called in a new lightweight thread when an inbound TCP connection has been accepted. Here, it simply prints out a countdown from 10 over the course of the next five seconds or so, before closing the socket. Multiple connections end up running conn_main in independent threads of control, leading automatically to the natural and obvious interleaving of outputs on concurrent connections.

let conn_main s =
  let cout = out_channel_of_descr s in
  let rec count n =
    match n with
    | 0 ->
        fprintf cout "Bye!\r\n%!"
    | _ ->
        fprintf cout "Hello %d\r\n%!" n;
        Thread.delay 0.5;
        count (n - 1)
  in
  count 10;
  printf "Closing the connection.\n%!";
  close s

Note the mysterious %! format specifiers in the format strings: these translate into calls to flush, forcing buffered output out to the actual socket. The alternative would be to call flush cout directly ourselves.

The function that depends on conn_main is the accept loop, which repeatedly accepts a connection and spawns a connection thread for it.

let rec accept_loop sock =
  let (s, _) = accept sock in
  printf "Accepted a connection.\n%!";
  let _ = Thread.create conn_main s in
  accept_loop sock

Finally, the block of code that starts the whole service running, creating the TCP server socket and entering the accept loop. We set SO_REUSEADDR on the socket, listen on port 8989 with a connection backlog of 5, and enter the accept loop. We catch and ignore SIGPIPE in order to avoid crashing when a client departs unexpectedly.

let _ =
  Sys.set_signal Sys.sigpipe Sys.Signal_ignore;
  let sock = socket PF_INET SOCK_STREAM 0 in
  setsockopt sock SO_REUSEADDR true;
  bind sock (ADDR_INET (inet_addr_of_string "0.0.0.0", 8989));
  listen sock 5;
  printf "Entering accept loop...\n%!";
  accept_loop sock
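
To exercise the server, connect to port 8989 with telnet or nc; alternatively, here is a minimal test-client sketch in Python (assuming the server above is running locally):

import socket

# Connect to the countdown server and print everything it sends until it closes
# the connection.
with socket.create_connection(("127.0.0.1", 8989)) as conn:
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            break
        print(chunk.decode("ascii"), end="")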