The Code Pane is a command line, not a text editor

The system browser in Cuis, a dialect of Squeak Smalltalk, looks like this:

The Cuis 3.3 System Browser

Notice that there is no UI for creating new methods!

There’s a strange and subtle shift in UI design thinking in effect here. Instead of the purpose of the code pane being to “edit some little snippet of code already associated with the current class”, thereby necessitating UI for creating new methods, its purpose is to “interpret any submitted piece of code in the context of the current class”, thereby permitting both definition of new methods and updates to existing methods.

Whenever you type new text into the code pane and accept the changes (by pressing M-s), it compiles whatever text is present as if it were a method definition in the selected class. If that happens to be a new variation on saveTo:, so be it; if it’s an entirely new method, no problem, a new method is compiled and inserted into the class’s method dictionary.

The code pane is really a procedure, accepting arbitrary method source code and having the side effect of compiling that code in the context of the currently-selected class.
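The behaviour can be sketched by analogy in Python rather than Smalltalk (a hedged illustration only; `accept` and `Point` are invented names, not anything from Cuis): whatever source is submitted gets compiled and installed in the selected class, whether or not a method of that name already existed.

```python
# A sketch, in Python rather than Smalltalk, of the "code pane" as a
# procedure: compile arbitrary submitted source in the context of the
# currently-selected class. `accept` and `Point` are invented names.
def accept(cls, source):
    namespace = {}
    exec(source, {}, namespace)          # compile the submitted text
    for name, fn in namespace.items():
        setattr(cls, name, fn)           # insert into the "method dictionary"

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

# Submitting brand-new source needs no special "create method" UI:
accept(Point, "def magnitude(self):\n    return (self.x ** 2 + self.y ** 2) ** 0.5")
print(Point(3, 4).magnitude())           # 5.0
```

Accepting a second definition of `magnitude` would simply overwrite the first, just as re-accepting a method's source does in the browser.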

The system browser has been this way in Smalltalk since 1980. It’s eye-opening to contrast it with the rigidly file-oriented editors present in IDEs from other systems, from Emacs and Vi up through IntelliJ IDEA and Eclipse.

Constructing Kernel's $if

John Shutt’s PhD thesis introduces the Kernel programming language. Kernel includes an $if, which is treated as a built-in primitive.

It doesn’t have to be primitive, though: in Kernel, a construct that behaves identically to the primitive $if operation (in well-typed programs) can be defined within the language itself.

In Scheme, one needs the macro facility to construct an if when there isn’t one built in:

(define true (lambda (first second) first))
(define false (lambda (first second) second))
(define-syntax if
  (syntax-rules ()
    ((_ test if-true if-false)
     ((test (lambda () if-true) (lambda () if-false))))))

In Kernel, we use fexprs instead:

($define! true ($lambda (first second) first))
($define! false ($lambda (first second) second))
($define! $if
  ($vau (test if-true if-false) env
    (eval ((eval test env) if-true if-false) env)))

Here we can see the central opposition between Scheme and Kernel: Scheme’s evaluation of forms is implicit, and we use lambda to postpone evaluation; Kernel’s evaluation of forms is explicit, and we use eval to cause evaluation.

Expressions used as tests absolutely must evaluate to true or false, and nothing else, for these definitions to work. One of the benefits of treating $if as primitive is that one can provide for a separation between Booleans and the rest of the universe that’s not possible otherwise. It’s nice to know, though, that $if can be dispensed with if necessary.
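The same construction can be sketched in Python (a hedged illustration with invented names, not part of either language's standard library): booleans become two-argument selector functions, and zero-argument lambdas stand in for Scheme's thunks, postponing evaluation of the branches.

```python
# Booleans as selectors: TRUE picks its first argument, FALSE its second.
TRUE = lambda first, second: first
FALSE = lambda first, second: second

def if_(test, if_true, if_false):
    # test selects one of the two thunks; calling the result forces
    # only the chosen branch.
    return test(if_true, if_false)()

# The false branch divides by zero, but is never forced:
result = if_(TRUE, lambda: "yes", lambda: 1 / 0)
print(result)   # yes
```

As in the Scheme and Kernel versions, this only works if the test evaluates to one of the two selector functions and nothing else.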

Oakley Groups 2 and 14 with OpenSSL

Yesterday, while experimenting with Diffie-Hellman key exchange as done by the SSH protocol, I discovered (after quite a bit of confusion) that the 1024-bit and 2048-bit parameter groups included with OpenSSL’s Diffie-Hellman implementation are not the same as the 1024-bit “group 2” from RFC 2409 and the 2048-bit “group 14” from RFC 3526! They’re completely unrelated.

Once I discovered this, I spent some time figuring out how to construct PEM files containing the right values formatted appropriately for use with OpenSSL.

If you need to use the “Second Oakley Group” (1024-bit MODP group) from RFC 2409 with OpenSSL, you should load the following parameter file, oakley-group-2.pem, using OpenSSL’s PEM_read_DHparams function:

-----BEGIN DH PARAMETERS-----
MIGHAoGBAP//////////yQ/aoiFowjTExmKLgNwc0SkCTgiKZ8x0Agu+pjsTmyJRSgh5jjQE
3e+VGbPNOkMbMCsKbfJfFDdP4TVtbVHCReSFtXZiXn7G9ExC6aY37WsL/1y29Aa37e44a/ta
iZ+lrp8kEXxLH+ZJKGZR7OZTgf//////////AgEC
-----END DH PARAMETERS-----

If you need to use the 2048-bit MODP group from RFC 3526 with OpenSSL, load the following parameter file, oakley-group-14.pem:

-----BEGIN DH PARAMETERS-----
MIIBCAKCAQEA///////////JD9qiIWjCNMTGYouA3BzRKQJOCIpnzHQCC76mOxObIlFKCHmO
NATd75UZs806QxswKwpt8l8UN0/hNW1tUcJF5IW1dmJefsb0TELppjftawv/XLb0Brft7jhr
+1qJn6WunyQRfEsf5kkoZlHs5Fs9wgB8uKFjvwWY2kg2HFXTmmkWP6j9JM9fg2VdI9yjrZYc
YvNWIIVSu57VKQdwlpZtZww1Tkq8mATxdGwIyhghfDKQXkYuNs474553LBgOhgObJ4Oi7Aei
j7XFXfBvTFLJ3ivL9pVYFxg5lUl86pVq5RXSJhiY+gUQFXKOWoqsqmj//////////wIBAg==
-----END DH PARAMETERS-----

Running the files through openssl asn1parse shows that the contents are the same as the numbers given in the RFCs:

$ openssl asn1parse < oakley-group-2.pem 
    0:d=0  hl=3 l= 135 cons: SEQUENCE          
    3:d=1  hl=3 l= 129 prim: INTEGER           :FFFFFFFFFFFFFFFFC90FDAA22168C234
C4C6628B80DC1CD129024E088A67CC74020BBEA63B139B22514A08798E3404DDEF9519B3CD3A431B
302B0A6DF25F14374FE1356D6D51C245E485B576625E7EC6F44C42E9A637ED6B0BFF5CB6F406B7ED
EE386BFB5A899FA5AE9F24117C4B1FE649286651ECE65381FFFFFFFFFFFFFFFF
  135:d=1  hl=2 l=   1 prim: INTEGER           :02

$ openssl asn1parse < oakley-group-14.pem 
    0:d=0  hl=4 l= 264 cons: SEQUENCE          
    4:d=1  hl=4 l= 257 prim: INTEGER           :FFFFFFFFFFFFFFFFC90FDAA22168C234
C4C6628B80DC1CD129024E088A67CC74020BBEA63B139B22514A08798E3404DDEF9519B3CD3A431B
302B0A6DF25F14374FE1356D6D51C245E485B576625E7EC6F44C42E9A637ED6B0BFF5CB6F406B7ED
EE386BFB5A899FA5AE9F24117C4B1FE649286651ECE45B3DC2007CB8A163BF0598DA48361C55D39A
69163FA8FD24CF5F83655D23DCA3AD961C62F356208552BB9ED529077096966D670C354E4ABC9804
F1746C08CA18217C32905E462E36CE3BE39E772C180E86039B2783A2EC07A28FB5C55DF06F4C52C9
DE2BCBF6955817183995497CEA956AE515D2261898FA051015728E5A8AACAA68FFFFFFFFFFFFFFFF
  265:d=1  hl=2 l=   1 prim: INTEGER           :02
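The same cross-check can be done without OpenSSL at all, using a minimal DER walk over the decoded PEM body (a sketch using only the Python standard library; `read_tlv` is an invented helper that handles just enough of DER for this structure). The base64 here is the group-2 parameter block from above.

```python
import base64

# The base64 body of oakley-group-2.pem (1024-bit MODP group, RFC 2409).
GROUP2_B64 = """
MIGHAoGBAP//////////yQ/aoiFowjTExmKLgNwc0SkCTgiKZ8x0Agu+pjsTmyJRSgh5jjQE
3e+VGbPNOkMbMCsKbfJfFDdP4TVtbVHCReSFtXZiXn7G9ExC6aY37WsL/1y29Aa37e44a/ta
iZ+lrp8kEXxLH+ZJKGZR7OZTgf//////////AgEC
"""

def read_tlv(der, offset):
    """Read one DER tag-length-value triple; return (tag, value, next_offset)."""
    tag = der[offset]
    length = der[offset + 1]
    offset += 2
    if length & 0x80:                    # long form: low bits count length bytes
        n = length & 0x7F
        length = int.from_bytes(der[offset:offset + n], "big")
        offset += n
    return tag, der[offset:offset + length], offset + length

der = base64.b64decode(GROUP2_B64)
tag, seq, _ = read_tlv(der, 0)
assert tag == 0x30                       # SEQUENCE { INTEGER p, INTEGER g }
tag_p, p_bytes, offset = read_tlv(seq, 0)
tag_g, g_bytes, _ = read_tlv(seq, offset)
p = int.from_bytes(p_bytes, "big")
g = int.from_bytes(g_bytes, "big")
print(p.bit_length(), g)                 # 1024 2
print(f"{p:x}"[:24])                     # ffffffffffffffffc90fdaa2
```

The recovered prime matches the hex digits printed in RFC 2409, and the generator is 2, as expected.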

AMQP 1.0's "Timestamp" definition is bogus

The AMQP 1.0.0 draft of 2011-07-04 contains the following language defining the meaning of its 64-bit “timestamp” data type:

Encodes a point in time using a 64 bit signed integer representing milliseconds since Midnight Jan 1, 1970 UTC. For the purpose of this representation, milliseconds are taken to be (1/(24*60*60*1000))th of a day.

This language is broken.1

The first sentence there makes it a fixed offset from TAI, which is good. The second sentence there is totally bogus and can have no purpose beyond confusing the crap out of anyone trying to work from this definition. It’d be massively improved by just removing the second sentence:

Encodes a point in time using a 64 bit signed integer representing milliseconds since Midnight Jan 1, 1970 UTC.

For clarity, you might want to add the word “elapsed” and also reconfirm that it’s TAI being discussed here:

Encodes a point in time using a 64 bit signed integer representing elapsed milliseconds since Midnight Jan 1, 1970 UTC. Note that time stamps change in lockstep with TAI, not UTC nor Unix time.

Using a TAI-based metric instead of Unix time (time_t) is good because the Unix time_t definition is a bad design: see this page. Furthermore, UTC, which time_t is an encoding of, cannot be used to talk precisely about moments beyond 6 months in the future, because of the leap-second table that it needs.


TAI-based metrics are almost perfect except that they’re damned awkward to use right on most computer systems out there, because most systems use Unix time. Worse, using TAI but offset so that second zero is the time_t epoch gives a number currently within 30s or so of time_t, which could lead to a lot of confusion.

Choosing a representation of an instant for use in a new protocol is hard. You get a tradeoff between:

  • time_t: ambiguous representation and lack of accuracy when referring to moments more than 6mo in the future; on the plus side, wide support and low potential for making mistakes (other than those caused by the inherent flaws in the definition).

  • UTC, but not encoded as a count of seconds since some epoch: unambiguous representation, but lack of accuracy with future dates still and also poor library support plus the hassle of defining some new representation.

  • TAI offset to be based at the time_t epoch: unambiguous representation, fully accurate, but looks so similar to Unix time that people will probably make mistakes. (Will the mistakes be harmful though? How harmful is being off by 30s? You might ask the Mars Climate Orbiter.)

  • Seconds since 2000-01-01T00:00:00Z, in lockstep with TAI: unambiguous representation, fully accurate, looks dissimilar enough that it won’t be mistaken for Unix time, but jolly weird and still needs a leap-second table for display purposes, like any other TAI-based metric.

  • Chronological Julian Days in GMT: unambiguous representation, fully accurate, but difficult to convert back and forth to Unix time without a full calendar package.

Additionally, if you ever want to simply subtract one timestamp from another to get some idea of the interval between them, Julian Days and Unix time don’t work; TAI-based metrics are the only ones that give accurate results for this use.
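The subtraction hazard is easy to demonstrate concretely with Python's leap-second-oblivious `calendar.timegm`: across the leap second inserted at the end of 2016, two SI seconds elapsed, but Unix time reports only one.

```python
import calendar

# A leap second (23:59:60) was inserted at the end of 2016, so two SI
# seconds elapsed between these two instants; time_t arithmetic sees one.
t1 = calendar.timegm((2016, 12, 31, 23, 59, 59, 0, 0, 0))
t2 = calendar.timegm((2017, 1, 1, 0, 0, 0, 0, 0, 0))
print(t2 - t1)   # 1, even though 23:59:60 also occurred in between
```

A TAI-based metric would report 2 here, because it counts every elapsed second.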

Ideally we want these timestamps to be able to be accurate up to and beyond 50 years in the future or so. If it were me, I’d use TAI either with its own built-in epoch, 1958-01-01T00:00:00Z, or with a suitable easy-to-remember epoch such as 2000-01-01T00:00:00Z.

(Thanks to @ciphergoth, @simonmacm, @squaremobius and @michaeljforster for an informative discussion of this issue.)

  1. There’s an open bug report in the AMQP protocol JIRA system for this issue: ISSUE-167 

Signs of an ill-factored system

Dmitry Chestnykh has collected implementations of the echo command from UNIX V5, OpenBSD, Plan 9, FreeBSD, and GNU coreutils.1

Implementation    Lines of code
UNIX V5                      10
OpenBSD                      28
Plan 9                       40
FreeBSD                     108
GNU coreutils               257

From this you might conclude that the GNU implementation is badly bloated. But what’s really going on is (a) feature creep and (b) the inclusion of documentation within the source code of the program itself. UNIX has evolved away from a bare command (with no built-in documentation) accompanied by a separate manpage, toward not only a detailed separate manpage but also quick-reference documentation built into the command itself.

The feature creep leads to a perceived need for the quick-reference documentation. The source to the UNIX V5 implementation is its own reference card! There’d be nothing to write in a usage message.
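For comparison, the whole of a V5-flavoured echo can be sketched in a few lines of Python (a loose translation from memory, not the original C): arguments joined by spaces, a trailing newline, and nothing else.

```python
import sys

# A V5-style echo: no options, no escape processing, no usage message.
def echo(args):
    return " ".join(args) + "\n"

if __name__ == "__main__":
    sys.stdout.write(echo(sys.argv[1:]))
```

There is nowhere for built-in documentation to go, because there is nothing to document.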

Both the feature-creep and the awkwardly-placed documentation are signs of an ill-factored system. A better factoring would keep commands single-purpose and as simple as possible, with documentation managed consistently across the whole system.

Attribute                     | UNIX V5               | Modern Unix                                       | Smalltalk
Command (method) source code  | Short, single-purpose | Long, multi-purpose                               | Short, single-purpose
Command documentation         | Manpage only          | Manpage, info file, built-in to program, web site | Method comment only
How are new variants on       | Shell scripts         | Add the feature to the existing command           | Add new methods and refactor
commands added?               |                       |                                                   | common implementation

I think UNIX V5 and Smalltalk are much better-factored than modern Unixes. Factoring out common functionality and assembling the pieces using scripting is the way things are done in both cases. Smalltalk has an advantage in that the scripting language is the systems programming language, but the core philosophies of the two systems have a lot in common.

Removing the inline quick-reference documentation from the GNU program lops 60 lines off the total. The remainder of the growth can be attributed to the baked-in extra features (GNU supports multiple command-line syntaxes, where UNIX V5 supports… nothing) and to premature2 optimisation (Plan 9 and FreeBSD avoid multiple writes, where UNIX V5 goes for TSTTCPW).

To my mind, either the UNIX V5 or the OpenBSD implementation is perfectly acceptable. The rest are signs of a sick ecosystem: there’s nothing wrong with them intrinsically, but when they’re seen as part of a larger whole, things start to look unhealthy.

  1. HT @silentbicycle 

  2. That’s a little unfair. At the time, such optimisations were crucial to get a running system. Things are very different these days: JIT compilation, flexible kernel/userland boundaries, and quite simply the incredible raw speed of modern machines combine to make the difference between many printfs and a single writev irrelevant. 

Desiderata for a Personal Computing System

I am logged in to IRC, and wish to step away from the keyboard to do the dishes, without missing anything addressed to me. I’d like for the computer to read out loud any messages arriving over IRC that mention my IRC handle.

My personal computing system should permit me to:

  1. Gain an interactive representation of my IRC session,
  2. extract an object representing the stream of incoming utterances,
  3. feed that stream into a new process, within which I
  4. filter the stream for mentions of my IRC handle, and
  5. pipe utterances mentioning my handle through the say command-line utility.

I’d then be able to leave that running while I did the dishes, terminating it upon returning to the keyboard.
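Most of the pieces can be faked in Python today (a sketch with invented names: `mentions` is a hypothetical filter, and the real `say` utility is stubbed out as list-collection). The genuinely hard part is step 1: getting a live, first-class stream out of a running IRC session.

```python
# Steps 2-5 of the pipeline, over a canned stand-in for the utterance stream.
def mentions(utterances, handle):
    """Filter the incoming utterance stream for lines naming `handle`."""
    for line in utterances:
        if handle in line:
            yield line

spoken = []
say = spoken.append   # stand-in for piping each match to the `say` utility

for line in mentions(["morning all", "tonyg: are the dishes done yet?"], "tonyg"):
    say(line)

print(spoken)   # ['tonyg: are the dishes done yet?']
```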

My personal computing system can’t do that right now. Squeak Smalltalk could do it, in theory, but it’s not yet ready for use as my main system for a number of reasons beyond those given in the linked article.

What's a fairer way of high-speed trading?

High-speed trading has been growing out of control in recent years—or at least, that’s what it looks like from here. (I don’t know anyone working in that sector of the industry at the moment.) Articles like this, this and this discuss the benefits and problems associated with practices such as co-location.

Assuming, for the moment, that it’s unfair to permit such low-latency decision-making, how could a system be set up to keep bandwidth high, but make latency much less important?

What about saying “these trades will settle at an instant chosen uniformly at random from the next three seconds”? Would that make people unable to keep their latencies ultra-low more able to stay competitive in the market?
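The proposed rule is tiny to state precisely (a sketch; `settlement_instant` is an invented name): each submitted trade settles at an instant drawn uniformly from the following three seconds, so shaving microseconds off one's reaction time no longer guarantees priority at settlement.

```python
import random

def settlement_instant(submitted_at, window=3.0):
    """Settle at a uniformly random instant within `window` seconds of submission."""
    return submitted_at + random.uniform(0.0, window)

t = settlement_instant(100.0)
# 100.0 <= t <= 103.0, regardless of how quickly the order arrived
```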

Why isn't Smalltalk taught?

Introductory CS courses seldom use Smalltalk as the teaching language, preferring instead Scheme, Python, or Java, even though Smalltalk seems (to me at least) to have strong advantages over these languages. Why isn’t Smalltalk taught?

At least one reason might be that there’s no self-contained syntax: you can’t write down whole programs. In situations like exams and quizzes, where students are asked to longhand program text solutions to problems, Smalltalk doesn’t fit well, requiring a browser as it does.

Source code in files means that your programs aren’t atomised as they are when viewed in a browser. I wonder if GNU Smalltalk’s syntax for file-based Smalltalk programs would make sense in a teaching context?

The Origins of Flow Control and Acknowledgements

Erik Meijer recently explained the duality of Enumeration and Observation.1 Subscribing to a feed of items is dual to retrieving an enumerator and going through the items.

Reading his article, I was surprised to realise that there’s a corresponding symmetry between credit-based flow control and message acknowledgement!

Generalising the idea of polling (enumeration) yields flow control, and generalising subscription (observation) yields acknowledgement.

Flow control and Acknowledgement

When designing network protocols such as TCP or AMQP, two problems arise that are both crucial to the robustness and performance of the system being designed:

  • How do we avoid flooding a receiver with messages? and
  • How do we know when a receiver successfully received a message?

The solutions to these problems are flow control and acknowledgements, respectively. The rest of this article will show how these solutions arise from plain old enumeration and observation. I’ll briefly comment on their manifestations in TCP and AMQP toward the end.

Enumeration and Observation

These message sequence diagrams illustrate the use of .NET’s IEnumerator<T> and IObserver<T> as described in Erik’s article. In each diagram, the application is working with some collection containing two values, 11 and 22.

It’s hard to see the symmetry by looking at the diagrams above. The design of .NET’s IEnumerator<T> API is obscuring it by separating MoveNext from Current.

To really see the symmetry properly, we’ll move to a language where we can use algebra to simplify the problem: SML.

SML types for enumeration and observation

The SML type corresponding to IEnumerator<T> might be written as the following (ignoring exceptions, since they aren’t really relevant to the problem at hand): 2

type 'a enumerator = (unit -> bool) * (unit -> 'a)

This is a pair of methods, (MoveNext, Current), corresponding directly to the methods in .NET’s IEnumerator<T> interface. Since it doesn’t make sense to be able to call Current if MoveNext has returned false, by applying a little domain-knowledge and a little algebra we see that this definition is equivalent to the simpler:

type 'a enumerator = unit -> 'a option

We have combined the two methods together, eliminating the possibility of making a mistake. A NONE result from the single combined method indicates that there are no more values available, and a SOME v result carries the next available value v if there is one.

Translating Erik’s almost-mechanically-derived IObserver<T> interface to SML, again ignoring exceptions, and following the 'a option rewrite we used on enumerator, we arrive at:

type 'a observer = 'a option -> unit

The duality is crystal clear at this point. An enumerator has type unit -> 'a option, where an observer has type 'a option -> unit. Enumerators act as senders of information, and Observers act as receivers.

Here’s a message sequence diagram, equivalent to the one above, but using the SML types instead of the .NET interfaces:

Now the symmetry really stands out.

On the left, we can see that when enumerating, an information-free () message indicates that the receiver is ready for the next piece of information from the sender. This is (credit-based) flow control! 3

On the right, we can see that when observing, an information-free () message indicates that the receiver has finished processing the most-recently-sent piece of information from the sender. This is acknowledgement! 4

Enumerators work by the receiver issuing a unit of flow-control credit, and the sender then sending a single value. Observers work by the sender sending a single value, and the receiver then issuing a single acknowledgement message.
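That hand-off can be made concrete (Python standing in for the SML, with `("SOME", v)` and `"NONE"` modelling `'a option`; all names here are invented for illustration):

```python
NONE = "NONE"   # models SML's NONE; ("SOME", v) models SOME v

def enumerator_from_list(items):
    """unit -> 'a option: each call is a unit of credit; returns the next value."""
    it = iter(items)
    def pull():
        try:
            return ("SOME", next(it))
        except StopIteration:
            return NONE
    return pull

def recording_observer(log):
    """'a option -> unit: each return from `push` is one acknowledgement."""
    def push(msg):
        log.append(msg)
    return push

def pump(pull, push):
    """Drive an enumerator into an observer until the stream ends."""
    while True:
        msg = pull()   # issue credit, receive a value (or end of stream)
        push(msg)      # deliver; the return is the acknowledgement
        if msg == NONE:
            break

log = []
pump(enumerator_from_list([11, 22]), recording_observer(log))
print(log)   # [('SOME', 11), ('SOME', 22), 'NONE']
```

The `pump` is where the two information-free messages live: each iteration spends one `()` of credit on the pull side and collects one `()` of acknowledgement on the push side.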

Pipelining credit and acknowledgement messages

Let’s take this a step closer to a real network protocol by introducing pipelining of messages. It’s important to be able to reduce latency by decoupling the issuing of a request from the processing of that request. Here we drop the notational distinction between requests and responses, drawing just one kind of arrow for both kinds of message.

All we’ve done here is moved all the “requests” up to the top of each sequence, and all the “responses” to the bottom.

Remember that on the left, () indicates an issue of credit, so that the sender is permitted to send an additional piece of information to the receiver, while on the right, () indicates acceptance of a preceding piece of information by the receiver.

Batching pipelined credit and acknowledgement messages

We can save some bandwidth by applying some of the algebraic tricks we used earlier, looking for a way of compressing those sequences of () messages into informationally equivalent batch credit-issues and acknowledgement messages.

Recall that our enumerator function takes a single () and returns an 'a option, which means either the next available datum or the end of stream:

type 'a enumerator = unit -> 'a option

Switching from SML to abstract type algebra for a moment, let’s model pipelined requests and responses as lists. The algebraic types involved (don’t forget this introduction to type algebra) are

RequestPipeline = 1 + (1     * RequestPipeline)
ReplyPipeline   = 1 + (Value * ReplyPipeline)

These are equivalent to

RequestPipeline = 1 + 1*1   + 1*1*1       + 1*1*1*1           + ...
ReplyPipeline   = 1 + Value + Value*Value + Value*Value*Value + ...

Now, there’s nothing sensible we can do to compress the reply pipeline in the absence of special knowledge about the sequence of values being transferred, but the request pipeline looks very repetitive. It is, in fact, a unary representation of natural numbers. We can use a binary representation instead to save enormous amounts of bandwidth.

Analogous reasoning lets us compress the stream of acknowledgements leaving an observer, so let’s use natural numbers to model our batched credit-issue messages and acknowledgement messages:

Notice how in each case we’ve replaced three separate information-free messages with a single message carrying a natural number, 3. Pipelining and batching credit and acknowledgement messages improves both latency of delivery and usage of available bandwidth.
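A toy model of the batching (Python, invented names; note that real protocols grant credit ahead of need rather than waiting for the sender to run dry, so this is a deliberate simplification):

```python
def transfer(values, batch=3):
    """Deliver `values` under batched credit: the receiver grants `batch`
    units with a single numeric message whenever the sender runs out."""
    credit = 0
    granted = []     # the credit-grant messages that crossed the wire
    delivered = []
    for v in values:
        if credit == 0:
            granted.append(batch)   # one message carrying the number 3...
            credit += batch         # ...instead of three separate () messages
        delivered.append(v)         # sender may transmit: credit is available
        credit -= 1
    return granted, delivered

granted, delivered = transfer([1, 2, 3, 4, 5])
print(granted)   # [3, 3]: two grant messages instead of five ()-messages
```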

Analyzing TCP and AMQP

TCP has an elegant flow- and retransmission-control mechanism, the sliding window. Each byte in a TCP stream is separately numbered so that it can be referred to unambiguously. Each TCP segment contains an ACK field that tells a sender which bytes a receiver has seen so far: in other words, the ACK field acknowledges received data. Each segment also contains a Window Size field that tells the sender how much more data it is currently permitted to send: in other words, the amount of credit the receiver is willing to issue.

Looking at things through the lens of the enumerator/observer duality, it’s clear that TCP makes use of both interfaces simultaneously, since it is both issuing flow credit and acknowledging data receipt. There is some generalisation of both interfaces waiting to be exposed! I hope to write more on this in a future article.

Another interesting aspect of TCP is the way that it numbers its FIN segments as part of the transmitted data stream, ensuring reliable 5 notification of stream termination. This corresponds to the acknowledgement of the final NONE in the sequence diagrams above. You might think that the acknowledgement of the end of a stream was optional, but if you omit it, the reliable-delivery characteristics of the protocol change quite substantially: you’re forced to either introduce an out-of-band reliable shutdown subprotocol, or to fairly drastically change the way timeouts are used to decide how long to retain the necessary state.

AMQP also includes something similar to credit-based flow control and message acknowledgements. Flow credit can be arranged by using a Basic.Qos command before transmission begins, and acknowledgement of received messages is done using Basic.Ack commands. However, the protocol combines replenishment of credit with acknowledgement of data: Basic.Ack has a dual role (pardon the pun). Ideally replenishment of flow credit would be done separately from acknowledgement of data, with a command for each task, mirroring the way TCP uses a separate header field for each function.

Besides AMQP’s core subscription system, centred around Basic.Consume, there is a less-commonly-used polling-based method of transferring messages, Basic.Get. Basic.Get is very close to being an instance of the simple SML 'a enumerator function type introduced above, and doesn’t include any way of pipelining or batching requests or responses.

Conclusions and future work

The duality between IEnumerator<T> and IObserver<T> extends into a network protocol setting. By looking at it in a network protocol context, we can see the natural emergence of credit-based flow control on the one hand, and acknowledgement on the other. With this model of a data transfer protocol in mind, we can examine existing protocols such as TCP and AMQP to see how they map to the operations we have identified.

There are one or two remaining loose ends I’m interested in following up. Firstly, what is the generalisation of both interfaces, so evident in both TCP and AMQP, like? Whatever it is, it could turn out to be a good candidate for a generalised model of a flow-controlled reliable information stream. Secondly, both IEnumerator<T> and IObserver<T> are inherently stateful approaches to data streaming. What do we see if we switch to pure-functional models of enumeration and observation? What are the analogous network protocol models like? I suspect that the answers here will lead in the direction of more RESTful models of message transmission such as RestMS and ReverseHTTP.

  1. Erik’s article is available as HTML and also as a PDF, which is the form I originally saw it in. Thanks to @hylomorphism for pointing me toward the article and thus kicking off this interesting train of thought. 

  2. SML code illustrating the use of these enumerator and observer types can be found here

  3. There are other kinds of flow control besides credit-based: XON/XOFF and RTS/CTS signalling will be familiar to those that have used serial links and modems. 

  4. Specifically, transfer of responsibility. The receiver of a message, by sending () to the sender of the message, indicates that it has taken responsibility for the message, and that the sender no longer needs to concern itself with it or attempt further redelivery. There are other kinds of acknowledgement scheme: NAKs, for example, are used in multicast applications to indicate a gap in a received data stream. 

  5. I should really put the word “reliable” in scare quotes… 

Announcers are Networks

Recently I’ve been using Announcements, a framework for a particular style of event pub/sub within a Smalltalk image. Squeak Smalltalk has a handful of different implementations of the same pub/sub pattern in the default image, and Announcements is a nicely designed generalisation of these.

But why would one want to have a particular implementation of such a common design pattern as pub/sub baked in to the image?

One way of thinking about it, that I’m becoming more and more convinced by, is this:

  • If Objects are abstract representations of computers, then
  • Announcers are abstract representations of networks.

The communication channel between two objects is normally a regular message send. Regular message sends provide Plain Old RPC (with Exceptions). Sometimes, however, you want one or more of the following things to happen:

  • a message to be delivered to more than one recipient
  • a message to be delivered to an arbitrary or unknown number of recipients
  • a message to be delivered asynchronously, or without reply
  • a message to be delivered across a real network to some remote object

Combinations of the above are useful and interesting too.

It’s in situations like these that having a first-class representation of the medium of communication between objects becomes important. And that’s just what Announcers are in the Announcement framework.
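A minimal sketch of the idea in Python (invented names throughout; this is not the actual Announcements API, just its shape): the Announcer is a first-class object standing between senders and any number of receivers.

```python
class Announcer:
    """A first-class communication medium: subscribers register interest in
    announcement classes; announcing delivers to every matching subscriber."""
    def __init__(self):
        self.subscriptions = []   # (announcement_class, handler) pairs

    def when(self, announcement_class, handler):
        self.subscriptions.append((announcement_class, handler))

    def announce(self, announcement):
        for announcement_class, handler in self.subscriptions:
            if isinstance(announcement, announcement_class):
                handler(announcement)

class UserConnected:
    def __init__(self, name):
        self.name = name

seen = []
a = Announcer()
a.when(UserConnected, lambda event: seen.append(event.name))
a.announce(UserConnected("tonyg"))   # delivered to zero or more recipients
print(seen)   # ['tonyg']
```

Because delivery is mediated by the Announcer rather than by an ordinary message send, the number of recipients — zero, one, or many — is invisible to the announcing object.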

One of the too-many things I’m thinking about at the moment is how to generalise Announcers to take into consideration more than just the first two of the four points listed above. Designing and implementing the basic facility seems straightforward, but one runs fairly quickly into issues of concurrency and fault-tolerance.