BitSyntax for Smalltalk

Part of a series: #squeak-phone

Hand-written binary parsing/unparsing sucks

As I’ve been working on a mobile Smalltalk system, I’ve found myself needing to decode and encode a number of complex telephony packet formats [1]. One example is the following incoming SMS delivery message, which carries an SMS-DELIVER TPDU in GSM 03.40 format, which in turn contains seven-bit (!) GSM 03.38-encoded text:

02 01 ffff
01 28 07911356131313f3
04 0b911316325476f8 000002909021044480 0ec67219644e83cc6f90b9de0e01

It turns out that a plethora of such binary formats is needed to get a working cellphone.
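
For a flavour of the hand-rolled approach, here is a minimal sketch that decodes just the first six octets of the message above. The field names are borrowed from the SmsIncoming specification shown later in this post; the code itself is illustrative only and is not part of BitSyntax.

```smalltalk
"Hand-rolled decoding of the first six octets: 02 01 ffff 01 28.
 Illustrative sketch only; not taken from the BitSyntax package."
| bytes in msgType type simIndex id payloadLength |
bytes := #[16r02 16r01 16rFF 16rFF 16r01 16r28].
in := ReadStream on: bytes.
msgType := in next.                           "16r02"
type := in next.                              "16r01"
simIndex := in next + (in next bitShift: 8).  "two octets, little-endian: 16rFFFF"
id := in next.                                "16r01"
payloadLength := in next.                     "16r28, i.e. 40 octets of payload follow"
```

Everything after those six octets (nested addresses, bit flags, packed septets) needs the same treatment, plus matching code for the encoding direction.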

I started off hand-rolling them, but it quickly became too much, so I borrowed liberally (read: stole) from Erlang, and implemented BitSyntax for Smalltalk. (After all, I am already using Erlang-influenced actors for the Smalltalk system daemons!)

I’ve done this before, for Racket, and there are plenty of other similar projects for e.g. JavaScript and OCaml.

Every language needs a BitSyntax, it seems!

What does BitSyntax do?

The BitSyntax package includes a BitSyntaxCompiler class which interprets BitSyntaxSpecification objects, producing reasonably efficient Smalltalk for decoding and encoding binary structures, mapping from bytes to instance variables and back again.

The interface to the compiled code is simple. After compiling a BitSyntaxSpecification for the data format above, we can analyze the example message straightforwardly:

parsedMessage := SmsIncoming loadFrom: (ByteArray fromHex:
    '02 01 ffff
     01 28 07911356131313f3
     04 0b911316325476f8 000002909021044480 0ec67219644e83cc6f90b9de0e01')

and, if we wish, serialize it again:

serializedBytes := ByteArray streamContents: [:w | parsedMessage saveTo: w]

How does it work?

Syntax specifications are built using an embedded domain-specific language (EDSL).

For example, for the above data format, we would supply the following spec for class SmsIncoming:

        (1 byte >> #msgType),
        (1 byte >> #type),
        (2 bytesLE >> #simIndex),
        (1 byte >> #id),
        ((1 byte storeTemp: #payloadLength expr: 'payload size'), 'payloadLength' bytes)
            >>> #payload <<<
                (SmsAddress codecCountingOctets >> #smscAddress),
                (SmsPdu codecIncoming >> #tpdu))

along with appropriate specs for SmsAddress and SmsPdu (omitted for space reasons here) and the following for the SmsPdu subclass SmsPduDeliver:

        (1 bit boolean >> #replyPath),
        (1 bit boolean >> #userDataHeaderIndicator),
        (1 bit boolean >> #statusReportIndication),
        (2 bits),
        (1 bit boolean >> #moreMessagesToSend),
        (2 bits = 0),

        (SmsAddress codecCountingSemiOctets >> #originatingAddress),
        (1 byte >> #protocolIdentifier),
        (1 byte >> #dataCodingScheme),
        ((7 bytes
                transformLoad: [:v | 'self class decodeSmscTimestamp: ', v]
                save: [:v | 'self class encodeSmscTimestamp: ', v])
            >> #serviceCentreTimeStamp),
        (((1 byte >> #itemCount)
            transformLoad: [:v | 'self userDataOctetsFor: ', v]
            save: [:v | 'itemCount'])
                storeTemp: #userDataLength expr: 'userData size'),
        (#userDataLength bytes >> #userData)

These are non-trivial examples; the simple cases are simple, and the complex cases are usually possible to express without having to write code by hand. The EDSL is extensible, so more combinators and parser types can be easily added as the need arises.
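
To make the last two fields of the SmsPduDeliver spec concrete: with the default seven-bit alphabet, the length octet counts septets, so it has to be converted to an octet count before the packed user data can be read. The following hand-written sketch shows that arithmetic and the septet unpacking, applied to the user data of the example message (length octet 0e, followed by 13 packed octets). It is not BitSyntax's own implementation. The conversion is only an assumption about what a helper such as userDataOctetsFor: has to compute in the seven-bit case, and treating each septet as an ASCII code is a shortcut that holds only for characters where GSM 03.38 and ASCII coincide.

```smalltalk
"Sketch: unpack GSM 03.38 seven-bit text by hand.
 Assumes the default alphabet and no user data header."
| packed septetCount octetCount acc bits remaining text |
packed := #[16rC6 16r72 16r19 16r64 16r4E 16r83 16rCC
            16r6F 16r90 16rB9 16rDE 16r0E 16r01].
septetCount := 16r0E.                     "the length octet: 14 septets"
octetCount := septetCount * 7 + 7 // 8.   "round 98 bits up to 13 octets"

acc := 0.       "bits read so far but not yet emitted, low bits first"
bits := 0.      "number of valid bits in acc"
remaining := septetCount.
text := String streamContents: [:out |
    packed do: [:octet |
        acc := acc bitOr: (octet bitShift: bits).
        bits := bits + 8.
        [bits >= 7 and: [remaining > 0]] whileTrue: [
            "Shortcut: treat the septet as an ASCII code."
            out nextPut: (Character value: (acc bitAnd: 16r7F)).
            acc := acc bitShift: -7.
            bits := bits - 7.
            remaining := remaining - 1]]].
text   "'Fee fi fo fum!'"
```

The final six bits of the last octet are padding, which is why the septet count, and not the octet count, decides where the text ends.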

How do I get it?

Load it into an up-to-date trunk Squeak image:

(Installer squeaksource project: 'BitSyntax')
    install: 'BitSyntax-Core';      "the compiler and EDSL"
    install: 'BitSyntax-Examples';  "non-trivial examples"
    install: 'BitSyntax-Help'.      "user guide and reference"

You can also visit the project page directly.

The package BitSyntax-Help contains an extensive manual written for Squeak’s built-in documentation system.


  1. Telephony packet formats are particularly squirrelly in places. Seven-bit text encoding? Really? Multiple ways to encode phone numbers. Lengths sometimes in octets, sometimes in half-octets, sometimes in septets (!) with padding implicit. Occasional eight-bit data shoehorned into a septet-based section of a message. Bit fields everywhere. Everything is an acronym, cross-referenced to yet another document. Looking at the 3GPP and GSM specs gave me flashbacks to the last time I worked in telephony, nearly 20 years ago…