Ohm is “a library and domain-specific language for parsing and pattern matching”. In this post, I use it as a language preprocessor. I’ll build a simple compiler for ES5 extended with a new kind of for loop, using Ohm and the ES5 grammar included with it.
All the code in this post is available in a GitHub repo.
Our toy extension: “for five”
We will add a “for five” statement to ES5, which will let us write programs like this:
The new construct simply runs its body five times in a row, binding a loop variable in the body. Running the program above through our compiler produces:
Extending the ES5 grammar
We write our extension to the ES5 grammar in a new file
Let’s take this a piece at a time. First of all, a declaration of the form Name <: ES5 tells Ohm what the new grammar should be called, and that it inherits from a grammar called ES5. Next, the grammar extends the existing ES5 grammar’s IterationStatement nonterminal with a new production that will be called IterationStatement_for5_named.
Finally, we define two new nonterminals as convenient shorthands for parsing the two new keywords, and augment the existing keyword definition:
There are three interesting points to be made about keywords:
First of all, making something a keyword rules it out as an identifier. In our extended language, writing var five = 5 is a syntax error. Define new keywords with care!
We make sure to reject input tokens that have our new keywords as a prefix by defining them as their literal text followed by anything that cannot be parsed as a part of an identifier, ~identifierPart. That way, the compiler doesn’t get confused by, say, five_more, which remains a valid identifier.
Because we extend the existing keyword rule, tooling such as syntax highlighters can automatically take advantage of our extension if it is given our extended grammar.
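The prefix check described in the second point can be illustrated outside Ohm. The following is a minimal JavaScript sketch, not Ohm code: it approximates "literal text followed by something that is not an identifier part" with a regular-expression negative lookahead. The function name matchesKeyword is hypothetical, and the ASCII-only character class is an assumed approximation of identifierPart; the real ES5 rule also admits Unicode characters.

```javascript
// Hypothetical helper, not part of Ohm: reports whether `input` starts with
// `keyword` as a whole token, that is, not followed by an identifier character.
// Assumption: `keyword` contains only letters, so it needs no regex escaping.
// Assumption: [A-Za-z0-9_$] is an ASCII-only stand-in for ES5's identifierPart.
function matchesKeyword(keyword, input) {
  var pattern = new RegExp('^' + keyword + '(?![A-Za-z0-9_$])');
  return pattern.test(input);
}

console.log(matchesKeyword('five', 'five as x')); // true
console.log(matchesKeyword('five', 'five_more')); // false
console.log(matchesKeyword('as', 'assert(ok)'));  // false
```

The negative lookahead consumes no input, which mirrors the way the ~ operator in the grammar only checks what follows the keyword.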
Translating source code using the new grammar
First, require the ohm-js NPM module and its included ES5 grammar:
Next, load our extended grammar from its definition file and compile it. When we compile the grammar, we pass in a namespace that makes the ES5 grammar available under the name our grammar expects, ES5:
Finally, we define the translation from our extended language to plain ES5. To do this, we extend a semantic function, adding a method for each new production rule. Ohm automatically uses defaults for rules not mentioned in our extension.
Each parameter to the IterationStatement_for5_named method is a syntax tree node corresponding positionally to one of the tokens in the definition of the parsing rule. Accessing the asES5 attribute of a syntax tree node computes its translated source code. This is done with recursive calls to the modifiedSource attribute where required.
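To make the shape of such a translation concrete, here is a rough sketch in plain JavaScript that does not use Ohm’s API. The hypothetical function translateForFive stands in for the body of a method like IterationStatement_for5_named: it takes the loop variable’s name and the already-translated source of the loop body as strings, and returns plain ES5 source. It assumes the loop variable counts from 0 to 4, which the description above does not specify.

```javascript
// Hypothetical stand-in for the translation method. In the real compiler the
// arguments would be syntax tree nodes, and the strings would come from
// their translated-source attributes.
function translateForFive(identifierSource, bodySource) {
  // Assumption: the loop variable takes the values 0, 1, 2, 3, 4.
  return 'for (var ' + identifierSource + ' = 0; ' +
         identifierSource + ' < 5; ' +
         identifierSource + '++) ' + bodySource;
}

var es5Source = translateForFive('x', '{ seen.push(x); }');
console.log(es5Source); // for (var x = 0; x < 5; x++) { seen.push(x); }

// Run the generated code to check that the body executes five times.
var seen = [];
new Function('seen', es5Source)(seen);
console.log(seen.join(',')); // 0,1,2,3,4
```

Building the output by string concatenation keeps the sketch short; the keyword tokens carry no information of their own, so only the identifier and the body contribute to the result.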
Our compiler is, at this point, complete. To use it, we need code to feed it input and print the results:
This style of syntactic extension is quite coarse-grained: we must translate whole compilation units at once, and must specify our extensions separately from the code making use of them. There is no way of adding a local syntax extension scoped precisely to a block of code that needs it (known to Schemers as let-syntax). For that, sweet.js offers a finer-grained style of syntax extension than the one explored in this post.
Mention of sweet.js leads me to the thorny topic of hygiene. Ohm is a parsing toolkit. It lets you define new concrete syntax, but doesn’t know anything about scope, or about how you intend to use identifiers. After all, it can be used for languages that don’t necessarily even have identifiers. So when we write extensions in the style I’ve presented here, we must write our translations carefully to avoid unwanted capture of identifiers. This is a tradeoff: the broad generality of Ohm’s parsing in exchange for less automation in identifier handling.
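As an illustration of the capture problem, consider a hypothetical translation that introduces a hidden counter variable of its own. The sketch below is plain JavaScript with no Ohm involved; translateWithCounter, freshName, and countRuns are names invented for this example. If the translation hard-codes a name such as i for its counter, a body that also uses i silently interferes with it. Picking a name that does not occur in the body avoids the clash.

```javascript
// Hypothetical helper: returns a name based on `base` that does not occur
// anywhere in `source`, by appending a numeric suffix until it is unused.
// The substring test is deliberately conservative: it may reject names that
// would actually be safe, but never accepts one the body mentions.
function freshName(base, source) {
  var n = 0;
  var candidate = base;
  while (source.indexOf(candidate) !== -1) {
    n += 1;
    candidate = base + n;
  }
  return candidate;
}

// Hypothetical translation: run `bodySource` five times using a hidden counter.
function translateWithCounter(bodySource, counterName) {
  return 'for (var ' + counterName + ' = 0; ' + counterName + ' < 5; ' +
         counterName + '++) ' + bodySource;
}

// Runs generated source and reports how many times the body executed.
function countRuns(source) {
  var runs = [];
  new Function('runs', source)(runs);
  return runs.length;
}

var body = '{ var i = 10; runs.push(i); }';

// Unhygienic: the body's own `i` overwrites the hidden counter.
var naive = translateWithCounter(body, 'i');
// More careful: choose a counter name that the body does not mention.
var careful = translateWithCounter(body, freshName('i', body));

console.log(countRuns(naive));   // 1
console.log(countRuns(careful)); // 5
```

This is only a textual workaround. It knows nothing about scope, which is exactly the kind of bookkeeping a hygienic macro system automates.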
Ohm’s extensible grammars let us extend any part of the language, not just statements or expressions. We can specify new comment syntax, new string syntax, new formal argument list syntax, and so on. Because Ohm is based on parsing expression grammars, it offers scannerless parsing. Altering or extending a language’s lexical syntax is just as easy as altering its grammar.
We have defined an Ohm-based compiler for an extension to ES5 syntax, using only a few lines of code. Each new production rule requires, roughly, one line of grammar definition, and a short method defining its translation into simpler constructs.
You can try out this little compiler, and maybe experiment with your own extensions, by cloning its GitHub repo.