Javascript syntax extensions using Ohm
Fri 5 Feb 2016 03:57 EST
Programming language designers often need to experiment with syntax for their new language features. When it comes to Javascript, we rely on language preprocessors, since altering a Javascript engine directly is out of the question if we want our experiments to escape the lab.
Ohm is “a library and domain-specific language for parsing and pattern matching.” In this post, I’m going to use it as a Javascript language preprocessor. I’ll build a simple compiler for ES5 extended with a new kind of for loop, using Ohm and the ES5 grammar included with it.
All the code in this post is available in a Github repo.
Our toy extension: “for five”
We will add a “for five” statement to ES5, which will let us write programs like this:
The new construct simply runs its body five times in a row, binding a loop variable in the body. Running the program above through our compiler produces:
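As a hypothetical illustration of the construct just described, the sketch below pairs a "for five" program (shown in a comment, since it is not valid ES5) with a plain ES5 loop that behaves the same way. The second keyword `as`, the variable name `x`, the loop body, and the choice of counting from 0 to 4 are all assumptions made for this example.

```javascript
// Hypothetical input in the extended language (not valid ES5, so commented):
//   for five as x { seen.push(x); console.log("iteration", x); }

// One plausible plain-ES5 equivalent. ASSUMPTION: the loop variable
// takes the values 0, 1, 2, 3, 4.
var seen = [];
for (var x = 0; x < 5; x++) {
  seen.push(x);
  console.log("iteration", x);
}
```

The body runs five times, and the loop variable is visible inside it on each pass.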
Extending the ES5 grammar
We write our extension to the ES5 grammar in a new file for5.ohm as follows:
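A minimal sketch of what such a grammar could look like is given here as a JavaScript string, so that it can later be handed to Ohm. The grammar name, the supergrammar, the `for5_named` case label, the `~identifierPart` lookahead and the `keyword` extension follow the description in this post. The second keyword `as` and the exact sequence of tokens in the loop rule are assumptions, and `for5GrammarSource` is a name invented for this example.

```javascript
// Sketch of a possible for5.ohm, held in a JavaScript string.
// ASSUMPTIONS: the second keyword `as` and the exact token sequence of the
// loop rule are illustrative guesses, not confirmed by the text.
var for5GrammarSource = [
  'For5 <: ES5 {',
  '  // New case of IterationStatement; the case label gives the',
  '  // production the name IterationStatement_for5_named.',
  '  IterationStatement += for five as identifier Statement  -- for5_named',
  '',
  '  // Keywords: literal text that is not followed by an identifier character.',
  '  five = "five" ~identifierPart',
  '  as = "as" ~identifierPart',
  '',
  '  // Reserve the new words so they can no longer be used as identifiers.',
  '  keyword += five | as',
  '}'
].join('\n');

console.log(for5GrammarSource);
```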
Let’s take this a piece at a time. First of all, the declaration For5 <: ES5 tells Ohm that the new grammar should be called For5, and that it inherits from a grammar called ES5. Next, the rule that begins IterationStatement += extends the existing ES5 grammar’s IterationStatement nonterminal with a new production that will be called IterationStatement_for5_named.
Finally, we define two new nonterminals as convenient shorthands for parsing the two new keywords, and augment the existing keyword definition:
There are three interesting points to be made about keywords:
- First of all, making something a keyword rules it out as an identifier. In our extended language, writing var five = 5 is a syntax error. Define new keywords with care!
- We make sure to reject input tokens that have our new keywords as a prefix by defining them as their literal text followed by anything that cannot be parsed as a part of an identifier, ~identifierPart. That way, the compiler doesn’t get confused by, say, fivetimes or five_more, which remain valid identifiers.
- By making sure to extend keyword, tooling such as syntax highlighters can automatically take advantage of our extension, if they are given our extended grammar.
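The prefix check in the second point can be imitated outside Ohm with a regular-expression negative lookahead. The sketch below is only an analogy, not Ohm code: it approximates ES5's identifierPart with ASCII letters, digits, `_` and `$`, and `keywordMatcher` is a helper name invented here.

```javascript
// Rough ASCII-only approximation of ES5's identifierPart (an assumption;
// the real definition also admits many Unicode characters).
var IDENT_PART = '[A-Za-z0-9_$]';

// Build a matcher for `word` at the start of the input, rejecting the match
// when an identifier character follows immediately (negative lookahead).
function keywordMatcher(word) {
  return new RegExp('^' + word + '(?!' + IDENT_PART + ')');
}

var five = keywordMatcher('five');

console.log(five.test('five as x'));  // true: a space follows the keyword
console.log(five.test('fivetimes'));  // false: still an ordinary identifier
console.log(five.test('five_more'));  // false: still an ordinary identifier
```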
Translating source code using the new grammar
First, require the ohm-js NPM module and its included ES5 grammar:
Next, load our extended grammar from its definition in for5.ohm, and compile it. When we compile the grammar, we pass in a namespace that makes the ES5 grammar available under the name our grammar expects, ES5:
Finally, we define the translation from our extended language to plain ES5. To do this, we extend a semantic function, modifiedSource, adding a method for each new production rule. Ohm automatically uses defaults for rules not mentioned in our extension.
Each parameter to the IterationStatement_for5_named method is a syntax tree node corresponding positionally to one of the tokens in the definition of the parsing rule. Accessing the asES5 attribute of a syntax tree node computes its translated source code. This is done with recursive calls to the modifiedSource attribute where required.
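To make the positional correspondence concrete, here is a minimal sketch of such a method, run against hand-made stand-ins rather than real Ohm nodes. It assumes the rule consists of five tokens (for, five, a hypothetical second keyword as, an identifier, and a statement) and that the loop variable counts from 0 to 4. `fakeNode` and `for5Actions` are names invented for this example.

```javascript
// Hypothetical stand-in for a syntax tree node: only the asES5 attribute
// read by the action below is modelled.
function fakeNode(translatedSource) {
  return { asES5: translatedSource };
}

// Action dictionary keyed by production name. One parameter per token of
// the ASSUMED rule `for five as identifier Statement`.
var for5Actions = {
  IterationStatement_for5_named: function (_for, _five, _as, id, body) {
    var v = id.asES5; // translated name of the loop variable
    return 'for (var ' + v + ' = 0; ' + v + ' < 5; ' + v + '++) ' + body.asES5;
  }
};

var translated = for5Actions.IterationStatement_for5_named(
  fakeNode('for'), fakeNode('five'), fakeNode('as'),
  fakeNode('x'), fakeNode('{ console.log(x); }')
);
console.log(translated);
// prints: for (var x = 0; x < 5; x++) { console.log(x); }
```

The keyword parameters are ignored. Only the identifier and the body contribute to the output, and each is translated before being spliced into the loop.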
Our compiler is, at this point, complete. To use it, we need code to feed it input and print the results:
That’s it!
Discussion
This style of syntactic extension is quite coarse-grained: we must translate whole compilation units at once, and must specify our extensions separately from the code making use of them. There is no way of adding a local syntax extension scoped precisely to a block of code that needs it (known to Schemers as let-syntax).
For Javascript, sweet.js offers a more Schemely style of syntax extension than the one explored in this post.
Mention of sweet.js leads me to the thorny topic of hygiene. Ohm is a parsing toolkit. It lets you define new concrete syntax, but doesn’t know anything about scope, or about how you intend to use identifiers. After all, it can be used for languages that don’t necessarily even have identifiers. So when we write extensions in the style I’ve presented here, we must write our translations carefully to avoid unwanted capture of identifiers. This is a tradeoff: the broad generality of Ohm’s parsing in exchange for less automation in identifier handling.
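The capture problem can be shown with plain string-building translations, without Ohm. The sketch below imagines a variant of the "for five" translation that introduces its own helper counter. The naive version always calls the helper `i`, which clashes with a user variable of the same name, while the more careful version invents a name that does not occur in the user's code. All function names here are hypothetical, and the textual freshness check is a crude stand-in for real hygiene.

```javascript
// Naive translation (hypothetical): always counts with a helper named `i`.
function naiveTranslate(id, body) {
  return 'for (var i = 0; i < 5; i++) { var ' + id + ' = i; ' + body + ' }';
}

// More careful translation: invent a helper name that does not occur in the
// user's code. This is a crude textual check, not real hygiene.
var freshCounter = 0;
function freshName(userText) {
  var name;
  do {
    name = '__for5_' + (freshCounter++);
  } while (userText.indexOf(name) !== -1);
  return name;
}
function carefulTranslate(id, body) {
  var tmp = freshName(id + ' ' + body);
  return 'for (var ' + tmp + ' = 0; ' + tmp + ' < 5; ' + tmp + '++) { var ' +
    id + ' = ' + tmp + '; ' + body + ' }';
}

// Wrap a translated loop in a program whose body refers to its own `i`.
function run(translate) {
  var program =
    'var i = 100; var out = [];\n' +
    translate('x', 'out.push(i + x);') + '\n' +
    'return out;';
  return new Function(program)();
}

console.log(run(naiveTranslate));   // [ 0, 2, 4, 6, 8 ]: the user's i was captured
console.log(run(carefulTranslate)); // [ 100, 101, 102, 103, 104 ]
```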
Ohm’s extensible grammars let us extend any part of the language, not just statements or expressions. We can specify new comment syntax, new string syntax, new formal argument list syntax, and so on. Because Ohm is based on parsing expression grammars, it offers scannerless parsing. Altering or extending a language’s lexical syntax is just as easy as altering its grammar.
Conclusion
We have defined an Ohm-based compiler for an extension to ES5 syntax, using only a few lines of code. Each new production rule requires, roughly, one line of grammar definition, and a short method defining its translation into simpler constructs.
You can try out this little compiler, and maybe experiment with your own extensions, by cloning its Github repo.