Javascript syntax extensions using Ohm (eighty-twenty news)

Programming language designers often need to experiment with syntax for their new language features. When it comes to Javascript, we rely on language preprocessors, since altering a Javascript engine directly is out of the question if we want our experiments to escape the lab.

Ohm is “a library and domain-specific language for parsing and pattern matching.” In this post, I’m going to use it as a Javascript language preprocessor. I’ll build a simple compiler for ES5 extended with a new kind of for loop, using Ohm and the ES5 grammar included with it.

All the code in this post is available in a Github repo.

Our toy extension: “for five”

We will add a “for five” statement to ES5, which will let us write programs like this:

for five as x { console.log("We have had", x, "iterations so far"); }

The new construct simply runs its body five times in a row, binding a loop variable in the body. Running the program above through our compiler produces:

for (var x = 0; x < 5; x++) { console.log("We have had", x, "iterations so far"); }

Extending the ES5 grammar

We write our extension to the ES5 grammar in a new file for5.ohm as follows:

For5 <: ES5 {
  IterationStatement += for five as identifier Statement  -- for5_named

  five = "five" ~identifierPart
  as = "as" ~identifierPart

  keyword += five
           | as
}

Let’s take this a piece at a time. First of all, the declaration For5 <: ES5 tells Ohm that the new grammar should be called For5, and that it inherits from a grammar called ES5. Next,

  IterationStatement += for five as identifier Statement  -- for5_named

extends the existing ES5 grammar’s IterationStatement nonterminal with a new production that will be called IterationStatement_for5_named.

Finally, we define two new nonterminals as convenient shorthands for parsing the two new keywords, and augment the existing keyword definition:

five = "five" ~identifierPart
as = "as" ~identifierPart

keyword += five
         | as

There are three interesting points to be made about keywords:

First, making something a keyword rules it out as an identifier. In our extended language, writing var five = 5 is a syntax error. Define new keywords with care!
Second, we define each new keyword as its literal text followed by ~identifierPart, a negative lookahead that succeeds only when the next character cannot be part of an identifier. This rejects input tokens that merely have one of our keywords as a prefix, so the compiler doesn’t get confused by, say, fivetimes or five_more, which remain valid identifiers.
Third, because we extend keyword, tooling such as syntax highlighters can automatically take advantage of our extension if it is given our extended grammar.
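
The prefix-rejection point can be illustrated outside Ohm with a regular-expression negative lookahead, which plays roughly the same role as ~identifierPart. This is only an analogy, not how Ohm works internally: the character class below is a simplified, ASCII-only stand-in for the ES5 grammar’s identifierPart, and the helper name startsWithKeyword is made up for this sketch.

```javascript
// Rough analogy to `five = "five" ~identifierPart`, using a regex
// negative lookahead in place of Ohm's `~` operator.
// ASSUMPTION: [A-Za-z0-9_$] is a simplified, ASCII-only stand-in for
// identifierPart; the real ES5 rule also admits Unicode characters.
// ASSUMPTION: `keyword` contains only letters, so it needs no regex escaping.
function startsWithKeyword(keyword, input) {
  var pattern = new RegExp('^' + keyword + '(?![A-Za-z0-9_$])');
  return pattern.test(input);
}

console.log(startsWithKeyword('five', 'five as x')); // true
console.log(startsWithKeyword('five', 'fivetimes')); // false
console.log(startsWithKeyword('five', 'five_more')); // false
```

Without the lookahead, both fivetimes and five_more would be read as the keyword five followed by leftover text.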

Translating source code using the new grammar

First, require the ohm-js NPM module and its included ES5 grammar:

var ohm = require('ohm-js');
var ES5 = require('ohm-js/examples/ecmascript/es5.js');

Next, load our extended grammar from its definition in for5.ohm, and compile it. When we compile the grammar, we pass in a namespace that makes the ES5 grammar available under the name our grammar expects, ES5:

var fs = require('fs');
var path = require('path');

var grammarSource = fs.readFileSync(path.join(__dirname, 'for5.ohm')).toString();
var grammar = ohm.grammar(grammarSource, { ES5: ES5.grammar });

Finally, we define the translation from our extended language to plain ES5. To do this, we extend a semantic attribute, modifiedSource, adding a method for each new production rule. Ohm automatically uses defaults for rules not mentioned in our extension.

var semantics = grammar.extendSemantics(ES5.semantics);
semantics.extendAttribute('modifiedSource', {
  IterationStatement_for5_named: function(_for, _five, _as, id, body) {
    var c = id.asES5;
    return 'for (var '+c+' = 0; '+c+' < 5; '+c+'++) ' + body.asES5;
  }
});

Each parameter to the IterationStatement_for5_named method is a syntax tree node corresponding positionally to one of the tokens in the definition of the parsing rule. Accessing the asES5 attribute of a syntax tree node computes its translated source code. This is done with recursive calls to the modifiedSource attribute where required.

Our compiler is, at this point, complete. To use it, we need code to feed it input and print the results:

function compileExtendedSource(inputSource) {
  var parseResult = grammar.match(inputSource);
  if (parseResult.failed()) console.error(parseResult.message);
  return parseResult.succeeded() && semantics(parseResult).asES5;
}

That’s it!

> compileExtendedSource("for five as x { console.log(x); }");
'for (var x = 0; x < 5; x++) { console.log(x); }'
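
On a parse failure, compileExtendedSource logs the message and returns false, so callers have to check the return value. A possible variant, sketched here under the assumption that throwing is preferable in your setting, raises an error instead. The name compileOrThrow is hypothetical, and grammar and semantics are passed as parameters so the function does not depend on the variables defined earlier; it uses only the match, failed, message, and asES5 members already seen above.

```javascript
// Variant of compileExtendedSource that throws on a parse failure
// instead of logging and returning false.
// ASSUMPTION: `grammar` and `semantics` are the objects built earlier in
// this post; the function name compileOrThrow is hypothetical.
function compileOrThrow(grammar, semantics, inputSource) {
  var parseResult = grammar.match(inputSource);
  if (parseResult.failed()) {
    // Surface the parser's failure message to the caller as an Error.
    throw new Error(parseResult.message);
  }
  // On success, compute the translated ES5 source as before.
  return semantics(parseResult).asES5;
}
```

A caller would wrap compileOrThrow(grammar, semantics, source) in try/catch and report e.message on failure.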

Discussion

This style of syntactic extension is quite coarse-grained: we must translate whole compilation units at once, and must specify our extensions separately from the code making use of them. There is no way of adding a local syntax extension scoped precisely to a block of code that needs it (known to Schemers as let-syntax). For Javascript, sweet.js offers a more Schemely style of syntax extension than the one explored in this post.

Mention of sweet.js leads me to the thorny topic of hygiene. Ohm is a parsing toolkit. It lets you define new concrete syntax, but doesn’t know anything about scope, or about how you intend to use identifiers. After all, it can be used for languages that don’t necessarily even have identifiers. So when we write extensions in the style I’ve presented here, we must write our translations carefully to avoid unwanted capture of identifiers. This is a tradeoff: the broad generality of Ohm’s parsing in exchange for less automation in identifier handling.
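
As a sketch of what “carefully” can mean in practice, suppose a hypothetical variant of the statement had no “as x” clause, so that the translation had to invent its own counter variable. One simple, conservative approach is to pick a name that does not occur anywhere in the input text. The helpers freshName and translateUnnamedFor5 below are assumptions made for illustration and are not part of Ohm or of the compiler in this post. A textual check like this is much cruder than real hygiene: it avoids capture only by refusing any name that appears in the source at all.

```javascript
// Pick an identifier that does not occur anywhere in `source`.
// ASSUMPTION: a plain substring check is good enough for this sketch;
// it is conservative and knows nothing about scope.
function freshName(base, source) {
  var n = 0;
  var candidate = base;
  while (source.indexOf(candidate) !== -1) {
    n += 1;
    candidate = base + '$' + n;
  }
  return candidate;
}

// Hypothetical translation for an unnamed "for five { ... }" loop:
// the counter must not clash with any name the program already uses.
function translateUnnamedFor5(bodySource, wholeSource) {
  var c = freshName('_i', wholeSource);
  return 'for (var ' + c + ' = 0; ' + c + ' < 5; ' + c + '++) ' + bodySource;
}

var program = 'var _i = 10; for five { console.log(_i); }';
console.log(translateUnnamedFor5('{ console.log(_i); }', program));
// for (var _i$1 = 0; _i$1 < 5; _i$1++) { console.log(_i); }
```

Had the translation blindly used _i as its counter, the loop would have overwritten the program’s own _i, which is exactly the kind of unwanted capture described above.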

Ohm’s extensible grammars let us extend any part of the language, not just statements or expressions. We can specify new comment syntax, new string syntax, new formal argument list syntax, and so on. Because Ohm is based on parsing expression grammars, it offers scannerless parsing. Altering or extending a language’s lexical syntax is just as easy as altering its grammar.

Conclusion

We have defined an Ohm-based compiler for an extension to ES5 syntax, using only a few lines of code. Each new production rule requires, roughly, one line of grammar definition, and a short method defining its translation into simpler constructs.

You can try out this little compiler, and maybe experiment with your own extensions, by cloning its Github repo.