ANTLR via Clojure

ANTLR is a parser-generator for Java. Can you use it from Clojure? Sure. Would you want to? Maybe.

Here's how to do it, start to finish.

For the impatient among you, all of the code below is on github.

git clone git://github.com/briancarper/clojure-antlr-example.git

Setup

I'm going to use leiningen for this project. Let's make a new project called antlr-example.

$ lein new antlr-example

Now edit project.clj to tell lein to fetch ANTLR (and Swank, since I use Emacs). ANTLR is available in Maven Central, so leiningen can grab it for us.

(defproject antlr-example "1.0.0-SNAPSHOT"
  :description "ANTLR Clojure example"
  :dependencies [[org.clojure/clojure "1.2.0-beta1"]
                 [org.clojure/clojure-contrib "1.2.0-beta1"]
                 [org.antlr/antlr "3.2"]]
  :dev-dependencies [[swank-clojure "1.3.0-SNAPSHOT"]])

Then a simple lein deps will download all of the jars and all of ANTLR's dependencies for you. Handy. You end up with this:

antlr-example
├── classes
├── lib
│   ├── antlr-2.7.7.jar
│   ├── antlr-3.2.jar
│   ├── antlr-runtime-3.2.jar
│   ├── clojure-1.2.0-beta1.jar
│   ├── clojure-contrib-1.2.0-beta1.jar
│   ├── stringtemplate-3.2.jar
│   └── swank-clojure-1.3.0-20100502.112537-1.jar
├── project.clj
├── README
├── src
│   └── antlr_example
│       └── core.clj
└── test
    └── antlr_example
        └── core_test.clj

Interactive development? Kind of...

One weakness of ANTLR via Clojure is that development is no longer REPL-based. The ANTLR workflow is essentially a Java workflow:

  1. Write a grammar
  2. Compile the grammar into Java source code
  3. Compile the Java source into Java .class files

Until and unless someone writes a Clojure backend for ANTLR, so that ANTLR can directly produce Clojure source code (and it seems like it should be possible to do so), you're back to a write/compile/debug cycle. Joy!

This also means restarting your REPL every time you alter and recompile your grammar.

But the good thing is that there's ANTLR Works, a free GUI for writing ANTLR grammars. ANTLR Works has an interactive interpreter, which is nice, and it has a compiler/debugger, which is better. These let you test your grammar as you write it, by trying inputs and seeing the resulting AST in graphical form, which is as good as you could hope for. This is actually even a bit nicer than a plaintext REPL.

Plus, it has a nice editor for ANTLR code. Emacs was freaking out over the indentation on a few grammars I tried.

So a decent workflow might be to write and debug your grammar in ANTLR Works, then fire up Clojure afterwards to consume the parser.

A simple grammar

We're going to use a classic textbook example, simple arithmetic expressions, as a proof of concept.

For Java (hence Clojure) to find this code, it has to be named properly and it has to be put into the proper directory on CLASSPATH. I'm going to create the grammar file src/antlr_example/Expr.g. (My grammar file, Java source, and Clojure source will all be jumbled together in src. You can easily do it differently if this offends your sensibilities.)

The first line in the grammar names the grammar (and the classes we'll consume from Clojure):

grammar Expr;

Clojure being a s-exp based language, it might be nice to have ANTLR generate an AST. ANTLR has good support for this:

options {
    output=AST;
    ASTLabelType=CommonTree;
}

Helper tokens for the grammar:

tokens {
    PLUS = '+';
    MINUS = '-';
    MULT = '*';
    DIV = '/';
}

These next lines are important. We want to generate classes antlr_example.ExprLexer and antlr_example.ExprParser, so we have to set our code up in the proper package, antlr_example.

This code includes both a parser and lexer, so you have to set the package for both in ANTLR.

@header        {package antlr_example;}
@lexer::header {package antlr_example;}

And then the grammar itself, which is simple:

expr: term ((PLUS | MINUS)^ term)* ;
term: factor ((MULT | DIV)^ factor)* ;
factor: INT ;

INT :   '0'..'9'+ ;
WS  :   ( ' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;} ;

One thing to note is (PLUS | MINUS)^; the ^ here tells ANTLR to treat (PLUS | MINUS) as the parent of the node created for this rule. This means the term on either side will be children of this node. We do the same for MULT and DIV. This will give us a nice tree. Otherwise you end up with a flat list of tokens.

{$channel=HIDDEN;} for the whitespace rule tells ANTLR to ignore whitespace, also useful.

If you use ANTLR Works with this grammar, on the input 1 + 2 * 3 + 4, you can see that it works OK.

ANTLR

That was an awful lot of boilerplate and setup and ugly syntax though. See how much Clojure spoils you? But for a simple grammar, it's not that much code.

For a longer or more complex grammar, you might end up having to embed inline Java code into your ANTLR grammar. But again, that's not the end of the world.

Compile everything

Once you write your grammar (presumably in ANTLR Works), you can generate the Java code and compile it thusly (run from your project's base directory):

$ java -cp 'lib/*' org.antlr.Tool src/antlr_example/Expr.g
$ javac -cp 'lib/*' -d classes src/antlr_example/Expr{Lexer,Parser}.java

The first command should generate src/antlr_example/ExprLexer.java and src/antlr_example/ExprParser.java.

The second command should spew tons of class files into classes/. Leiningen and other tools generally include classes/ on your CLASSPATH, so that's a good place for them. Again putting them here is a fairly arbitrary decision on my part. You can put the source and class files anywhere, as long as the class files end up on your CLASSPATH when you start Clojure.

If you're really so lazy that you can't run these two commands, you could use ANTLR Works' built-in commands for compiling your grammar. The only way I could find to invoke it was by starting a Debug session, which compiled my code as a side-effect. And it spews files into places I don't want them, so I like the command line version better.

You could probably also set up a leiningen task for this if you're re-running these commands a lot, or configure your build tool of choice to do the same. I didn't bother.

Now (re)start your REPL and let's write some Clojure.

Clojure code

It's pretty easy to consume an ANTLR parser from Clojure. I'm putting the code below into src/antlr_example/core.clj

First import the classes. You probably need some ANTLR helper classes to set up the parser; you also need the classes you just generated.

(ns antlr-example.core
  (:import (org.antlr.runtime ANTLRStringStream
                              CommonTokenStream)
           (antlr_example ExprLexer ExprParser)))

Here's a function to parse a string using our new parser class:

(defn parse-expr [s]
  (let [lexer (ExprLexer. (ANTLRStringStream. s))
        tokens (CommonTokenStream. lexer)
        parser (ExprParser. tokens)]
    (.expr parser)))

.expr is the name of the top-level rule we want to invoke from our grammar. The rest is boilerplate; just plug in the proper class names.

So let's see what we've got.

user> (in-ns 'antlr-example.core)
#<Namespace antlr-example.core>
antlr-example.core> (parse-expr "1 + 2 * 3 + 4")
#<expr_return antlr_example.ExprParser$expr_return@3755e508>

Er, OK. A good way to inspect Java objects is via bean, so let's view the guts of this object:

antlr-example.core> (bean *1)
{:tree #<CommonTree +>,
 :template nil,
 :stop #<CommonToken [@12,12:12='4',<8>,1:12]>,
 :start #<CommonToken [@0,0:0='1',<8>,1:0]>,
 :class antlr_example.ExprParser$expr_return}

I see. We set up our grammar above to generate an AST, so :tree on the bean will give us that. This translates to .getTree on the Java object. So we can edit our function to call this for us, since we always want to have the tree.

(defn parse-expr [s]
  (let [lexer (ExprLexer. (ANTLRStringStream. s))
        tokens (CommonTokenStream. lexer)
        parser (ExprParser. tokens)]
    (.getTree (.expr parser))))

Now:

antlr-example.core> (parse-expr "1 + 2 * 3 + 4")
#<CommonTree +>

OK, it appears we have one node of the tree, the root node. Let's peak inside:

antlr-example.core> (bean *1)
{:children #<ArrayList [+, 4]>,
 :childIndex -1,
 :parent nil,
 :text "+",
 :nil false,
 :token #<CommonToken [@10,10:10='+',<4>,1:10]>,
 :class org.antlr.runtime.tree.CommonTree,
 :ancestors nil,
 :tokenStartIndex 0,
 :type 4,
 :childCount 2,
 :charPositionInLine 10,
 :tokenStopIndex 12,
 :line 1}

Looks like our children are in :children on the bean, which translates to .getChildren on the Java object. And we have .getChildCount to count the children, and .getText to get the text of the current node. (We could've learned all of this by reading the javadocs for ANTLR too, but what fun is that?)

Since we're dealing with a tree, we can use tree-seq in Clojure to get a flat list of all the tokens in our text.

tree-seq takes three arguments. First a function that returns true if the node has children, false otherwise. Second a function that returns the children for a node that has children. Third the root node of our tree.

That'll give us a seq of tree nodes. So finally we call .getText on all the resulting nodes in the list, to turn node objects into strings.

Easy enough:

(defn node-seq [x]
  (map #(.getText %)
   (tree-seq #(not (zero? (.getChildCount %)))
             #(.getChildren %)
             x)))

antlr-example.core> (node-seq (parse-expr "1 + 2 * 3 + 4"))
("+" "+" "1" "*" "2" "3" "4")

But that's no good. We lost our tree structure.

We'd rather have something like nested vectors or lists. It's easy enough to roll something by hand. This should do it:

(defn AST [node]
  (if (zero? (.getChildCount node))
    (.getText node)
    (let [children (map AST (.getChildren node))
          txt (.getText node)]
      (if txt
        (cons txt children)
        children))))

antlr-example.core> (AST (parse-expr "1 + 2 * 3 + 4"))
("+" ("+" "1" ("*" "2" "3")) "4")

That's better, but it's a list of strings. These strings all happen to be literal representations of Clojure objects, so a call to read-string in the proper places should give us something we can work with:

(defn AST [node]
  (if (zero? (.getChildCount node))
    (read-string (.getText node))
    (let [children (map AST (.getChildren node))
          txt (read-string (.getText node))]
      (if txt
        (apply list txt children)
        children))))

antlr-example.core> (AST (parse-expr "1 + 2 * 3 + 4"))
(+ (+ 1 (* 2 3)) 4)

Hey look, now we have something we can evaluate.

antlr-example.core> (eval *1)
11

Yeah I kind of planned that ahead. If you had a more complex grammar, you might not get away with something quite this easy.

Conclusion

The advantage of ANTLR is that it's mature, widely used, actively developed, and well-documented. (There's a whole book about ANTLR, The Definitive ANTLR Reference by Terence Parr.) There's also a lot of tooling available for it, not just ANTLR Works, but plugins for other Java IDEs.

The disadvantage is that it's a Java library, and as always, there will be some friction when consuming it from Clojure. But it's not that bad.

For pure-Clojure parser generator alternatives, there are fnparse and clojure-pg, neither of which I've tried much.

July 16, 2010 @ 10:41 AM PDT
Cateogory: Programming

5 Comments

Shawn Hoover
Quoth Shawn Hoover on July 19, 2010 @ 1:42 AM PDT

That's cool. It seems on the Clojure list people often recommend skipping the lexer/parser and just implementing DSLs in sexps, but I can see this being useful if you really want to provide a non-sexp language. The resulting tree grammar could then be reusable in any other platform supported by ANTLR.

Marko
Quoth Marko on July 20, 2010 @ 12:15 AM PDT

For those wanting to explore a more functional approach to parsing:

http://github.com/mmikulicic/clarsec

Alex Hall
Quoth Alex Hall on December 23, 2010 @ 8:46 AM PST

Found this post as I was researching Clojure-ANTLR integration for my own project. I've just finished work on my own Leiningen plugin to process ANTLR grammars:

http://github.com/alexhall/lein-antlr

It's not as easy as working in the REPL, but a lot easier than working manually through the ANTLR tools.

Brian
Quoth Brian on January 04, 2011 @ 4:22 AM PST

Awesome idea. It does look easier than the standard ANTLR tools. Thanks for sharing.

Colin Yates
Quoth Colin Yates on July 08, 2011 @ 1:16 AM PDT

Useful post - ta.

P.S. Cntrl+Shift + G compiles in ANTLWorks.

Speak your Mind

You can use Markdown in your comment.
Email/URL are optional. Email is only used for Gravatar.

Preview