Wednesday, May 27, 2009

ANTLR: an exercise in pain.

OK, so for my current project I need to build either a heuristic or a machine-learning-based fuzzy parser. As someone who has written numerous standard parsers before, this qualifies as interesting. The approaches I'm currently considering include cascading concrete grammars, stochastic context-free grammars, and various forms of hidden-Markov-model-based recognisers. Whichever approach ends up working best, the first stage for all of them is a scanner.

So I started building a JFlex lexer, and was making reasonable progress when I found out that we are already using ANTLR for other projects, so I should probably use it as well. Having experienced Mulgara's peak of 5 different parser generators (this does eventually become ridiculous), I was more than willing to use ANTLR. Yes, it's LL, and yes, I have previously discussed my preference for LR; but it does support attaching semantic actions to productions, so my primary requirement of a parser generator is met. And anyway, it has an excellent reputation and a substantial, active community.

What now stuns me is just how bad the documentation can be for such a popular tool. An almost non-existent wiki, a FAQ, and a woefully incomplete Doxygen dump do not substitute for a reference. ANTLR has worse documentation than SableCC had when its documentation consisted of an appendix to a master's thesis!

My conclusion: if you have any choice, don't use ANTLR. For Java: if you must use LL, I currently recommend JavaCC; if you can use an LALR parser, do so. My current preference there is Beaver.