
CookCC - Unique Lexer/Parser Generator using Java Annotation


CookCC is an LALR(1) lexer/parser generator released under the BSD license.


The unique feature of CookCC is that lexer patterns and parser grammars can be specified directly in Java code using Java annotations. The result is a much easier and more straightforward way of writing lexers and parsers.

Traditionally, writing a lexer or parser with a code generator requires a proprietary file format. As a result, one cannot use advanced Java IDE features such as syntax highlighting, error checking, code templates, and context-sensitive help. Using XML makes it possible to target different languages, but it does not simplify the task of writing the code. Code written in these formats is difficult to write, difficult to maintain, and difficult to document.

The Java annotation approach makes it possible to use the best features of modern Java IDEs, including refactoring and reference finding. Additionally, the annotations are not required at runtime: the generated code can run on earlier versions of the JVM (prior to 1.5).
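As a rough illustration of the idea (the annotation name and the reflection-based dispatch below are hypothetical stand-ins for explanation only, not CookCC's actual API or implementation), a lexer rule can be an ordinary annotated Java method that a generator discovers and wires up:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a CookCC-style pattern annotation.
@Retention (RetentionPolicy.RUNTIME)
@interface Lex
{
    String pattern ();
}

public class MiniLexer
{
    // Each token rule is a plain Java method: the IDE can highlight,
    // refactor, and error-check it like any other code.
    @Lex (pattern = "[0-9]+")
    static String number (String text) { return "NUM(" + text + ")"; }

    @Lex (pattern = "[a-zA-Z]+")
    static String word (String text) { return "WORD(" + text + ")"; }

    public static List<String> scan (String input) throws Exception
    {
        // collect (pattern, action) pairs from the annotated methods
        Map<Pattern, Method> rules = new LinkedHashMap<> ();
        for (Method m : MiniLexer.class.getDeclaredMethods ())
        {
            Lex lex = m.getAnnotation (Lex.class);
            if (lex != null)
                rules.put (Pattern.compile (lex.pattern ()), m);
        }
        List<String> tokens = new ArrayList<> ();
        int pos = 0;
        while (pos < input.length ())
        {
            if (Character.isWhitespace (input.charAt (pos))) { ++pos; continue; }
            boolean matched = false;
            for (Map.Entry<Pattern, Method> rule : rules.entrySet ())
            {
                Matcher m = rule.getKey ().matcher (input).region (pos, input.length ());
                if (m.lookingAt ())
                {
                    tokens.add ((String) rule.getValue ().invoke (null, m.group ()));
                    pos = m.end ();
                    matched = true;
                    break;
                }
            }
            if (!matched)
                throw new IllegalStateException ("no rule matches at offset " + pos);
        }
        return tokens;
    }
}
```

The runtime reflection here merely stands in for what a generator does at build time; an actual generator compiles the patterns into DFA tables rather than matching each regex per character position.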

Significant effort has been made to generate fast, efficient code optimized for a variety of conditions. The result is fast start-up and fast running time; in a simple word-count benchmark, CookCC is about twice as fast as JFlex.

Other features include detailed warnings and analysis for lexer patterns and parser grammars. An Ant task is also provided.

Bug reports, comments, and suggestions are welcome.

Lexer-only code:


http://code.google.com/p/cookcc/source/browse/trunk/javaap_src/org/yuanheng/cookcc/input/javaap/FileHeaderScanner.java (cyclically generated and used by CookCC itself)

Lexer & parser code:


Published at DZone with permission of its author, Heng Yuan.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)


Daniel Spiewak replied on Sat, 2008/11/29 - 4:57am

The main advantage to a parser generator is the ability to specify a grammar in a declarative, easy to read format -- preferably one which is as close to BNF/EBNF as possible.  This is kinda cool, but it doesn't seem to preserve that benefit.  The grammar is highly obfuscated within a series of annotations and production logic.  I honestly can't imagine trying to build a non-trivial language with this.

Nishant Kumar replied on Sat, 2008/11/29 - 5:08am

A comparative example would have helped. I was not able to find an example in the link provided. I admit I did not try very hard, either.

Heng Yuan replied on Sat, 2008/11/29 - 5:19am in response to: Daniel Spiewak

I beg to differ.

Here is scan.l from the flex source. It is actually hard to read and error-prone (watch out for the matching curly braces). It is also difficult to read the code without good syntax highlighting.

    ^"%top"[[:blank:]]*"{"[[:blank:]]*{NL}    {
            brace_start_line = linenum;
            buf_linedir( &top_buf, infilename?infilename:"<stdin>", linenum);
            brace_depth = 1;
            }

    ^"%top".*    synerr( _("malformed '%top' directive") );

    {WS}    /* discard */


And here is parse.y from the flex source:


    if ( ! bol_needed )
        {
        bol_needed = true;

        if ( performance_report > 1 )
            pinpoint_message(
                "'^' operator results in sub-optimal performance" );
        }
    ...
|  rule
    {
    pat = $1;
    finish_rule( pat, variable_trail_rule,
        headcnt, trailcnt, previous_continued_action );

    if ( scon_stk_ptr > 0 )
        {
        for ( i = 1; i <= scon_stk_ptr; ++i )
            scset[scon_stk[i]] =
                mkbranch( scset[scon_stk[i]],
                          pat );
        }
    }


Notice that the rule is buried in the middle of other C code. Also, for $1, $2, $$, etc., the type is not clear, and if you have a number of arguments, the meaning of $1 is easily lost. The annotation approach lets you pick meaningful parameter names, with the types clearly visible.
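To make the contrast concrete (the annotation, class, and method names here are hypothetical illustrations, not CookCC's actual API), a production's action in the annotation style is an ordinary method whose parameters carry names and types, where yacc would have positional, untyped $1/$2/$3:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Hypothetical stand-ins for a CookCC-style grammar annotation and an AST node.
@Retention (RetentionPolicy.RUNTIME)
@interface Rule
{
    String lhs ();
    String rhs ();
}

class Node
{
    final String op;
    final Node left, right;
    final int value;

    Node (int value) { this.op = null; this.left = null; this.right = null; this.value = value; }
    Node (String op, Node left, Node right) { this.op = op; this.left = left; this.right = right; this.value = 0; }

    int eval ()
    {
        if (op == null)
            return value;
        int l = left.eval (), r = right.eval ();
        return "+".equals (op) ? l + r : l * r;
    }
}

public class Productions
{
    // In yacc this would be: expr : expr '+' term { $$ = node ("+", $1, $3); }
    // Here the symbols are named, typed parameters the IDE understands.
    @Rule (lhs = "expr", rhs = "expr '+' term")
    static Node add (Node left, Node right) { return new Node ("+", left, right); }

    @Rule (lhs = "term", rhs = "term '*' factor")
    static Node mul (Node left, Node right) { return new Node ("*", left, right); }
}
```

Renaming `left` or changing its type is then a routine IDE refactoring, which has no equivalent for $1 in a .y file.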

Not to mention, without context-sensitive help, you have to actually look up the functions, check for spelling errors, etc. yourself (until you compile the code).

If you are working within Eclipse, IDEA, etc., syntax highlighting, code templates, function/parameter hints, and so on are all immediately available to you. And you can refactor the code. Annotations are usually shown in a different color from the rest of the code, so to me the grammar is fairly easy to see.

Heng Yuan replied on Sat, 2008/11/29 - 5:26am in response to: Nishant Kumar

I added links to some sample code. One of the files is used internally by CookCC itself.

Daniel Spiewak replied on Sat, 2008/11/29 - 1:06pm in response to: Heng Yuan

It's not the rule that is buried, it's the production logic. I personally find the flex source you gave fairly easy to read. It's certainly not the most intuitive format in the world, but it isn't bad. As for parse.y, I think that's a bad example. The only production rule you showed had a single non-terminal ("rule") and complex logic. Naturally that leads to the conclusion that Bison buries its logic within C, negating my point.

 Most grammars are not like the one you showed.  In my experience, the parser usually has very little logic associated with it.  Well, at least it does in an LR grammar.  This advantage should be leveraged by tools by making the grammar the pre-eminent syntactic feature.  This isn't an option when embedding a DSL within Java, thus we have the situation with CookCC where the grammar is overwhelmed by the production logic.

Heng Yuan replied on Sat, 2008/11/29 - 4:20pm in response to: Daniel Spiewak

For lexer code, my experience is that there is a lot of logic in there. Plus, I know many people do not want to leave Java development and switch to another language (in this case, the *.l file) to code their logic. As I mentioned, there are problems with syntax highlighting, refactoring, etc., so coding in Java is better. Also, if you look at the lexer component of the Calculator example, the amount of code is actually smaller in Java (and yes, I did consider the case where we merely try to recognize keywords / operators).

For parser code, it is true that there is usually not a lot of logic, and even less if a tree builder is used. However, if there is little logic, the parser action code in CookCC will not be much either, and it will not obscure the view. CookCC can generate an analysis text (which contains the complete grammar, tokens, states, etc.) and can generate an XML grammar (set the output language to xml instead of java) if you want a condensed view of the grammar. As for tree building, which is not yet in CookCC, my guess for the upcoming implementation is that it will look very much like the lexer part, where many productions of a single non-terminal are grouped together, so it will not be a problem.

From my experience, one of the main troubles in using lex/yacc is that the two tools are not integrated, so people go back and forth between them: include the token file, re-generate the token file, and so on. Other annoyances include the lack of syntax highlighting, instant error checking for functions/names, and function hints. These are the problems CookCC is trying to solve.

One of the target audiences of CookCC is users of Java's Pattern class who rely heavily on it to do the job. The problems with that approach are that it is (a) slow, (b) difficult to write as a single gigantic pattern, and (c) hard to combine with complex logic. Using CookCC is far easier for dealing with text input without leaving the Java IDE.
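For reference, the "single gigantic pattern" style being described looks something like this sketch with java.util.regex, where every token type is one alternative with a named group (token names and patterns here are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One large regex with a named group per token type: workable for small
// inputs, but it grows unwieldy quickly and re-scans with the full
// alternation at every position.
public class PatternTokenizer
{
    private static final Pattern TOKENS = Pattern.compile (
        "(?<num>[0-9]+)|(?<id>[a-zA-Z_][a-zA-Z0-9_]*)|(?<op>[+*/=-])|(?<ws>\\s+)");

    public static List<String> tokenize (String input)
    {
        List<String> out = new ArrayList<> ();
        Matcher m = TOKENS.matcher (input);
        int pos = 0;
        while (pos < input.length ())
        {
            if (!m.find (pos) || m.start () != pos)
                throw new IllegalStateException ("unexpected character at offset " + pos);
            if (m.group ("num") != null) out.add ("NUM:" + m.group ());
            else if (m.group ("id") != null) out.add ("ID:" + m.group ());
            else if (m.group ("op") != null) out.add ("OP:" + m.group ());
            // whitespace group: discard
            pos = m.end ();
        }
        return out;
    }
}
```

Every new token type means editing the one regex and the if/else chain together, which is exactly the maintenance burden a generated lexer avoids.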
