SableCC 4-beta.1

by Etienne M. Gagnon

Hi all,

It is with great pleasure that I announce the release of the first beta version of SableCC 4. It is far from complete, but I wanted to let you start playing with some of the new features.

This version generates lexers (and related token classes) using the new and powerful engine based on lexical expressions. It also supports the new, flexible syntax for defining tokens. I'll discuss both below.

First, let me warn you of a few limitations:
  1. Parsers are not yet supported.
  2. The generation engine is a quick hack (using ObjectMacro, though) and it only targets Java. The generated code is far from definitive.
  3. The accepted lexer specification syntax, while offering many improvements over SableCC 3, does not yet support all the planned features. In particular, contexts, groups, investigators, selectors, and syntactic tokens are not yet supported.
  4. There's no specific support for file encoding, etc. Generated lexers use a simple Reader to get a sequence of Java "char" characters. The internal lexer generation engine, though, is not limited by any numerical range.
  5. SableCC 4 generates a nice object-oriented state machine, but it could generate somewhat better code, e.g. use binary search to match a char to a symbol.
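
To make limitation 5 concrete, here is a minimal sketch of what "binary search to match a char to a symbol" could look like. It is not the code SableCC 4 generates; the class name, the symbol constants, and the interval table are all assumptions made up for illustration.

```java
import java.util.Arrays;

// Hypothetical sketch: maps a char to a symbol id by binary search
// over the sorted lower bounds of consecutive char intervals.
final class SymbolTable {
    static final int OTHER = 0, DIGIT = 1, UPPER = 2, LOWER_CASE = 3;

    // Interval i covers [LOWER_BOUND[i], LOWER_BOUND[i + 1] - 1];
    // the last interval extends to Character.MAX_VALUE.
    private static final char[] LOWER_BOUND = { 0, '0', ':', 'A', '[', 'a', '{' };
    private static final int[] SYMBOL = { OTHER, DIGIT, OTHER, UPPER, OTHER, LOWER_CASE, OTHER };

    static int symbolOf(char c) {
        int i = Arrays.binarySearch(LOWER_BOUND, c);
        if (i < 0) {
            // Not an exact lower bound: binarySearch returns -(insertionPoint) - 1,
            // and the interval containing c starts at insertionPoint - 1.
            i = -i - 2;
        }
        return SYMBOL[i];
    }
}
```

A lookup such as SymbolTable.symbolOf('q') costs O(log n) comparisons in the number of intervals, instead of a linear scan over them.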
OK, now that you're warned, I guess that some of you are still eager to start playing with the beast.

OK, so what's so neat about SableCC 4 lexers? There's nothing like an example to show you.
Language demo;

Lexer

  // some named expressions

  letter = upper_case | 'a'..'z';
  upper_case = 'A'..'Z'; // defined after letter
  digit = '0'..'9';

  blank = (' ' | #xA | #13)+;

  number = digit+;

  while = 'while'; // some keyword

  identifier = letter (letter | digit)*;

  for = 'for'; // another keyword (defined after identifier)

  ip_num = digit^(1..3);
  ip_address = (ip_num Separator '.')^4;

  zero = '0'+ | 'zero';

  Token
    for, number, identifier, while, 'else', ip_address, zero;

  Ignored
    blank;

  Priority
    zero > number;
    zero > identifier; 
So, here are some of the features above:
  1. The declaration order of named expressions does not matter. You define them in whatever order and semantics won't change. No more dreaded "keyword hidden by identifier" bugs to fight.
  2. SableCC is able to figure out the "natural" priority between tokens. For example, it was able to figure out that the "while" and "for" tokens have priority over "identifier".
  3. In unnatural cases, SableCC reports a priority error, unless an explicit priority is given as shown above, e.g. "zero" wins over "number" and "identifier".
  4. The exponent syntax (discussed on this list a while ago) is supported.
  5. Anonymous (i.e. inlined) tokens are supported. See the 'else' declaration in the Token declaration.
  6. Characters can be expressed in decimal (#123) and hexadecimal (#x9ab) notation.
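
For readers more used to conventional regular expressions, the exponent syntax of feature 4 can be compared with a java.util.regex pattern. This is only a sketch: it assumes that digit^(1..3) means one to three repetitions and that (ip_num Separator '.')^4 means four ip_num occurrences separated by '.', which is how the example reads. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: a java.util.regex approximation of
//   ip_num = digit^(1..3);
//   ip_address = (ip_num Separator '.')^4;
final class IpAddressDemo {
    // One ip_num, then three more, each preceded by a '.' separator.
    static final Pattern IP_ADDRESS =
        Pattern.compile("[0-9]{1,3}(?:\\.[0-9]{1,3}){3}");

    static boolean isIpAddress(String s) {
        return IP_ADDRESS.matcher(s).matches();
    }
}
```

Note how the Separator form avoids repeating the ip_num sub-expression, which the plain regular expression has to write twice.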
You probably noticed that I used the name "lexical expressions" at the beginning of this message. A lexical expression is a superset of a regular expression where two lexical operators have been added: 'Look' (lookahead) and '-' (lexical subtraction).

These two operators can be used anywhere within an expression, not only at the top level, and they do deliver the expected semantics: the longest token is always matched (without taking lookahead into account), and scanning is done in linear time O(input length).

How can they be used? Here are some examples:
Language demo2;

Lexer

  // line comment
  line_comment = '//' (Shortest (Any* Look (eol | '' Look End))) (eol | '' Look End);

  // c comment
  c_comment = '/*' (Any* - '*/') '*/';
The lexical subtraction operator '-' is not a regular expression "difference" operator. For example:
  • Any* Diff '*/' means: All strings except the single string '*/'.
  • Any* - '*/' means: All strings except those that contain any part of a '*/' character sequence in the input, while taking lookahead into account.
Using the Diff operator, one would have to write Any* Diff (Any* '*/' Any*) to express the body of a C comment, yet even that wouldn't work. Given the input stream:
abc*/defg...
this expression would match 'abc*', as it is the longest matching string! In general, we would like to match 'abc'; that's what the lexical subtraction operator does.
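
The difference can be reproduced on that same input with a few lines of Java. This is a hand-written sketch of the two behaviours described above, not SableCC's algorithm; both method names are hypothetical, and the second method only approximates lexical subtraction for this particular case.

```java
// Hypothetical sketch: contrasts the two ways of reading "no '*/' in the body".
final class CommentBodyDemo {
    // Diff-style reading: the longest prefix of the input that, taken as a
    // string on its own, does not contain "*/".
    static String longestPrefixWithoutCloser(String input) {
        int end = 0;
        for (int i = 0; i <= input.length(); i++) {
            if (input.substring(0, i).contains("*/")) {
                break; // every longer prefix contains it too
            }
            end = i;
        }
        return input.substring(0, end);
    }

    // Subtraction-style reading: stop before any part of a "*/" sequence
    // found in the input, which requires looking past the match itself.
    static String prefixBeforeCloser(String input) {
        int idx = input.indexOf("*/");
        return idx < 0 ? input : input.substring(0, idx);
    }
}
```

On "abc*/defg", the first method yields "abc*" and the second yields "abc", which is the body a C comment lexer needs.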

You can use the operators as you want:
  dummy = ('a' Look 'a'* 'b')*;
This expression will match 0 or more 'a', where each 'a' is followed by 0 or more 'a' and then 'b'.

There's also a 'Look Not' to match when not followed by an expression.
 dummy2 = 'b' Look Not End;  // b, when not followed by EOF
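
For comparison, both expressions have close analogues in java.util.regex lookahead syntax. This is a sketch only: the class and helper names are hypothetical, and Java's backtracking matcher does not offer the linear-time guarantee mentioned above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: java.util.regex analogues of dummy and dummy2.
final class LookDemo {
    // dummy = ('a' Look 'a'* 'b')*;
    static final Pattern DUMMY = Pattern.compile("(?:a(?=a*b))*");

    // dummy2 = 'b' Look Not End;  (\z is the end of input)
    static final Pattern DUMMY2 = Pattern.compile("b(?!\\z)");

    // Returns the text matched at the start of the input, or null if none.
    static String matchAtStart(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }
}
```

With these patterns, DUMMY matches "aaa" at the start of "aaab" but only the empty string at the start of "aaac", while DUMMY2 matches the "b" in "bc" and nothing in "b" alone. In each case the lookahead text is checked but not consumed.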

Oh yes... SableCC automatically generates a small Test.java application to test your lexer (in the language_languageName package), to save you some typing.

So, there it is. Please play with it and report back your comments and suggestions to this list.

Have fun!

Etienne

-- 
Etienne M. Gagnon, Ph.D.
SableCC:                                            http://sablecc.org