We are trying to create an off-side rule parser, but have problems with multiple UNINDENT.

View: New views
2 Messages — Rating Filter:   Alert me  

We are trying to create an off-side rule parser, but have problems with multiple UNINDENT.

by chris heathcott :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

We are trying to create an off-side rule parser that uses indentation (whitespace) rather than braces to delimit blocks in the abstract syntax.  Ultimately, we'll simplify the concrete syntax to be more like python/haskell, but right now we are using a concrete syntax similar to that of C, with the exception of INDENT and UNINDENT tokens to delimit blocks rather than brace tokens.

In the token manager declarations, we setup a stack to keep track of indentation.  The stack stores the length of the whitespace comprising indentation as ints, and corresponding peek, pop and push methods are defined.

We define a DEFAULT lexical state in which the lexar begins, and we also switch back to the DEFAULT state from the IN_LINE state whenever we encounter an end of line in the latter state.  In the DEFAULT state, we skip empty lines, and we use MORE to capture all the whitespace comprising the indentation.

After matching all the spaces or tabs in the current indentation, we have an action associated with an empty string match that determines if the current line is indented, unindented or the same indentation as the previous line.  We compare the length of the whitespace of the current line to the value at the top of the stack.
 
If the length of the whitespace of the current line is greater than the value on the top of the stack, then an indent occurred, and we push the length on the stack, and we switch to the INDENTING lexar state.  In this state, an empty string match causes the lexar to send an INDENT token to the parser, and this works in our tests.

If the length of the whitespace of the current line is less than the value on the top of the stack, then an unindent occurred, and we pop the stack until the value on the top of the stack is the same as the length of the current whitespace.  Each time we pop, our goal is to send an UNINDENT token to the parser, but we are presently unable to send more than one UNINDENT token.  Strangely, an UNINDENT token is sent only after the last pop, whereas it would seem that this would happen after the first pop.

The second problem is that the parser becomes stuck in an infinite loop of empty string matches in this state.  This is also strange because the UNINDENT token is only matched after the last pop, but also because after the last pop, the length of the whitespace of the current line is equal to the value on the top of the stack, so the lexar should break out of the while loop in the action, and the last else clause in the action should switch back to the IN_LINE lexar state.

-------------------------------------------------------------------------------

Excerpt from Parser.jj:

TOKEN_MGR_DECLS:
{
    Integer[] init_array = new Integer[] {new Integer(0)};
    java.util.ArrayList<Integer> stack =
        new java.util.ArrayList<Integer>(java.util.Arrays.asList(init_array));
    int peek() { return stack.get(stack.size() - 1); }
    int pop() { return stack.remove(stack.size() - 1); }
    void push(int i) { stack.add(i);}
}

<DEFAULT>
SKIP:
{
    "\r" // { System.out.print("(DEFAULT STATE - SKIP: CR)"); }
  | "\n" // { System.out.print("(DEFAULT STATE - SKIP: NL)"); }
}

<DEFAULT>
MORE:
{
    " " // { System.out.print("(DEFAULT STATE - MORE: S)"); }
  | "\t" // { System.out.print("(DEFAULT STATE - MORE: T)"); }
  | ""
    {
        if (image.length() > peek())
        {
            push(image.length());
            System.out.print("(push " + stack + ")");
            SwitchTo(INDENTING);
        }
        else if (image.length() < peek())
        {
            while (image.length() < peek())
            {
                pop();
                System.out.print("(pop " + stack +")");
                SwitchTo(UNINDENTING);
            }
            if (image.length() != peek())
            {
                //throw new TokenMgrError("Improper unindent.",
                //TokenMgrError.LEXICAL_ERROR);
                System.out.print("Error: Improper unindent.");
            }
        }
        else
        {
            SwitchTo(IN_LINE);
        }
    }
}

<INDENTING>
TOKEN:
{
    <INDENT: ""> { System.out.print("<INDENT>"); } : IN_LINE
}

<UNINDENTING>
TOKEN:
{
    <UNINDENT: ""> { System.out.print("<UNINDENT>"); } : DEFAULT
}

<IN_LINE>
SKIP:
{
  | " "
  | "\t"
  | "\r"
}

<IN_LINE>
TOKEN:
{
 
  /* ALL THE REGULAR TOKENS */

  | <EOL:       "\n"> { System.out.print("<EOL>\n"); } : DEFAULT
}

-------------------------------------------------------------------------------

Sample input:

int main ()
    if (true)
        return 1;
int i;

-------------------------------------------------------------------------------

Sample output:

bash-3.2$ ant && java -cp bin proj.Hmm test1.py
Buildfile: build.xml

printJavaVersion:
     [echo]        YOU ARE USING JAVA 1.6.0_11
     [echo]
     [echo]   Java located at: /lusr/opt/jdk1.6.0_11/jre

all:
   [javacc] Java Compiler Compiler Version 4.1 (Parser Generator)
   [javacc] (type "javacc" with no arguments for help)
   [javacc] Reading from file /v/filer4b/v38q001/cjh/cs345/project/proj/src/proj/parser/Parser.jj . . .
   [javacc] Warning: Line 43, Column 5: Regular expression can be matched by the empty string ("") in lexical state DEFAULT. This can result in an endless loop of empty string matches.
   [javacc] Warning: Line 86, Column 5: Regular expression for UNINDENT can be matched by the empty string ("") in lexical state UNINDENTING. This regular expression along with the regular expressions at line 43, column 5 forms the cycle
   [javacc]    UNINDENTING-->DEFAULT-->DEFAULT
   [javacc] containing regular expressions with empty matches. This can result in an endless loop of empty string matches.
   [javacc] File "TokenMgrError.java" is being rebuilt.
   [javacc] File "ParseException.java" is being rebuilt.
   [javacc] File "Token.java" is being rebuilt.
   [javacc] File "SimpleCharStream.java" is being rebuilt.
   [javacc] Parser generated with 0 errors and 2 warnings.
    [javac] Compiling 7 source files to /v/filer4b/v38q001/cjh/cs345/project/proj/bin
    [javac] Note: /v/filer4b/v38q001/cjh/cs345/project/proj/src/proj/parser/Parser.java uses unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.

BUILD SUCCESSFUL
Total time: 3 seconds
(program - start)<INT><IDENTIFIER><LPAREN><RPAREN><EOL>
(block - start)(push [0, 4])<INDENT><IF><LPAREN><RPAREN><EOL>
(block - start)(push [0, 4, 8])<INDENT><RETURN><INTEGER><SEMI>(return - end)<EOL>
(pop [0, 4])(pop [0])<UNINDENT>(block - end)Exception in thread "main" proj.parser.TokenMgrError: Error: Bailing out of infinite loop caused by repeated empty string matches at line 4, column 1.
        at proj.parser.ParserTokenManager.MoreLexicalActions(Unknown Source)
        at proj.parser.ParserTokenManager.getNextToken(Unknown Source)
        at proj.parser.Parser.jj_ntk(Unknown Source)
        at proj.parser.Parser.ifStatement(Unknown Source)
        at proj.parser.Parser.statement(Unknown Source)
        at proj.parser.Parser.statements(Unknown Source)
        at proj.parser.Parser.block(Unknown Source)
        at proj.parser.Parser.restOfFunction(Unknown Source)
        at proj.parser.Parser.Program(Unknown Source)
        at proj.Hmm.main(Unknown Source)

Re: We are trying to create an off-side rule parser, but have problems with multiple UNINDENT.

by Sreenivasa Viswanadha :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


I don't remember now why I did it, but I had to abandon this. So instead,
we did it using semantic lookahead. Basically, just ignore all white space
as you normally do, setTabSize(1) and then do something like the
following.

Hope that helps,
Sreeni.


   try
   {
      LOOKAHEAD({getToken(0).beginLine == getToken(1).beginLine})
      stmt_list() /*<NEWLINE>*/ // multiple statements on the same line
      |

       (
         { indent = getToken(1).beginColumn; }
         (
            LOOKAHEAD({ getToken(1).kind != EOF &&
                        indent == getToken(1).beginColumn })
            statement()
         ) +
       )
   }
   finally
   {
      indent = oldIndent;
   }



> We are trying to create an off-side rule parser that uses indentation
> (whitespace) rather than braces to delimit blocks in the abstract syntax.
> Ultimately, we'll simplify the concrete syntax to be more like
> python/haskell, but right now we are using a concrete syntax similar to
> that
> of C, with the exception of INDENT and UNINDENT tokens to delimit blocks
> rather than brace tokens.
>
> In the token manager declarations, we setup a stack to keep track of
> indentation.  The stack stores the length of the whitespace comprising
> indentation as ints, and corresponding peek, pop and push methods are
> defined.
>
> We define a DEFAULT lexical state in which the lexar begins, and we also
> switch back to the DEFAULT state from the IN_LINE state whenever we
> encounter an end of line in the latter state.  In the DEFAULT state, we
> skip
> empty lines, and we use MORE to capture all the whitespace comprising the
> indentation.
>
> After matching all the spaces or tabs in the current indentation, we have
> an
> action associated with an empty string match that determines if the
> current
> line is indented, unindented or the same indentation as the previous line.
> We compare the length of the whitespace of the current line to the value
> at
> the top of the stack.
>
> If the length of the whitespace of the current line is greater than the
> value on the top of the stack, then an indent occurred, and we push the
> length on the stack, and we switch to the INDENTING lexar state.  In this
> state, an empty string match causes the lexar to send an INDENT token to
> the
> parser, and this works in our tests.
>
> If the length of the whitespace of the current line is less than the value
> on the top of the stack, then an unindent occurred, and we pop the stack
> until the value on the top of the stack is the same as the length of the
> current whitespace.  Each time we pop, our goal is to send an UNINDENT
> token
> to the parser, but we are presently unable to send more than one UNINDENT
> token.  Strangely, an UNINDENT token is sent only after the last pop,
> whereas it would seem that this would happen after the first pop.
>
> The second problem is that the parser becomes stuck in an infinite loop of
> empty string matches in this state.  This is also strange because the
> UNINDENT token is only matched after the last pop, but also because after
> the last pop, the length of the whitespace of the current line is equal to
> the value on the top of the stack, so the lexar should break out of the
> while loop in the action, and the last else clause in the action should
> switch back to the IN_LINE lexar state.
>
> -------------------------------------------------------------------------------
>
> Excerpt from Parser.jj:
>
> TOKEN_MGR_DECLS:
> {
>     Integer[] init_array = new Integer[] {new Integer(0)};
>     java.util.ArrayList<Integer> stack =
>         new
> java.util.ArrayList<Integer>(java.util.Arrays.asList(init_array));
>     int peek() { return stack.get(stack.size() - 1); }
>     int pop() { return stack.remove(stack.size() - 1); }
>     void push(int i) { stack.add(i);}
> }
>
> <DEFAULT>
> SKIP:
> {
>     "\r" // { System.out.print("(DEFAULT STATE - SKIP: CR)"); }
>   | "\n" // { System.out.print("(DEFAULT STATE - SKIP: NL)"); }
> }
>
> <DEFAULT>
> MORE:
> {
>     " " // { System.out.print("(DEFAULT STATE - MORE: S)"); }
>   | "\t" // { System.out.print("(DEFAULT STATE - MORE: T)"); }
>   | ""
>     {
>         if (image.length() > peek())
>         {
>             push(image.length());
>             System.out.print("(push " + stack + ")");
>             SwitchTo(INDENTING);
>         }
>         else if (image.length() < peek())
>         {
>             while (image.length() < peek())
>             {
>                 pop();
>                 System.out.print("(pop " + stack +")");
>                 SwitchTo(UNINDENTING);
>             }
>             if (image.length() != peek())
>             {
>                 //throw new TokenMgrError("Improper unindent.",
>                 //TokenMgrError.LEXICAL_ERROR);
>                 System.out.print("Error: Improper unindent.");
>             }
>         }
>         else
>         {
>             SwitchTo(IN_LINE);
>         }
>     }
> }
>
> <INDENTING>
> TOKEN:
> {
>     <INDENT: ""> { System.out.print("<INDENT>"); } : IN_LINE
> }
>
> <UNINDENTING>
> TOKEN:
> {
>     <UNINDENT: ""> { System.out.print("<UNINDENT>"); } : DEFAULT
> }
>
> <IN_LINE>
> SKIP:
> {
>   | " "
>   | "\t"
>   | "\r"
> }
>
> <IN_LINE>
> TOKEN:
> {
>
>   /* ALL THE REGULAR TOKENS */
>
>   | <EOL:       "\n"> { System.out.print("<EOL>\n"); } : DEFAULT
> }
>
> -------------------------------------------------------------------------------
>
> Sample input:
>
> int main ()
>     if (true)
>         return 1;
> int i;
>
> -------------------------------------------------------------------------------
>
> Sample output:
>
> bash-3.2$ ant && java -cp bin proj.Hmm test1.py
> Buildfile: build.xml
>
> printJavaVersion:
>      [echo]        YOU ARE USING JAVA 1.6.0_11
>      [echo]
>      [echo]   Java located at: /lusr/opt/jdk1.6.0_11/jre
>
> all:
>    [javacc] Java Compiler Compiler Version 4.1 (Parser Generator)
>    [javacc] (type "javacc" with no arguments for help)
>    [javacc] Reading from file
> /v/filer4b/v38q001/cjh/cs345/project/proj/src/proj/parser/Parser.jj . . .
>    [javacc] Warning: Line 43, Column 5: Regular expression can be matched
> by
> the empty string ("") in lexical state DEFAULT. This can result in an
> endless loop of empty string matches.
>    [javacc] Warning: Line 86, Column 5: Regular expression for UNINDENT
> can
> be matched by the empty string ("") in lexical state UNINDENTING. This
> regular expression along with the regular expressions at line 43, column 5
> forms the cycle
>    [javacc]    UNINDENTING-->DEFAULT-->DEFAULT
>    [javacc] containing regular expressions with empty matches. This can
> result in an endless loop of empty string matches.
>    [javacc] File "TokenMgrError.java" is being rebuilt.
>    [javacc] File "ParseException.java" is being rebuilt.
>    [javacc] File "Token.java" is being rebuilt.
>    [javacc] File "SimpleCharStream.java" is being rebuilt.
>    [javacc] Parser generated with 0 errors and 2 warnings.
>     [javac] Compiling 7 source files to
> /v/filer4b/v38q001/cjh/cs345/project/proj/bin
>     [javac] Note:
> /v/filer4b/v38q001/cjh/cs345/project/proj/src/proj/parser/Parser.java uses
> unchecked or unsafe operations.
>     [javac] Note: Recompile with -Xlint:unchecked for details.
>
> BUILD SUCCESSFUL
> Total time: 3 seconds
> (program - start)<INT><IDENTIFIER><LPAREN><RPAREN><EOL>
> (block - start)(push [0, 4])<INDENT><IF><LPAREN><RPAREN><EOL>
> (block - start)(push [0, 4, 8])<INDENT><RETURN><INTEGER><SEMI>(return -
> end)<EOL>
> (pop [0, 4])(pop [0])<UNINDENT>(block - end)Exception in thread "main"
> proj.parser.TokenMgrError: Error: Bailing out of infinite loop caused by
> repeated empty string matches at line 4, column 1.
>         at proj.parser.ParserTokenManager.MoreLexicalActions(Unknown
> Source)
>         at proj.parser.ParserTokenManager.getNextToken(Unknown Source)
>         at proj.parser.Parser.jj_ntk(Unknown Source)
>         at proj.parser.Parser.ifStatement(Unknown Source)
>         at proj.parser.Parser.statement(Unknown Source)
>         at proj.parser.Parser.statements(Unknown Source)
>         at proj.parser.Parser.block(Unknown Source)
>         at proj.parser.Parser.restOfFunction(Unknown Source)
>         at proj.parser.Parser.Program(Unknown Source)
>         at proj.Hmm.main(Unknown Source)
>



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...