Possible Lex Bug

by Kurt Guenther


I think I may have found a bug.  I cut it down to a small parser that
still shows the problem (below).  The problem seems to be in switching
lexical states from a MORE rule (or I just don't understand MORE).

Here's the input file:


---- start -----
(1) GENERAL INFORMATION:
     (i) APPLICANT:  kurt guenther
    (ii) TITLE OF INVENTION:  fusion reactor
---- end  -----

The output and the lexer are below.

The problem arises on line 3 of the input, with the current character
being "i".  On line 2, the lexer correctly enters the TEXT state and
matches <TEXT_LINE> and <TEXT_EOL>.  On line 3, the MORE rule appears to
detect the open parenthesis correctly and switches the state to DEFAULT.
But the lexer then doesn't seem to use the open paren "(" when matching
the next token: matching in DEFAULT starts at the "i" in column 6.
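
To make my reading of the trace concrete, here is a rough Java model of
what seems to be happening.  This is only a toy illustration, not the
code JavaCC generates: the class MoreSketch, its methods, and the
shortened tag prefixes are all made up for this example.  It assumes that
after a MORE match the "(" is kept in the token image, but matching in
the new state resumes at the next character.

```java
public class MoreSketch {
    // Hypothetical stand-ins for the literal prefixes of the DEFAULT-state tag tokens.
    static final String[] DEFAULT_PREFIXES = { "(1)", "(i)", "(ii)" };

    /** Returns the longest DEFAULT prefix that matches input at pos, or null if none does. */
    public static String matchDefault(String input, int pos) {
        String best = null;
        for (String p : DEFAULT_PREFIXES) {
            if (input.startsWith(p, pos) && (best == null || p.length() > best.length())) {
                best = p;
            }
        }
        return best;
    }

    /** Skips leading blanks and tabs, as the SKIP rules do. */
    static int skipBlanks(String line) {
        int pos = 0;
        while (pos < line.length() && (line.charAt(pos) == ' ' || line.charAt(pos) == '\t')) {
            pos++;
        }
        return pos;
    }

    /** Models the MORE rule: "(" is consumed in TEXT, then DEFAULT matching starts after it. */
    public static String afterMore(String line) {
        int pos = skipBlanks(line);
        if (pos >= line.length() || line.charAt(pos) != '(') {
            return null;
        }
        pos++; // the "(" was already consumed by MORE
        return matchDefault(line, pos);
    }

    /** Models matching the same line in DEFAULT with the "(" still unread. */
    public static String withoutMore(String line) {
        return matchDefault(line, skipBlanks(line));
    }

    public static void main(String[] args) {
        String line = "    (ii) TITLE OF INVENTION:  fusion reactor";
        System.out.println("with MORE:    " + afterMore(line));
        System.out.println("without MORE: " + withoutMore(line));
    }
}
```

In this model the tag is only recognized when the "(" has not been
consumed yet, which would be consistent with the trace starting the NFA
on "i".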

I have a JUnit test, the parser, and a sample file.  Let me know and I'll
send a zip.

I'll take out the MORE rule and use SKIP rules for the time being.

--Kurt



Here is the output:

-------  start ------------------
****** FOUND A <TEXT_EOL> MATCH (\r\n) ******

      Consumed token: <<TEXT_EOL>: "
" at line 2 column 35>
<TEXT>Skipping character :   (32)
<TEXT>Skipping character :   (32)
<TEXT>Skipping character :   (32)
<TEXT>Skipping character :   (32)
<TEXT>Current character : ( (40) at line 3 column 5
   No more string literal token matches are possible.
   Currently matched the first 1 characters as a "(" token.
****** FOUND A "(" MATCH (() ******

 >>>>>>>>>>>Switching to DEFAULT
<DEFAULT>Current character : i (105) at line 3 column 6
<DEFAULT>Current character : i (105) at line 3 column 6
   No string literal matches possible.
   Starting NFA to match one of : { <EOL>, <TAG_GENERAL_INFO> }
<DEFAULT>Current character : i (105) at line 3 column 6
    Return: parseText
  Return: parseGeneralInfo
Return: Document
-----------------------end-----------------------


Here's the lexer:


--------------- start ------------------
<*> TOKEN :
{
    <EOF>
}

<DEFAULT> SKIP :
{
    " "
|    "\t"
}

<DEFAULT> TOKEN :
{
    <EOL:              "\n" | "\r" | "\r\n" >
|   <#INTEGER:         ["1"-"9"] (["0"-"9"])* >
|   <#WHITESPACE:   ([" ","\t"])+ >
}


< DEFAULT > TOKEN :
{
    <TAG_COLON:            ":" >
|    <TAG_GENERAL_INFO:        "(1)" <WHITESPACE> "GENERAL INFORMATION:" >   : DEFAULT
|    <TAG_APPLICANT:           "(i)" <WHITESPACE> "APPLICANT:" >             : TEXT
|    <TAG_TITLE_OF_INVENTION:  "(ii)" <WHITESPACE> "TITLE OF INVENTION:" >   : TEXT

}

< TEXT > SKIP :
{
    " "
|      "\t"
}

<TEXT> MORE :
{
    // this should only trigger when found as a first character on the line
    "(" {
        System.out.println(">>>>>>>>>>>Switching to DEFAULT");
        SwitchTo(DEFAULT);
    }
}
 

<TEXT> TOKEN :
{
      <TEXT_LINE:              ~["("," ","\n","\r"] (~["\n","\r"])*  >
|   <TEXT_EOL:                 <EOL> >
}
--------------- end ------------------


