detecting tokenization errors, reporting them to the user


by Kenneth R Beesley

Subject:  detecting tokenization errors, and reporting them to the user:  request for advice


I've used JavaCC to implement a parser for a programming language that  
has single-quoted tokens ("multicharacter symbols") that look like this:

'name'   'foo'   'bar'

1.  The tokens are delimited with single quotes.  The "name" of the
token is the string inside the single quotes.
2.  Literalized single quotes can appear in the name, e.g.  'ab\'cd'
(the actual name in this case is ab'cd )
3.  The names inside the single quotes should contain at least two
characters, i.e.  'aa' and 'aaa' are good but 'a'  is an error
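
The three rules above can also be stated as a plain Java check, independent of the tokenizer. This is only a sketch: the class and method names are made up for illustration, and it assumes (as in the grammar further down) that a name may not contain a newline or carriage return.

```java
// Hypothetical helper, not part of JavaCC: checks one quoted symbol against rules 1-3.
final class MulticharSymbolCheck {

    /** Returns the unescaped name, or null if the text is not a well-formed quoted symbol. */
    static String unescapedName(String quoted) {
        int n = quoted.length();
        // Rule 1: the symbol must start and end with a single quote.
        if (n < 2 || quoted.charAt(0) != '\'' || quoted.charAt(n - 1) != '\'') {
            return null;
        }
        StringBuilder name = new StringBuilder();
        for (int i = 1; i < n - 1; i++) {
            char c = quoted.charAt(i);
            if (c == '\\' && quoted.charAt(i + 1) == '\'') {
                if (i + 1 == n - 1) {
                    return null; // the final quote is escaped, so the symbol never closes
                }
                name.append('\''); // Rule 2: \' stands for a literal single quote
                i++;
            } else if (c == '\'' || c == '\n' || c == '\r') {
                return null; // unescaped quote or line break inside the name
            } else {
                name.append(c);
            }
        }
        return name.toString();
    }

    /** Rule 3: the unescaped name must contain at least two characters. */
    static boolean isValid(String quoted) {
        String name = unescapedName(quoted);
        return name != null && name.length() >= 2;
    }

    public static void main(String[] args) {
        System.out.println(unescapedName("'ab\\'cd'")); // prints ab'cd
        System.out.println(isValid("'aa'"));            // prints true
        System.out.println(isValid("'a'"));             // prints false
    }
}
```

Note that the length is checked after unescaping, so a symbol whose name is a single literalized quote counts as too short.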

Problem:  I'm looking for a better way to detect cases like 'a' and
signal the user helpfully that they are errors.  If I didn't have to
handle the literalized single quotes, it would be easy.

********

Here's what I'm doing right now.  It's working.  The tokenization for  
MULTICHAR_SYMBOL currently looks like this:

<DEFAULT>        // start of a single-quoted multicharacter symbol
MORE: {
     "'"  :  IN_MULTICHAR_NAME    // switch into new state to match the name
}

// emit a MULTICHAR_SYMBOL token when the terminating single-quote is found

< IN_MULTICHAR_NAME >
TOKEN: {
     < MULTICHAR_SYMBOL: "'" >
     {
         // delete the delimiting single quotes, leaving the name string
         // image is a built-in StringBuffer that collects the characters matched
         image.deleteCharAt(0) ;
         image.deleteCharAt( image.length() - 1 ) ;

         // finally, set the matchedToken.image field (a String) from the image (a StringBuffer)
         matchedToken.image = image.toString() ;

         // CHECK FOR THE LENGTH ERROR
         // catch names less than 2 characters long
         if (image.length() < 2) {
             SwitchTo(BAD_TOKENIZATION) ;
         } else {
             SwitchTo(DEFAULT) ;
         }
     }
}

// handle literalized single quotes in the multichar-symbol name

< IN_MULTICHAR_NAME >
MORE: {
     "\\'"
     {
         // delete the literalizing backslash in the image, leaving the single quote
         image.deleteCharAt(image.length() - 2) ;
     }
}
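
To see why the index is image.length() - 2, the buffer edits made by the two actions above can be replayed in plain Java. This is a rough sketch only: a StringBuilder stands in for JavaCC's image buffer, and the class and method names are invented for the demonstration.

```java
// Hypothetical demo class; a StringBuilder plays the role of JavaCC's image buffer.
final class ImageBufferDemo {

    // Mirrors the MORE action for \' : the buffer ends with backslash + quote,
    // so the backslash sits at index length - 2.
    static void dropEscapingBackslash(StringBuilder image) {
        image.deleteCharAt(image.length() - 2);
    }

    // Mirrors the TOKEN action: remove the opening and closing quote.
    static String stripDelimiters(StringBuilder image) {
        image.deleteCharAt(0);
        image.deleteCharAt(image.length() - 1);
        return image.toString();
    }

    public static void main(String[] args) {
        StringBuilder image = new StringBuilder("'ab\\'"); // matched so far: 'ab\'
        dropEscapingBackslash(image);                      // buffer is now 'ab'
        image.append("cd'");                               // rest of the input: cd'
        System.out.println(stripDelimiters(image));        // prints ab'cd
    }
}
```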

< IN_MULTICHAR_NAME >
MORE : {
    // collect any symbol except newline, carriage return or single quote
     <  ~["\n", "\r", "'"]  >
}

< BAD_TOKENIZATION >
TOKEN: {
    < BAD_TOKEN: ~[] >
}

So the way this currently works, the tokenizer detects cases like 'a',  
where the name inside the single quotes is less than 2 characters  
long, and it switches to the BAD_TOKENIZATION state.  In that state,  
the tokenizer consumes whatever character comes next, labels it a  
BAD_TOKEN, and the parser chokes on the BAD_TOKEN.  (It then does a  
deep error recovery, re-syncs, and starts parsing again.)

This all works, but it doesn't really tell the user what went wrong in  
cases like 'a'.  It's not helpful to the poor user.

Is there a better way, when such errors are detected during  
tokenization, to convey to users what went wrong?

Thanks,

Ken


******************************
Kenneth R. Beesley, D.Phil.
P.O. Box 540475
North Salt Lake, UT
84054  USA






---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...


RE: detecting tokenization errors, reporting them to the user

by Mazas Marc

You should be able to use whatever means you already use for interacting with the user, through custom Java code in a Java block.
For example, when javacc is run from the command line, it displays information and errors through System.out / System.err, and you could do the same.

< BAD_TOKENIZATION >
TOKEN: {
    < BAD_TOKEN: ~[] >
    { System.err.println("name " + <adequate variable> + " too short ; should be at least 2 characters."); }
}
You can add line and column numbers, too.
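
As a rough illustration of a message with line and column numbers, here is a small self-contained Java sketch. The LexDiagnostics class and its methods are hypothetical, not part of JavaCC. The assumption is that a lexical action would pass in the offending name (for instance from image in the MULTICHAR_SYMBOL action, where the name is still available) together with the position, which JavaCC-generated Token objects carry in fields such as beginLine and beginColumn.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical collector; class and method names are illustrative, not part of JavaCC.
final class LexDiagnostics {
    private final List<String> messages = new ArrayList<>();

    /** Builds the user-facing text for a name shorter than two characters. */
    static String shortNameMessage(String name, int line, int column) {
        return "line " + line + ", column " + column + ": name '" + name
                + "' too short; should be at least 2 characters.";
    }

    /** Prints the message to System.err and keeps it for later inspection. */
    void reportShortName(String name, int line, int column) {
        String message = shortNameMessage(name, line, column);
        messages.add(message);
        System.err.println(message);
    }

    /** Read-only view of everything reported so far. */
    List<String> messages() {
        return Collections.unmodifiableList(messages);
    }

    public static void main(String[] args) {
        LexDiagnostics diagnostics = new LexDiagnostics();
        // In a grammar this call might sit in a lexical action, e.g. (assumption):
        //   diagnostics.reportShortName(matchedToken.image,
        //                               matchedToken.beginLine, matchedToken.beginColumn);
        diagnostics.reportShortName("a", 3, 7);
        System.out.println(diagnostics.messages().size() + " error(s) reported");
    }
}
```

Keeping the messages in a list as well as printing them lets a front end other than the command line (an editor, a GUI) show them in its own way.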

Marc MAZAS
 

-----Original Message-----
From: Kenneth Reid Beesley [mailto:krbeesley@...]
Sent: Thursday, 17 September 2009 00:23
To: javacc users
Subject: [JavaCC] detecting tokenization errors, reporting them to the user
