detecting tokenization errors, reporting them to the user

Subject: detecting tokenization errors, and reporting them to the user: request for advice

I've used JavaCC to implement a parser for a programming language that has single-quoted tokens ("multicharacter symbols") that look like this:

    'name' 'foo' 'bar'

1. The tokens are delimited with single quotes. The "name" of the token is the string inside the single quotes.
2. Literalized single quotes can appear in the name, e.g. 'ab \'cd' (the actual name in this case is ab'cd).
3. The names inside the single quotes should contain at least two characters, i.e. 'aa' and 'aaa' are good but 'a' is an error.

Problem: I'm looking for a better way to detect cases like 'a' and signal the user helpfully that they are errors. If I didn't have to handle the literalized single quotes, it would be easy.

********

Here's what I'm doing right now. It's working. The tokenization for MULTICHAR_SYMBOL currently looks like this:

    // start of a single-quoted multicharacter symbol
    <DEFAULT> MORE:
    {
      "'" : IN_MULTICHAR_NAME   // switch into new state to match the name
    }

    // emit a MULTICHAR_SYMBOL token when the terminating single-quote is found
    < IN_MULTICHAR_NAME > TOKEN:
    {
      < MULTICHAR_SYMBOL: "'" >
      {
        // delete the delimiting single quotes, leaving the name string
        // image is a built-in StringBuffer that collects the characters matched
        image.deleteCharAt(0) ;
        image.deleteCharAt( image.length() - 1 ) ;
        // finally, set the matchedToken.image field (a String) from the image (a StringBuffer)
        matchedToken.image = image.toString() ;

        // CHECK FOR THE LENGTH ERROR
        // catch names less than 2 characters long
        if (image.length() < 2) {
          SwitchTo(BAD_TOKENIZATION) ;
        } else {
          SwitchTo(DEFAULT) ;
        }
      }
    }

    // handle literalized single quotes in the multichar-symbol name
    < IN_MULTICHAR_NAME > MORE:
    {
      "\\'"
      {
        // delete the literalizing backslash in the image, leaving the single quote
        image.deleteCharAt(image.length() - 2) ;
      }
    }

    < IN_MULTICHAR_NAME > MORE :
    {
      // collect any symbol except newline, carriage return or single quote
      < ~["\n", "\r", "'"] >
    }

    < BAD_TOKENIZATION > TOKEN:
    {
      < BAD_TOKEN: ~[] >
    }

So the way this currently works, the tokenizer detects cases like 'a', where the name inside the single quotes is less than 2 characters long, and it switches to the BAD_TOKENIZATION state. In that state, the tokenizer consumes whatever character comes next, labels it a BAD_TOKEN, and the parser chokes on the BAD_TOKEN. (It then does a deep error recovery, re-syncs, and starts parsing again.)

This all works, but it doesn't really tell the user what went wrong in cases like 'a'. It's not helpful to the poor user.

Is there a better way, when such errors are detected during tokenization, to convey to users what went wrong?

Thanks,

Ken

******************************
Kenneth R. Beesley, D.Phil.
P.O. Box 540475
North Salt Lake, UT 84054
USA

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...
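The three naming rules described in the message above can also be written down outside the tokenizer. What follows is a minimal sketch in plain Java, not JavaCC grammar code and not the poster's implementation. The class name `MulticharName`, the method `unquote`, and the wording of the error messages are all hypothetical, chosen only for illustration, and the sketch assumes its input is one complete quoted symbol.

```java
/** Hypothetical helper, not part of JavaCC: applies the three naming rules to one quoted symbol. */
final class MulticharName {

    /** Minimum number of characters a name must contain (rule 3 in the message). */
    static final int MIN_LENGTH = 2;

    /**
     * Strips the delimiting single quotes, turns each \' into ', and enforces
     * the minimum length. Throws IllegalArgumentException with a readable message.
     */
    static String unquote(String quoted) {
        int end = quoted.length() - 1; // index of the closing quote
        if (end < 1 || quoted.charAt(0) != '\'' || quoted.charAt(end) != '\'') {
            throw new IllegalArgumentException("not a single-quoted symbol: " + quoted);
        }
        StringBuilder name = new StringBuilder();
        for (int i = 1; i < end; i++) {
            char c = quoted.charAt(i);
            if (c == '\\' && quoted.charAt(i + 1) == '\'') {
                if (i + 1 == end) {
                    // the final quote is literalized, so nothing closes the symbol
                    throw new IllegalArgumentException("unterminated symbol: " + quoted);
                }
                name.append('\''); // keep the quote, drop the backslash
                i++;               // skip the quote that was just consumed
            } else if (c == '\'' || c == '\n' || c == '\r') {
                throw new IllegalArgumentException(
                        "character not allowed in a symbol name at offset " + i + ": " + quoted);
            } else {
                name.append(c);
            }
        }
        if (name.length() < MIN_LENGTH) {
            throw new IllegalArgumentException(
                    "symbol name \"" + name + "\" is too short; at least "
                    + MIN_LENGTH + " characters are required");
        }
        return name.toString();
    }

    public static void main(String[] args) {
        System.out.println(unquote("'foo'"));
        try {
            unquote("'a'");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the sketch is only that the length check runs on the name after the backslashes have been removed, which is also where the grammar above performs it.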
RE: detecting tokenization errors, reporting them to the user

You should be able to use whatever means you already use for interacting with the user, through custom Java code in a Java block.

For example, if the user runs JavaCC from the command line, JavaCC displays information and errors through System.out / System.err, and you could do the same:

    < BAD_TOKENIZATION > TOKEN:
    {
      < BAD_TOKEN: ~[] >
      {
        System.err.println("name " + <adequate variable> + " too short; should be at least 2 characters.");
      }
    }

You can add line and column numbers, too.

Marc MAZAS

-----Original Message-----
From: Kenneth Reid Beesley [mailto:krbeesley@...]
Sent: Thursday, 17 September 2009 00:23
To: javacc users
Subject: [JavaCC] detecting tokenization errors, reporting them to the user
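To make the suggestion about line and column numbers concrete, here is a small sketch of a message builder in plain Java. It is an illustration under stated assumptions, not code from either poster: the class `LexicalErrors` and its methods are hypothetical, and the message format is invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper (not part of JavaCC) that builds and collects lexical error messages. */
final class LexicalErrors {

    private final List<String> messages = new ArrayList<>();

    /** Builds the message for a symbol name shorter than two characters. */
    static String tooShort(String name, int line, int column) {
        return String.format(
                "line %d, column %d: symbol name '%s' is too short (%d character%s); at least 2 are required",
                line, column, name, name.length(), name.length() == 1 ? "" : "s");
    }

    /** Records the message and echoes it on System.err, as suggested in the reply. */
    void reportTooShort(String name, int line, int column) {
        String message = tooShort(name, line, column);
        messages.add(message);
        System.err.println(message);
    }

    /** Read-only view of everything reported so far. */
    List<String> messages() {
        return Collections.unmodifiableList(messages);
    }

    public static void main(String[] args) {
        LexicalErrors errors = new LexicalErrors();
        errors.reportTooShort("a", 3, 7);
    }
}
```

Inside a TOKEN lexical action, JavaCC makes the token being built available as `matchedToken`, and its `beginLine` and `beginColumn` fields could supply the position, for example `errors.reportTooShort(matchedToken.image, matchedToken.beginLine, matchedToken.beginColumn)`, where `errors` is an assumed field declared in TOKEN_MGR_DECLS. Making that call in the MULTICHAR_SYMBOL action, at the point where the length is checked, would keep the offending name at hand; by the time BAD_TOKEN is matched, the image holds only the next character.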