Matching productions embedded in arbitrary text

Hi Guys,

I need some help with a grammar (see attached).

The grammar processes text that consists of function calls embedded anywhere, and possibly multiple times, within arbitrary text.

The FunctionCall() production is expected to be recognized when a function name (configurable via setFUNCTION_NAMES) is matched in the text. Arbitrary text is expected to be recognized by the text() production.

When I run javacc I get the following error:

    Error: Line 73, Column 5: Expansion within "(...)*" can be matched by empty string.

Can anyone help me get over the hump with this problem? If there is a better way to do what I am trying to do please let me know.

Thanks in advance for your help.

-- 
Regards,
Farrukh

Web: http://www.wellfleetsoftware.com

/**
 *
 *
 */
options {
    STATIC = false;
    IGNORE_CASE = true;
    // DEBUG_LOOKAHEAD = true;
}

PARSER_BEGIN(FunctionParser)

package com.example.function;

import java.io.Reader;
import java.io.FileInputStream;
import java.util.HashSet;
import java.util.Set;

public class FunctionParser {

    protected static Set<String> FUNCTION_NAMES = new HashSet<String>();

    public void reInit(Reader input) {
        ReInit(input);
    }

    protected boolean seeFunctionCall() {
        return "(".equals(getToken(2).image)
            && FUNCTION_NAMES.contains(getToken(1).image.toUpperCase());
    }

    public static void setFUNCTION_NAMES(Set<String> functionNames) {
        FUNCTION_NAMES = functionNames;
    }
}

PARSER_END(FunctionParser)

SKIP:
{
    " "
  | "\t"
  | "\r"
  | "\n"
}

TOKEN : /* Numeric Constants */
{
    < S_NUMBER: <FLOAT>
        | <FLOAT> ( ["e","E"] ([ "-","+"])? <FLOAT> )?
    >
  | < #FLOAT: <INTEGER>
        | <INTEGER> ( "." <INTEGER> )?
        | "." <INTEGER>
    >
  | < #INTEGER: ( <DIGIT> )+ >
  | < #DIGIT: ["0" - "9"] >
}

TOKEN:
{
    < S_IDENTIFIER: (<LETTER>)+ (<DIGIT> | <LETTER> | <SPECIAL_CHARS>)* >
  | < #LETTER: ["a"-"z", "A"-"Z"] >
  | < #SPECIAL_CHARS: "_" >
  | < S_CHAR_LITERAL: "'" (~["'"])* "'" ("'" (~["'"])* "'")* >
  | < S_QUOTED_IDENTIFIER: "\"" (~["\n","\r","\""])* "\"" >
  | < S_TEXT: ~[] >
}

void start(): {}
{
    (textOrFunctionCall())*
}

void text(): {}
{
    ( <S_TEXT> )*
}

void textOrFunctionCall(): {}
{
    text()
  | LOOKAHEAD({seeFunctionCall()}) FunctionCall()
}

/* ---------------- Grammar Productions --------------------- */

void FunctionCall(): {}
{
    FunctionReference() "(" (FunctionArgumentList())* ")"
}

void FunctionReference():
{
    String name;
}
{
    <S_IDENTIFIER> {name = token.image;}
}

void FunctionArgumentList(): {}
{
    FunctionArgument() ("," FunctionArgument())*
}

void FunctionArgument(): {}
{
    FunctionCall()
  | <S_QUOTED_IDENTIFIER>
  | <S_NUMBER>
}

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...
|
|
Re: Matching productions embedded in arbitrary text

The base problem is that in some cases a tokenizer cannot unambiguously determine the token kind. This is similar to a parsing problem with PL/I (which has no reserved keywords). In "IF(...) ... ;" IF can be either a keyword or a variable name, and which one it is cannot be determined until the token after the matching right paren is examined. There may be any number of tokens (e.g. expressions) between the parens. One approach that works is to tokenize IF as a keyword and then use a specialized lookahead routine that alters the "kind" field in the IF token to "identifier" if necessary (before it is parsed).

In this case, a similar approach using a specialized lookahead to identify the token kind (as was pointed out) is reasonably straightforward.

Farrukh Najmi wrote:
> My experience suggests that part seems easy. You just define a
> LOOKAHEAD that uses a function seeFunctionCall() where the function
> looks at next few tokens and decides whether it has a function name or
> not.
|
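The keyword-to-identifier fix-up described in the PL/I reply above can be sketched in plain Java. This is only an illustration under stated assumptions, not JavaCC-generated code: `KindFixup`, `Tok`, the kind constants, and `reclassifyIf` are hypothetical names, and the rule "IF(...) followed by = means IF is a variable" is an assumed simplification of PL/I.

```java
import java.util.ArrayList;
import java.util.List;

final class KindFixup {
    // Hypothetical token kinds, invented for this sketch.
    static final int KEYWORD_IF = 1, IDENTIFIER = 2, OTHER = 3;

    static final class Tok {
        int kind;            // mutable on purpose: the lookahead routine may change it
        final String image;
        Tok(int kind, String image) { this.kind = kind; this.image = image; }
    }

    // Builds tokens from pre-split, non-empty images; "IF" is optimistically tagged as a keyword.
    static List<Tok> tokens(String... images) {
        List<Tok> out = new ArrayList<>();
        for (String s : images) {
            int kind = s.equalsIgnoreCase("IF") ? KEYWORD_IF
                     : Character.isLetter(s.charAt(0)) ? IDENTIFIER : OTHER;
            out.add(new Tok(kind, s));
        }
        return out;
    }

    // If toks[i] is IF followed by a balanced "(...)" and then "=",
    // change its kind to IDENTIFIER before the parser consumes it.
    static void reclassifyIf(List<Tok> toks, int i) {
        if (toks.get(i).kind != KEYWORD_IF) return;
        if (i + 1 >= toks.size() || !toks.get(i + 1).image.equals("(")) return;
        int depth = 0;
        int j = i + 1;
        for (; j < toks.size(); j++) {
            String im = toks.get(j).image;
            if (im.equals("(")) depth++;
            else if (im.equals(")") && --depth == 0) break;   // matching right paren
        }
        // j is the matching ")" (or toks.size() when the parens are unbalanced).
        if (j + 1 < toks.size() && toks.get(j + 1).image.equals("=")) {
            toks.get(i).kind = IDENTIFIER;
        }
    }
}
```

The same shape applies to the function-name case in this thread: tokenize the name as an ordinary identifier, then let a lookahead routine decide what it really is from the tokens that follow.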
|
Re: Matching productions embedded in arbitrary text

I would be grateful for *any* general advice on the feasibility of my parser's goals. Is it even possible to match specific tokens intermixed with arbitrary text? Has anyone done a similar parser?

-- 
Regards,
Farrukh

Web: http://www.wellfleetsoftware.com
|
|
Re: Matching productions embedded in arbitrary text

On Oct 28, 2009, at 9:50 AM, Farrukh Najmi wrote:

> When I run javacc I get the following error:
>
> Error: Line 73, Column 5: Expansion within "(...)*" can be matched
> by empty string.

Hi Farrukh -

In this case the error message is indeed accurate; since text() is itself a "zero or more" expansion, ( <S_TEXT> )*, it can be matched by nothing at all. The enclosing (textOrFunctionCall())* loop could then just sit there in an infinite loop without consuming any input.

Let's see though... so that I understand it, assuming that "foobar" is one of the FUNCTION_NAMES in the HashSet, would this be valid input data?

texttexttextfoobar(arg1,arg2)texttexttext

Thanks,

Tom
http://generatingparserswithjavacc.com/
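To make the "matched by empty string" complaint concrete, here is a minimal hand-written analogy in plain Java. It is not what JavaCC generates; `EmptyMatchDemo`, `text`, `star`, and the `maxRounds` safety cap are hypothetical names made up for this sketch, and the '.' character merely stands in for an <S_TEXT> token.

```java
final class EmptyMatchDemo {
    // Stand-in for a production like ( <S_TEXT> )*:
    // consumes zero or more '.' characters and returns the new position.
    static int text(String in, int pos) {
        while (pos < in.length() && in.charAt(pos) == '.') pos++;
        return pos;
    }

    // Stand-in for a loop wrapped around text(): repeats while input remains.
    // maxRounds exists only so the demonstration terminates.
    // Returns how many rounds were executed.
    static int star(String in, int maxRounds) {
        int pos = 0;
        int rounds = 0;
        while (pos < in.length() && rounds < maxRounds) {
            pos = text(in, pos);   // may succeed without consuming anything
            rounds++;
        }
        return rounds;
    }
}
```

With the input "..." the loop finishes after one round. With "..x" it stops only because of the cap: `text` keeps succeeding at 'x' without consuming it. Requiring at least one token per iteration, for example ( <S_TEXT> )+ inside text(), is one common way to remove the empty match.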
|
|
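Re: Matching productions embedded in arbitrary text

I think it might need to be:

texttexttextfoobar("arg1",2.0)texttexttext

On Thu, Oct 29, 2009 at 11:55 AM, Tom Copeland <tom@...> wrote:
> Let's see though... so that I understand it, assuming that
> "foobar" is one of the FUNCTION_NAMES in the HashSet, would this
> be valid input data?
>
> texttexttextfoobar(arg1,arg2)texttexttext

-- 
- J.Chris Findlay
(c: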
|
|
Re: Matching productions embedded in arbitrary text

Hi Chris and Tom,

Thank you for your help. Please see inline below...

J.Chris Findlay wrote:
> I think it might need to be:
> texttexttextfoobar("arg1",2.0)texttexttext

The above is closer to what I intended. The only thing to add is that I think the function name would have to be preceded by white space, so here is a modified version of your example that would be valid:

texttexttext foobar("arg1",2.0)texttexttext

BTW, to provide some context, I work on the ebXML RegRep specification. The spec supports multiple query languages (e.g. SQL, EJBQL, SPARQL, XQuery, ...) and is defining a syntax for user-defined functions (UDF). Since the functions could be embedded in any query language expression, I cannot assume a specific grammar.

I have made some progress on my parser by temporarily changing the text() production to be a choice of various types of tokens for now. So any suggestion on how to handle the arbitrary text requirement would be great.

Thanks again.

-- 
Regards,
Farrukh

Web: http://www.wellfleetsoftware.com
|
|
Re: Matching productions embedded in arbitrary text

On Oct 28, 2009, at 8:56 PM, Farrukh Najmi wrote:

> I have made some progress on my parser by temporarily changing the
> text() production to be a choice of various types of tokens for now.
> So any suggestion on how to handle the arbitrary text requirement
> would be great.

To me the tricky bit here is that you don't know the function names until runtime, so an input string like this:

texttext fiddle("a", 1)texttext

would only be a FunctionCall if the name "fiddle" is in FUNCTION_NAMES, right?

With that constraint, I'm not sure how to get JavaCC to tokenize the input data. I wonder if you might want to extract the function calls using a regex. You could use a regex like ".* (" + someFunctionName + "\\(.*\\))" to extract the function call from the surrounding text, and then you could pass the match data to JavaCC to create the parse tree. This would only work if it's just a simple function call, though; it wouldn't work if the arguments themselves contained nested parentheses, because those would confuse the regex.

Yours,

Tom
http://generatingparserswithjavacc.com/
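The regex idea above might look roughly like this in Java. It is a sketch under assumptions rather than Tom's exact expression: `RegexExtract` and `firstCall` are hypothetical names, the function name is escaped with `Pattern.quote`, and the greedy `.*` is deliberately tightened to `[^()]*` so that a match cannot run past the first closing parenthesis.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexExtract {
    // Returns the first "name(...)" substring that is preceded by whitespace
    // or the start of the input, or null when no such call is found.
    // The argument list may not contain parentheses of its own.
    static String firstCall(String input, String functionName) {
        Pattern p = Pattern.compile(
            "(?:^|\\s)(" + Pattern.quote(functionName) + "\\([^()]*\\))");
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }
}
```

For the input `texttext fiddle("a", 1)texttext` and the name `fiddle`, the captured group is `fiddle("a", 1)`. For a nested call such as `outer(inner(1), 2)` the pattern finds nothing for `outer`, which is the limitation the post warns about: balanced parentheses are beyond what this kind of pattern can match.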
|
|
Re: Matching productions embedded in arbitrary text

Tom Copeland wrote:
> To me the tricky bit here is that you don't know the function names
> until runtime, so an input string like this:
>
> texttext fiddle("a", 1)texttext
>
> would only be a FunctionCall if the name "fiddle" is in
> FUNCTION_NAMES, right?

Hi Tom,

My experience suggests that part seems easy. You just define a LOOKAHEAD that uses a function seeFunctionCall(), where the function looks at the next few tokens and decides whether it has a function name or not. See example below...

String textOrFunctionCall():
{
    String rs = "";
    String ts;
}
{
    (
        LOOKAHEAD({!seeFunctionCall()}) ts = text() {rs = rs + ts;}
      | LOOKAHEAD({seeFunctionCall()}) ts = FunctionCall() {rs = rs + " " + ts + " ";}
    )
    {
        return rs;
    }
}

-- 
Regards,
Farrukh

Web: http://www.wellfleetsoftware.com
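The predicate behind that semantic lookahead can be tried outside JavaCC with a small stand-alone sketch. Everything here is an assumption for illustration: `LookaheadSketch`, `peek`, and `consume` are hypothetical stand-ins for the generated parser and its getToken(k), and token images are supplied as a ready-made list. Because the posted seeFunctionCall() upper-cases the token image before the lookup, the sketch upper-cases the registered names as well.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class LookaheadSketch {
    private final List<String> images;                           // token images in input order
    private final Set<String> functionNames = new HashSet<>();   // stored upper-case
    private int next = 0;                                        // index of the next unconsumed token

    LookaheadSketch(List<String> images, Set<String> names) {
        this.images = images;
        for (String n : names) functionNames.add(n.toUpperCase(Locale.ROOT));
    }

    // Image of the k-th token ahead (k = 1 is the next token), or "" past the end.
    String peek(int k) {
        int i = next + k - 1;
        return i < images.size() ? images.get(i) : "";
    }

    // True when the next two tokens look like "knownName (".
    boolean seeFunctionCall() {
        return "(".equals(peek(2))
            && functionNames.contains(peek(1).toUpperCase(Locale.ROOT));
    }

    // Advances past one token, as the parser would after matching it.
    void consume() { next++; }
}
```

Normalizing the names at registration time means a caller can register "foobar" in any case; with a plain set and an upper-cased lookup, a name stored in lower case would never be found.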
|
|
Re: Matching productions embedded in arbitrary text

Farrukh Najmi wrote:
> My experience suggests that part seems easy. You just define a
> LOOKAHEAD that uses a function seeFunctionCall() where the function
> looks at next few tokens and decides whether it has a function name or
> not.

Now that my parser is working with a placeholder production for matching the arbitrary text, I need to make it work for matching arbitrary text that is not a function call. To me this is the real challenge (as I said above, handling the user-configured function names is a simple issue that I already have sorted out).

Let me ask a more focused question. Is it possible to define a grammar that allows occurrences of a specified production within *arbitrary text*? If so, does anyone have a sample of what that might look like?

Thanks.

-- 
Regards,
Farrukh

Web: http://www.wellfleetsoftware.com
|
|
Re: Matching productions embedded in arbitrary text

On Oct 29, 2009, at 10:36 AM, Farrukh Najmi wrote:

> My experience suggests that part seems easy. You just define a
> LOOKAHEAD that uses a function seeFunctionCall() where the function
> looks at next few tokens and decides whether it has a function name
> or not. See example below...

Hm, yup, semantic lookahead could be useful there, good idea. But what I'm wondering about though... how do you tokenize? Do you just let every character be a separate token?

Yours,

Tom
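One way to sidestep the tokenization question is to pre-split the input by hand and give only the call segments to a parser. The sketch below is a plain-Java alternative, not a JavaCC grammar and not something proposed in the thread: `EmbeddedCallSplitter`, `split`, `isIdentPart`, and `matchParen` are hypothetical names, known function names are assumed to be supplied in upper case, and a name counts only when it is not glued to a preceding identifier character (a slightly looser rule than "preceded by white space").

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class EmbeddedCallSplitter {
    // Splits the input into segments prefixed with "TEXT:" or "CALL:".
    static List<String> split(String in, Set<String> upperCaseNames) {
        List<String> out = new ArrayList<>();
        int textStart = 0;
        int i = 0;
        while (i < in.length()) {
            boolean startsIdent = Character.isLetter(in.charAt(i))
                && (i == 0 || !isIdentPart(in.charAt(i - 1)));
            if (!startsIdent) { i++; continue; }
            int j = i;
            while (j < in.length() && isIdentPart(in.charAt(j))) j++;
            String name = in.substring(i, j).toUpperCase(Locale.ROOT);
            int end = (j < in.length() && in.charAt(j) == '(' && upperCaseNames.contains(name))
                ? matchParen(in, j) : -1;
            if (end < 0) { i = j; continue; }          // ordinary word: stays part of the text
            if (i > textStart) out.add("TEXT:" + in.substring(textStart, i));
            out.add("CALL:" + in.substring(i, end + 1));
            i = end + 1;
            textStart = i;
        }
        if (textStart < in.length()) out.add("TEXT:" + in.substring(textStart));
        return out;
    }

    static boolean isIdentPart(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Index of the ")" matching the "(" at position open, or -1 if unbalanced.
    // Parentheses inside double-quoted strings are ignored.
    static int matchParen(String in, int open) {
        int depth = 0;
        boolean quoted = false;
        for (int k = open; k < in.length(); k++) {
            char c = in.charAt(k);
            if (c == '"') quoted = !quoted;
            else if (!quoted && c == '(') depth++;
            else if (!quoted && c == ')' && --depth == 0) return k;
        }
        return -1;
    }
}
```

For `texttexttext foobar("arg1",2.0)texttexttext` with FOOBAR registered, this yields a text segment, the call `foobar("arg1",2.0)`, and a trailing text segment. Each CALL segment could then be handed to a grammar that only needs to understand FunctionCall(), so the arbitrary text never reaches the tokenizer.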