Matching productions embedded in arbitrary text

by Farrukh Najmi

Hi Guys,

I need some help with a grammar (see attached).

The grammar processes text that consists of function calls embedded
anywhere, possibly multiple times, within arbitrary text.

The FunctionCall() production is expected to be recognized when a
function name (configurable via setFUNCTION_NAMES) is matched in the text.
Arbitrary text is expected to be recognized by the text() production.

When I run javacc I get the following error:

Error: Line 73, Column 5: Expansion within "(...)*" can be matched by
empty string.

Can anyone help me get over the hump with this problem? If there is a
better way to do what I am trying to do, please let me know.

Thanks in advance for your help.

--
Regards,
Farrukh

Web: http://www.wellfleetsoftware.com



/**
 *
 *
 */

options{
    STATIC=false ;
    IGNORE_CASE=true ;
//  DEBUG_LOOKAHEAD= true ;
}

PARSER_BEGIN(FunctionParser)
package com.example.function;

import java.io.Reader;
import java.io.FileInputStream;
import java.util.HashSet;
import java.util.Set;

public class FunctionParser {
    protected static Set<String> FUNCTION_NAMES = new HashSet<String>();

    public void reInit(Reader input) {
        ReInit(input);
    }

    protected boolean seeFunctionCall() {
        return "(".equals(getToken(2).image)
            && FUNCTION_NAMES.contains(getToken(1).image.toUpperCase());
    }

    public static void setFUNCTION_NAMES(Set<String> functionNames) {
        FUNCTION_NAMES = functionNames;
    }
}
PARSER_END(FunctionParser)


SKIP:
{
    " "
|   "\t"
|   "\r"
|   "\n"
}

TOKEN : /* Numeric Constants */
{
        < S_NUMBER: <FLOAT>
            | <FLOAT> ( ["e","E"] ([ "-","+"])? <FLOAT> )?
    >
  | < #FLOAT: <INTEGER>
            | <INTEGER> ( "." <INTEGER> )?
            | "." <INTEGER>
    >
  | < #INTEGER: ( <DIGIT> )+ >
  | < #DIGIT: ["0" - "9"] >
}

TOKEN:
{
    < S_IDENTIFIER: (<LETTER>)+ (<DIGIT> | <LETTER> |<SPECIAL_CHARS>)* >
  | < #LETTER: ["a"-"z", "A"-"Z"] >
  | < #SPECIAL_CHARS: "_" >
  | < S_CHAR_LITERAL: "'" (~["'"])* "'" ("'" (~["'"])* "'")*>
  | < S_QUOTED_IDENTIFIER: "\"" (~["\n","\r","\""])* "\"" >
  | < S_TEXT: ~[] >
}

void start():
{}
{
    (textOrFunctionCall())*

}

void text():
{}
{
    ( <S_TEXT> )*
}

void textOrFunctionCall():
{}
{
      text()
    | LOOKAHEAD({seeFunctionCall()}) FunctionCall()
}

/* ---------------- Grammar Productions --------------------- */

void FunctionCall():
{}
{
    FunctionReference()  "(" (FunctionArgumentList())* ")"
}

void FunctionReference():
{
    String name;
}
{
    <S_IDENTIFIER> {name = token.image;}

}

void FunctionArgumentList():
{}
{
    FunctionArgument() ("," FunctionArgument())*
}

void FunctionArgument():
{}
{
    FunctionCall()
  | <S_QUOTED_IDENTIFIER>
  | <S_NUMBER>
}




---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...

Re: Matching productions embedded in arbitrary text

by Bill Fenlason-2

The basic problem is that in some cases, a tokenizer cannot unambiguously
determine the token kind.

This is similar to a parsing problem with PL/I (which has no reserved
keywords).  In "IF(...) ... ;"  IF can be either a keyword or a variable
name and it cannot be determined until the token after the matching
right paren is examined.  There may be any number of tokens (e.g.
expressions) between the parens.  One approach that works is to tokenize
IF as a keyword and then use a specialized lookahead routine that alters
the "kind" field in the IF token to "identifier" if necessary (before it
is parsed).

In this case, a similar approach using a specialized lookahead to
identify the token kind (as was pointed out) is reasonably
straightforward.
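A minimal sketch of that idea in plain Java, outside JavaCC. The `Tok` class, the `Kind` enum, and the `rekind` routine are hypothetical names invented for illustration (the Token class that JavaCC generates is richer). Here the direction is the reverse of the PL/I case: names are tokenized as plain identifiers first, and a lookahead pass promotes them to function names when they are registered and followed by "(".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class TokenRekindSketch {
    enum Kind { IDENTIFIER, FUNCTION_NAME, LPAREN, OTHER }

    /** Hypothetical minimal token; not the class JavaCC generates. */
    static final class Tok {
        Kind kind;              // mutable on purpose: the lookahead pass may change it
        final String image;
        Tok(Kind kind, String image) { this.kind = kind; this.image = image; }
    }

    /**
     * Promotes IDENTIFIER to FUNCTION_NAME when the name is registered
     * (compared in upper case) and the very next token is "(".
     */
    static void rekind(List<Tok> tokens, Set<String> upperCaseNames) {
        for (int i = 0; i + 1 < tokens.size(); i++) {
            Tok t = tokens.get(i);
            if (t.kind == Kind.IDENTIFIER
                    && tokens.get(i + 1).kind == Kind.LPAREN
                    && upperCaseNames.contains(t.image.toUpperCase(Locale.ROOT))) {
                t.kind = Kind.FUNCTION_NAME;
            }
        }
    }

    /** Convenience for experiments: runs rekind and returns the resulting kinds. */
    static List<Kind> kindsAfterRekind(List<Tok> tokens, Set<String> upperCaseNames) {
        rekind(tokens, upperCaseNames);
        List<Kind> kinds = new ArrayList<>();
        for (Tok t : tokens) {
            kinds.add(t.kind);
        }
        return kinds;
    }
}
```

After the pass, the parser can branch on the token kind alone, so the set of function names never has to be known when the tokenizer is generated.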


Re: Matching productions embedded in arbitrary text

by Farrukh Najmi


I would be grateful for *any* general advice on the feasibility of my
parser's goals. Is it even possible to match specific tokens intermixed
with arbitrary text? Has anyone done a similar parser?


--
Regards,
Farrukh

Web: http://www.wellfleetsoftware.com




Re: Matching productions embedded in arbitrary text

by Tom Copeland


On Oct 28, 2009, at 9:50 AM, Farrukh Najmi wrote:

> When I run javacc I get the following error:
>
> Error: Line 73, Column 5: Expansion within "(...)*" can be matched
> by empty string.

Hi Farrukh -

In this case the error message is indeed accurate; since (anything)*  
is a "zero or more" expansion it can be matched by nothing at all.  
Thus it could just sit there in an infinite loop.
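Concretely, in the grammar as posted, text() is ( <S_TEXT> )*, so it can succeed without consuming any token, and start() wraps it in another (...)* loop. The plain-Java sketch below mimics that shape to show why the outer loop can stop making progress, and how requiring at least one token per iteration avoids it. The class and method names are hypothetical, and the iteration cap exists only so the demonstration terminates.

```java
import java.util.List;

final class EmptyMatchLoopSketch {
    /** Mimics ( <S_TEXT> )* : consumes zero or more "T" tokens and returns the new position. */
    static int textStar(List<String> tokens, int pos) {
        while (pos < tokens.size() && tokens.get(pos).equals("T")) {
            pos++;
        }
        return pos;
    }

    /** Mimics ( <S_TEXT> )+ : must consume at least one "T" token, otherwise returns -1. */
    static int textPlus(List<String> tokens, int pos) {
        int end = textStar(tokens, pos);
        return end > pos ? end : -1;
    }

    /**
     * Mimics ( text() )* with the star version of text(). Counts iterations that
     * consumed nothing, giving up after `limit` of them. Without the cap the
     * loop would never end once it reaches a token that text() cannot consume.
     */
    static int emptyIterationsWithStar(List<String> tokens, int limit) {
        int pos = 0;
        int empty = 0;
        while (pos < tokens.size() && empty < limit) {
            int next = textStar(tokens, pos);
            if (next == pos) {
                empty++;
            }
            pos = next;
        }
        return empty;
    }

    /** Mimics ( text() )* with the plus version: the loop exits as soon as text() fails. */
    static int stopPositionWithPlus(List<String> tokens) {
        int pos = 0;
        while (true) {
            int next = textPlus(tokens, pos);
            if (next < 0) {
                return pos;
            }
            pos = next;
        }
    }
}
```

In grammar terms, the analogous change would be to make the production inside the outer loop consume at least one token, for example by matching one or more text tokens instead of zero or more.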

Let's see though... so that I understand it, assuming that "foobar" is  
one of the FUNCTION_NAMES in the HashSet, would this be valid input  
data?

texttexttextfoobar(arg1,arg2)texttexttext

Thanks,

Tom
http://generatingparserswithjavacc.com/



Re: Matching productions embedded in arbitrary text

by J.Chris Findlay

I think it might need to be:
texttexttextfoobar("arg1",2.0)texttexttext

On Thu, Oct 29, 2009 at 11:55 AM, Tom Copeland <tom@...> wrote:

> Let's see though... so that I understand it, assuming that "foobar" is
> one of the FUNCTION_NAMES in the HashSet, would this be valid input
> data?
>
> texttexttextfoobar(arg1,arg2)texttexttext

--
- J.Chris Findlay
  (c:

Re: Matching productions embedded in arbitrary text

by Farrukh Najmi

Hi Chris and Tom,

Thank you for your help. Please see inline below...

J.Chris Findlay wrote:
> I think it might need to be:
> texttexttextfoobar("arg1",2.0)texttexttext

The above is closer to what I intended. The only thing to add is that
I think the function name would have to be preceded by white space, so
here is a modified version of your example that would be valid:

texttexttext foobar("arg1",2.0)texttexttext

BTW, to provide some context, I work on the ebXML RegRep specification. The
spec supports multiple query languages (e.g. SQL, EJBQL, SPARQL,
XQuery, ...) and is defining a syntax for user-defined functions (UDFs).
Since the functions could be embedded in any query language expression,
I cannot assume a specific grammar.

I have made some progress on my parser by temporarily changing the
text() production to be a choice of various types of tokens for now.
So any suggestion on how to handle the arbitrary text requirement would
be great.

Thanks again.


--
Regards,
Farrukh

Web: http://www.wellfleetsoftware.com




Re: Matching productions embedded in arbitrary text

by Tom Copeland


On Oct 28, 2009, at 8:56 PM, Farrukh Najmi wrote:

> So any suggestion on how to handle the arbitrary text requirement  
> would be great.

To me the tricky bit here is that you don't know the function names  
until runtime, so an input string like this:

texttext fiddle("a", 1)texttext

would only be a FunctionCall if the name "fiddle" is in  
FUNCTION_NAMES, right?

With that constraint, I'm not sure how to get JavaCC to tokenize the
input data.  I wonder if you might want to extract the function
declarations using a regex.  You could use a regex like ".* (" +
someFunctionName + "\\(.*\\))" to extract the function declaration from
the surrounding text, and then you could pass the match data to JavaCC
to create the parse tree.  This would only work if it's just a
function call, though; it wouldn't work if the function body were
there, because that could include nested parentheses, which would
confuse the regex.
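A hedged sketch of this extraction idea using java.util.regex. To cope with the nested parentheses mentioned above, it uses the regex only to find where a known function name followed by "(" starts, and then finds the matching ")" by counting depth. The class and method names are hypothetical; the rule that a name must follow whitespace (or start the input) and the case-insensitive match are assumptions taken from the thread and from IGNORE_CASE in the posted grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class FunctionCallExtractorSketch {
    /** Returns the text of each outermost call to a known function, in input order. */
    static List<String> extractCalls(String input, Set<String> functionNames) {
        List<String> calls = new ArrayList<>();
        if (functionNames.isEmpty()) {
            return calls;
        }
        // Quote each name so regex metacharacters in a name are matched literally.
        String names = functionNames.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // The name must start the input or follow whitespace, and be followed by "(".
        Pattern start = Pattern.compile("(?<!\\S)(?:" + names + ")\\s*\\(",
                Pattern.CASE_INSENSITIVE);
        Matcher m = start.matcher(input);
        int from = 0;
        while (from <= input.length() && m.find(from)) {
            int close = matchingParen(input, m.end() - 1);
            if (close < 0) {
                from = m.end();      // unbalanced: treat as plain text and move on
                continue;
            }
            calls.add(input.substring(m.start(), close + 1));
            from = close + 1;        // nested calls are part of the call just added
        }
        return calls;
    }

    /** Index of the ")" matching the "(" at `open`, ignoring parens inside "..."; -1 if none. */
    static int matchingParen(String s, int open) {
        int depth = 0;
        boolean inQuotes = false;
        for (int i = open; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (!inQuotes && c == '(') {
                depth++;
            } else if (!inQuotes && c == ')' && --depth == 0) {
                return i;
            }
        }
        return -1;
    }
}
```

Each extracted string could then be handed to the JavaCC parser on its own, which keeps the arbitrary surrounding text out of the grammar entirely.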

Yours,

Tom
http://generatingparserswithjavacc.com/




Re: Matching productions embedded in arbitrary text

by Farrukh Najmi

Tom Copeland wrote:

> To me the tricky bit here is that you don't know the function names
> until runtime, so an input string like this:
>
> texttext fiddle("a", 1)texttext
>
> would only be a FunctionCall if the name "fiddle" is in
> FUNCTION_NAMES, right?

Hi Tom,

My experience suggests that part is easy. You just define a LOOKAHEAD
that calls a function seeFunctionCall(), which looks at the next few
tokens and decides whether it has a function name or not. See the
example below...

String textOrFunctionCall():
{
    String rs = "";
    String ts;
}
{
    (
      // Semantic lookahead: take the text() branch unless a known
      // function name followed by "(" comes next.
      LOOKAHEAD({!seeFunctionCall()}) ts = text() {rs = rs + ts;}
    | LOOKAHEAD({seeFunctionCall()}) ts = FunctionCall() {rs = rs + " " + ts + " ";}
    )
    {
        return rs;
    }
}

--
Regards,
Farrukh

Web: http://www.wellfleetsoftware.com




Re: Matching productions embedded in arbitrary text

by Farrukh Najmi

Now that my parser is working with a placeholder production for
matching the arbitrary text, I need to make it match truly
arbitrary text that is not a function call. To me this is the real
challenge (as I said above, the user-configured function names are a
simple issue that I already have sorted out).

Let me ask a more focused question.

Is it possible to define a grammar that allows occurrences of a
specified production within *arbitrary text*? If so, does anyone have
a sample of what that might look like?
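One way to picture such a grammar, sketched in plain Java rather than JavaCC: tokenize so that any character not belonging to an identifier, number, or quoted string becomes its own token, then loop over "function call or any single token", using the seeFunctionCall() check to choose. Every name below is hypothetical. Two simplifications are assumptions of the sketch: whitespace is kept as tokens so the text can be reproduced, and nested arguments count as calls only when their names are registered too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class IslandParserSketch {
    private final List<String> toks;
    private final Set<String> names;   // function names, upper-cased by the caller
    private int pos = 0;

    private IslandParserSketch(List<String> toks, Set<String> names) {
        this.toks = toks;
        this.names = names;
    }

    /** Splits input into "TEXT:..." and "CALL:..." segments, in input order. */
    static List<String> split(String input, Set<String> upperCaseNames) {
        return new IslandParserSketch(tokenize(input), upperCaseNames).parse();
    }

    /** Tokens: identifiers, numbers, "quoted" strings; any other character is its own token. */
    static List<String> tokenize(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int j = i + 1;
            if (Character.isLetter(c)) {
                while (j < s.length() && (Character.isLetterOrDigit(s.charAt(j)) || s.charAt(j) == '_')) j++;
            } else if (Character.isDigit(c)) {
                while (j < s.length() && (Character.isDigit(s.charAt(j)) || s.charAt(j) == '.')) j++;
            } else if (c == '"') {
                int close = s.indexOf('"', j);
                if (close >= 0) j = close + 1;   // an unterminated quote stays a one-char token
            }
            out.add(s.substring(i, j));
            i = j;
        }
        return out;
    }

    private boolean seeFunctionCall() {
        return pos + 1 < toks.size()
                && names.contains(toks.get(pos).toUpperCase(Locale.ROOT))
                && toks.get(pos + 1).equals("(");
    }

    /** start := ( functionCall | anyToken )* ; every iteration consumes at least one token. */
    private List<String> parse() {
        List<String> segments = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        while (pos < toks.size()) {
            int mark = pos;
            if (seeFunctionCall() && functionCall()) {
                if (text.length() > 0) { segments.add("TEXT:" + text); text.setLength(0); }
                segments.add("CALL:" + String.join("", toks.subList(mark, pos)));
            } else {
                pos = mark;                      // not a well-formed call: keep it as text
                text.append(toks.get(pos++));
            }
        }
        if (text.length() > 0) segments.add("TEXT:" + text);
        return segments;
    }

    /** functionCall := name "(" [ argument ( "," argument )* ] ")" */
    private boolean functionCall() {
        pos += 2;                                // the name and "("
        skipSpaces();
        if (accept(")")) return true;
        do {
            skipSpaces();
            if (!argument()) return false;
            skipSpaces();
        } while (accept(","));
        return accept(")");
    }

    /** argument := functionCall | quoted string | number */
    private boolean argument() {
        if (pos >= toks.size()) return false;
        if (seeFunctionCall()) return functionCall();
        String t = toks.get(pos);
        boolean quoted = t.charAt(0) == '"' && t.length() >= 2;
        if (quoted || Character.isDigit(t.charAt(0))) { pos++; return true; }
        return false;
    }

    private void skipSpaces() { while (pos < toks.size() && toks.get(pos).trim().isEmpty()) pos++; }
    private boolean peek(String t) { return pos < toks.size() && toks.get(pos).equals(t); }
    private boolean accept(String t) { if (peek(t)) { pos++; return true; } return false; }
}
```

Because an identifier is tokenized as a whole, a registered name glued to preceding letters (as in texttexttextfoobar) is never seen as a function name, which matches the white-space requirement mentioned earlier in the thread.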

Thanks.

--
Regards,
Farrukh

Web: http://www.wellfleetsoftware.com




Re: Matching productions embedded in arbitrary text

by Tom Copeland


On Oct 29, 2009, at 10:36 AM, Farrukh Najmi wrote:

> Hi Tom,
>
> My experience suggests that part seems easy. You just define a  
> LOOKAHEAD that uses a function seeFunctionCall() where the function  
> looks at next few tokens and decides whether it has a function name  
> or not. See example below...

Hm, yup, semantic lookahead could be useful there, good idea.  What
I'm still wondering about, though, is how you tokenize.  Do you just
let every character be a separate token?

Yours,

Tom

