Grammar for parsing regular expression

View: New views
9 Messages — Rating Filter:   Alert me  

Grammar for parsing regular expression

by Krzysztof Langner :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hello,

I'm working on grammar for ActionScript language. I learned a lot from java grammar include with javacc examples.
Now I want to parse regular expression. My first idea was to create token similar to string token, but it looks that it isn't possible to decide was is regex and what is math expression on the token level.

Since there are regular expression in java I was looking for examples in java 1.5 grammar.
No luck.

Since I don't want to reinvent wheel :-), anyone knows where I can find grammar for javacc with regular expression parsing?

Thank you in advance
Krzysztof

RE: Grammar for parsing regular expression

by Laughing Man :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Some parts of this message have been removed. Learn more about Nabble's security policy.
I find it difficult to believe this is impossible, but then I don't know ActionScript. Still, the fact that ActionScript exists as a language suggests it is possible to parse it.

I'm going to guess that the answer is to use lexical states.  That's how the same characters can have different meanings in different situations.





Date: Wed, 15 Apr 2009 14:04:05 +0200
From: klangner@...
To: users@...
Subject: [JavaCC] Grammar for parsing regular expression

Hello,

I'm working on grammar for ActionScript language. I learned a lot from java grammar include with javacc examples.
Now I want to parse regular expression. My first idea was to create token similar to string token, but it looks that it isn't possible to decide was is regex and what is math expression on the token level.

Since there are regular expression in java I was looking for examples in java 1.5 grammar.
No luck.

Since I don't want to reinvent wheel :-), anyone knows where I can find grammar for javacc with regular expression parsing?

Thank you in advance
Krzysztof


Get news, entertainment and everything you care about at Live.com. Check it out!

Re: Grammar for parsing regular expression

by Randall Schulz :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Wednesday April 15 2009, Krzysztof Langner wrote:
> Hello,
>
> I'm working on grammar for ActionScript language. ...
>
> Since there are regular expression in java I was looking
> for examples in java 1.5 grammar. No luck.

Java doesn't have a special RE syntax. They're encoded in
and compiled from plain old strings.

Also, since you mention ActionScript (of which I know nothing),
I'll point out that there are a great many different surface
syntaxes for notating regular expressions.


> Since I don't want to reinvent wheel :-), anyone knows where I can
> find grammar for javacc with regular expression parsing?

Specifically the REs of ActionScript?


> Thank you in advance
> Krzysztof


Randall Schulz

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...


Re: Grammar for parsing regular expression

by Krzysztof Langner :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Yes you are right. I forgot that in java you put regex in strings. In action script RE syntax is:

myVariable = /abc/i

So even this syntax is valid:
myVariable = /"/

The problem is that it is similar to math expressions like 2/3/4.
Because of it I can't define it as TOKEN (and have problems with strings too).

I think the same syntax is in javascript and in ruby maybe in other languages too. Thats why I'm looking for same grammar similar to AS to learn how this problem can be solved.

Thank you
Krzysztof

On Wed, Apr 15, 2009 at 16:00, Randall R Schulz <rschulz@...> wrote:
On Wednesday April 15 2009, Krzysztof Langner wrote:
> Hello,
>
> I'm working on grammar for ActionScript language. ...
>
> Since there are regular expression in java I was looking
> for examples in java 1.5 grammar. No luck.

Java doesn't have a special RE syntax. They're encoded in
and compiled from plain old strings.

Also, since you mention ActionScript (of which I know nothing),
I'll point out that there are a great many different surface
syntaxes for notating regular expressions.


> Since I don't want to reinvent wheel :-), anyone knows where I can
> find grammar for javacc with regular expression parsing?

Specifically the REs of ActionScript?


> Thank you in advance
> Krzysztof


Randall Schulz

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...



Re: Grammar for parsing regular expression

by Tom Copeland :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Apr 15, 2009, at 11:11 AM, Krzysztof Langner wrote:

> Yes you are right. I forgot that in java you put regex in strings.  
> In action script RE syntax is:
>
> myVariable = /abc/i
>
> So even this syntax is valid:
> myVariable = /"/
>
> The problem is that it is similar to math expressions like 2/3/4.
> Because of it I can't define it as TOKEN (and have problems with  
> strings too).
>
> I think the same syntax is in javascript and in ruby maybe in other  
> languages too. Thats why I'm looking for same grammar similar to AS  
> to learn how this problem can be solved.

There's an ecmascript grammar that's got a lexical-state based  
tokenizer for regular expressions; I'll send it to you,

Yours,

Tom
http://generatingparserswithjavacc.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...


Re: Grammar for parsing regular expression

by Randall Schulz :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Wednesday April 15 2009, Krzysztof Langner wrote:
> Yes you are right. I forgot that in java you put regex in strings. In
> action script RE syntax is:
>
> myVariable = /abc/i
>
> So even this syntax is valid:
> myVariable = /"/

As in Groovy.


> The problem is that it is similar to math expressions like 2/3/4.
> Because of it I can't define it as TOKEN (and have problems with
> strings too).

You will never, ever be able to write a lexeme / token definition (a
regular expression) to recognize the syntax of regular expressions.
Parsing regular expressions requires a push-down automata, owing to the
balanced, nestable constructs such as parentheses and (in certain
dialects) character classes and its use of infix operators (i.e.,
alternation).

So you can save yourself time and energy by accepting the fundamental
fact that REs cannot be parsed by REs.


> ...
>
> Thank you
> Krzysztof


Randall Schulz

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...


Re: Grammar for parsing regular expression

by J.Chris Findlay :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Thu, Apr 16, 2009 at 3:48 AM, Randall R Schulz <rschulz@...> wrote:
> So you can save yourself time and energy by accepting the fundamental
> fact that REs cannot be parsed by REs.

At least, not all at once (c:
(Though PCRE-style RE's can get close, and given a bounded RE text
(i.e. extracted from everything else by some lexical state change or
similar), are a great help at splitting it up.)

When I built a PCRE parser, it was its own animal (hand-coded state
machine) and only dealt with RE's.  To parse RE's out of something
else its much easier (read only possible) to do it as either a lexical
state subset or a separate parser, depending on things like how
complex this particular dialect of RE's is.

--
- J.Chris Findlay
  (c:

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@...
For additional commands, e-mail: users-help@...


Re: Grammar for parsing regular expression

by Krzysztof Langner :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


You will never, ever be able to write a lexeme / token definition (a
regular expression) to recognize the syntax of regular expressions.
Parsing regular expressions requires a push-down automata, owing to the
balanced, nestable constructs such as parentheses and (in certain
dialects) character classes and its use of infix operators (i.e.,
alternation).

I know. I don't want to parse RE with lexer (tokens). It would be enough for me to just represent RE as token.
It could almost be done with this simple definition:

< REGEX_LITERAL:
      "/"
      (   (~["/","\n","\r"]) | ("\\" "/")
      )*
      "/"
  >

And the problem is not with RE syntax. But with math expression. Because now int this expression:

my_variable = 2/3/4;

I'll get REGEX_LITERAL token.

But Tom Copeland  helped me with this problem by sending me ECMAScript grammar. It looks that the problem is solved there on the parser level.

Thank you all for help :-)
Krzysztof

RE: Grammar for parsing regular expression

by Laughing Man :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Some parts of this message have been removed. Learn more about Nabble's security policy.
One option, is to simply treat regex as any other variable, and leave handling regular expressions for later. In otherwords, just treat anything after an equal sign as a generic "assignedValue" token (or something with some better name).

This does mean that a lot of invalid assignments can be tokenized, but you can catch those later, some other way than with the tokenizer.

Date: Thu, 16 Apr 2009 17:08:48 +0200
From: klangner@...
To: users@...
Subject: Re: [JavaCC] Grammar for parsing regular expression


You will never, ever be able to write a lexeme / token definition (a
regular expression) to recognize the syntax of regular expressions.
Parsing regular expressions requires a push-down automata, owing to the
balanced, nestable constructs such as parentheses and (in certain
dialects) character classes and its use of infix operators (i.e.,
alternation).

I know. I don't want to parse RE with lexer (tokens). It would be enough for me to just represent RE as token.
It could almost be done with this simple definition:

< REGEX_LITERAL:
      "/"
      (   (~["/","\n","\r"]) | ("\\" "/")
      )*
      "/"
  >

And the problem is not with RE syntax. But with math expression. Because now int this expression:

my_variable = 2/3/4;

I'll get REGEX_LITERAL token.

But Tom Copeland  helped me with this problem by sending me ECMAScript grammar. It looks that the problem is solved there on the parser level.

Thank you all for help :-)
Krzysztof


Invite your mail contacts to join your friends list with Windows Live Spaces. It's easy! Try it!