parser combinators vs. regex question

View: New views
10 Messages — Rating Filter:   Alert me  

parser combinators vs. regex question

by ArtemGr :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I'm trying to do a CSV parser: http://gist.github.com/115557

A theoretical question:

The following regex: /(?xs) ("(.*?)"|) ; ("(.*?)"|) (?: \r\n | \z )/
have a nice property of ignoring any double quotes in the file which aren't
followed by either a semicolon, and end-of-line or an end-of-file.
In particular, I can pass "\"\"\";" - and it will be a valid CSV file
containing a double-quote in the first column.

If I'm not mistaking, it's called backtracking:
the regular expression engine can see that the second double-quote wasn't not
the proper match and will skip to the third double-quote, which is followed by
a semi-colon.

I have tried to do something similar with parser combinators,
e.g. with
  def stringInQuotes = "" | ('"' ~ rep ("(?s).".r) ~ '"'
    ^^ {case _ ~ chars ~ _ => chars.mkString ("")})
or
  def stringInQuotes = opt ('"' ~ rep (elem("value", (c: Char) => true)) ~ '"')
    ^^ {case None => ""; case Some (_ ~ chars ~ _) => chars.mkString ("")}

but the first doesn't do what expected
and the second goes into an infinite cycle.

Is it possible to do that kind of backtracking with the current (2.7.4) release
of the combinators and if so, then what i'm doing wrong?


Re: parser combinators vs. regex question

by ArtemGr :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

ArtemGr <artemciy@...> writes:

> I have tried to do something similar with parser combinators,
> e.g. with
>   def stringInQuotes = "" | ('"' ~ rep ("(?s).".r) ~ '"'
>     ^^ {case _ ~ chars ~ _ => chars.mkString ("")})
> or
>   def stringInQuotes = opt ('"' ~ rep (elem("value", (c: Char) => true)) ~ '"')
>     ^^ {case None => ""; case Some (_ ~ chars ~ _) => chars.mkString ("")}
>
> but the first doesn't do what expected
> and the second goes into an infinite cycle.

As a side note, it would be good if the
implicit def regex(r: Regex): Parser[String]
method in RegexParsers produced a matcher with access to the matched
groups, instead of just a string.
Then it will be possible to use the regex backtracking features locally inside
the parser combinators.


Re: parser combinators vs. regex question

by Paul Phillips-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Thu, May 21, 2009 at 05:53:35PM +0000, ArtemGr wrote:
> The following regex: /(?xs) ("(.*?)"|) ; ("(.*?)"|) (?: \r\n | \z )/
> have a nice property of ignoring any double quotes in the file which aren't
> followed by either a semicolon, and end-of-line or an end-of-file.
> In particular, I can pass "\"\"\";" - and it will be a valid CSV file
> containing a double-quote in the first column.

For 2.8 the guard combinator has been added, so you could say:

  '"' ~> rep(chars) <~ '"' <~ guard("\r\n" | EOF)

but in 2.7 you can achieve the same effect with double negation:

  '"' ~> rep(chars) <~ '"' <~ not(not("\r\n" | EOF))

This is pseudocode but you should be able to get it going.  Neither
guard nor not consume any input, they only place constraints on a match.

--
Paul Phillips      | Eschew mastication.
Future Perfect     |
Empiricist         |
ha! spill, pupil   |----------* http://www.improving.org/paulp/ *----------

Re: parser combinators vs. regex question

by ArtemGr :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Paul Phillips <paulp@...> writes:

> For 2.8 the guard combinator has been added, so you could say:
>
>   '"' ~> rep(chars) <~ '"' <~ guard("\r\n" | EOF)
>
> but in 2.7 you can achieve the same effect with double negation:
>
>   '"' ~> rep(chars) <~ '"' <~ not(not("\r\n" | EOF))
>
> This is pseudocode but you should be able to get it going.  Neither
> guard nor not consume any input, they only place constraints on a match.

Thanks for the hint.
I think the guard is a look-ahead construct,
it doesn't answer the backtracking question in any way.

Intuitively, since there is a disjunction which effectively does a backtracking
(e.g. if the first Parser didn't work, disj. should return to the same position
and try the second Parser), I think replacing

def stringInQuotes = """(?xs) ".*?" |""".r ^^ {
 case qstr => if (qstr.length != 0) qstr.substring (1, qstr.length - 1) else ""}
def line = stringInQuotes ~ ';' ~ stringInQuotes ~ (CRLF | EOF) ^^ {
 case col1 ~ _ ~ col2 ~ _ => col1 :: col2 :: Nil}

with

def chars1 = rep ("(?s).".r) ^^ {case chars_ => chars_ mkString ""}
def chars2 = rep (elem("value", (c: Char) => true)) ^^ {
 case chars_ => chars_ mkString ""}
def col1 = ('"' ~ chars1 ~ "\";" ^^ {case _ ~ value ~ _ => value}
 | ";" ^^ {case _ => ""})
def col2 = ('"' ~ chars1 ~ ("\"" ~ (CRLF | EOF)) ^^ {
 case _ ~ value ~ _ => value}
 | (CRLF | EOF) ^^ {case _ => ""})
def line = col1 ~ col2 ^^ {case v1 ~ v2 => v1 :: v2 :: Nil}

should just work.
However, like I said earlier, with chars1 it gives wrong results, e.g.

[1.1] failure: `";' expected but `' found
with input "\"qq\nqq\";"

and with chars2 it goes into an infinite loop...



Re: Re: parser combinators vs. regex question

by Paul Phillips-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Thu, May 21, 2009 at 08:01:28PM +0000, ArtemGr wrote:
> [snip]

The degree to which you are overcomplicating this by trying to apply
regexp backtracking defies description.  Your code is too difficult to
read, but what you're trying to do can be done in one or two lines.

> I think the guard is a look-ahead construct, it doesn't answer the
> backtracking question in any way.

What do you think happens when the guard fails? Backtracking is what the
combinators do if you use the backtracking ops (that's | among others.)  
Attempting to reimplement backtracking with regexps inside the
combinator framework is like throwing away your sword so you can fight
off the marauders with a lampshade.

--
Paul Phillips      | The most dangerous man to any government is the man who
Apatheist          | is able to think things out [...] Almost inevitably he
Empiricist         | comes to the conclusion that the government he lives under
pp: i haul pills   | is dishonest, insane, intolerable.   -- H. L. Mencken

Re: parser combinators vs. regex question

by ArtemGr :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Paul Phillips <paulp@...> writes:
> On Thu, May 21, 2009 at 08:01:28PM +0000, ArtemGr wrote:
> > [snip]
>
> The degree to which you are overcomplicating this by trying to apply
> regexp backtracking defies description.

I'm not "trying to apply" regex backtracking, i've already said it isn't
very convenient without access to subgroups.

> Your code is too difficult to
> read, but what you're trying to do can be done in one or two lines.

> > I think the guard is a look-ahead construct, it doesn't answer the
> > backtracking question in any way.
>
> What do you think happens when the guard fails? Backtracking is what the
> combinators do if you use the backtracking ops (that's | among others.)

That's what I implied in the previous post, by saying that disjunction should
use backtracking. Thanks for clarifying. That answers the theoretical
half of my question.

> Attempting to reimplement backtracking with regexps inside the
> combinator framework is like throwing away your sword so you can fight
> off the marauders with a lampshade.

I think the comparison would be rather of throwing away the spare parts of a
calculator in order to fight the maradeurs with the old good regex sword.
After all, the simple and intuitive regex
"""(?xs) ("(.*?)"|) ; ("(.*?)"|) (?: \r?\n | \z ) """
which took me one minute to write - works, and parser combinators, after 12
to 16 hours of tweaking are either fail unexpectedly or go skyrocket into
infinite loop.



Re: parser combinators vs. regex question

by ArtemGr :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

ArtemGr <artemciy@...> writes:
> Is it possible to do that kind of backtracking with the current (2.7.4)
> release of the combinators and if so, then what i'm doing wrong?

I have found an interesting comment
in the javadoc of disjunction method "Parser.|":

<p>`p | q' succeeds if `p' succeeds or `q' succeeds
Note that `q' is only tried if `p's failure is non-fatal
(i.e., back-tracking is allowed).</p>

It implies that backtracking is not always allowed.

I wonder what kind of parsers might produce a "fatal failure" and why?


Re: Re: parser combinators vs. regex question

by Johannes Rudolph-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

A not-matching branch of an alternative is non-fatal.
For better error handling you might anticipate and introduce error
branches (alternatives) to give better error messages. These are
fatal, since you don't want parsing to continue in these cases.

On Fri, May 22, 2009 at 10:49 AM, ArtemGr <artemciy@...> wrote:

> ArtemGr <artemciy@...> writes:
>> Is it possible to do that kind of backtracking with the current (2.7.4)
>> release of the combinators and if so, then what i'm doing wrong?
>
> I have found an interesting comment
> in the javadoc of disjunction method "Parser.|":
>
> <p>`p | q' succeeds if `p' succeeds or `q' succeeds
> Note that `q' is only tried if `p's failure is non-fatal
> (i.e., back-tracking is allowed).</p>
>
> It implies that backtracking is not always allowed.
>
> I wonder what kind of parsers might produce a "fatal failure" and why?
>
>



--
Johannes

-----------------------------------------------
Johannes Rudolph
http://virtual-void.net

Re: Re: parser combinators vs. regex question

by Randall Schulz :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Friday May 22 2009, Johannes Rudolph wrote:
> A not-matching branch of an alternative is non-fatal.
> For better error handling you might anticipate and introduce error
> branches (alternatives) to give better error messages. These are
> fatal, since you don't want parsing to continue in these cases.

Unless, of course, you do. I tend to think that nothing is worse than a
prematurely terminated parse. It's at least as bad as poor error
messages.

There are typically some (often many) errors that don't render the rest
of the parse impossible or meaningless. When I write parsers, I always
try to include as many error productions as possible.


Randall Schulz

Re: Re: parser combinators vs. regex question

by Manohar Jonnalagedda :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Thu, May 21, 2009 at 10:44 AM, ArtemGr <artemciy@...> wrote:
ArtemGr <artemciy@...> writes:
As a side note, it would be good if the
implicit def regex(r: Regex): Parser[String]
method in RegexParsers produced a matcher with access to the matched
groups, instead of just a string.

I have also found the need for this, and have come up with the following solution as a makeshift change, as you can define a parser that does this for you :

  def regexMatch(r : Regex) : Parser[Match] = Parser { in => regex(r)(in) match {
    case Success(aString, theRest) => Success(r.findFirstMatchIn(aString).get, theRest)
    case f@Failure(_,_) => f
    case e@Error(_,_) => e
  }}

The other solution is to change the implicit regex method in RegexParsers itself.

I was also wondering if there is a better solution to parsing different groups.

Thanks,
Manohar