|
View:
New views
10 Messages
—
Rating Filter:
Alert me
|
|
|
parser combinators vs. regex questionI'm trying to do a CSV parser: http://gist.github.com/115557
A theoretical question: The following regex: /(?xs) ("(.*?)"|) ; ("(.*?)"|) (?: \r\n | \z )/ have a nice property of ignoring any double quotes in the file which aren't followed by either a semicolon, and end-of-line or an end-of-file. In particular, I can pass "\"\"\";" - and it will be a valid CSV file containing a double-quote in the first column. If I'm not mistaking, it's called backtracking: the regular expression engine can see that the second double-quote wasn't not the proper match and will skip to the third double-quote, which is followed by a semi-colon. I have tried to do something similar with parser combinators, e.g. with def stringInQuotes = "" | ('"' ~ rep ("(?s).".r) ~ '"' ^^ {case _ ~ chars ~ _ => chars.mkString ("")}) or def stringInQuotes = opt ('"' ~ rep (elem("value", (c: Char) => true)) ~ '"') ^^ {case None => ""; case Some (_ ~ chars ~ _) => chars.mkString ("")} but the first doesn't do what expected and the second goes into an infinite cycle. Is it possible to do that kind of backtracking with the current (2.7.4) release of the combinators and if so, then what i'm doing wrong? |
|
|
Re: parser combinators vs. regex questionArtemGr <artemciy@...> writes:
> I have tried to do something similar with parser combinators, > e.g. with > def stringInQuotes = "" | ('"' ~ rep ("(?s).".r) ~ '"' > ^^ {case _ ~ chars ~ _ => chars.mkString ("")}) > or > def stringInQuotes = opt ('"' ~ rep (elem("value", (c: Char) => true)) ~ '"') > ^^ {case None => ""; case Some (_ ~ chars ~ _) => chars.mkString ("")} > > but the first doesn't do what expected > and the second goes into an infinite cycle. As a side note, it would be good if the implicit def regex(r: Regex): Parser[String] method in RegexParsers produced a matcher with access to the matched groups, instead of just a string. Then it will be possible to use the regex backtracking features locally inside the parser combinators. |
|
|
Re: parser combinators vs. regex questionOn Thu, May 21, 2009 at 05:53:35PM +0000, ArtemGr wrote:
> The following regex: /(?xs) ("(.*?)"|) ; ("(.*?)"|) (?: \r\n | \z )/ > have a nice property of ignoring any double quotes in the file which aren't > followed by either a semicolon, and end-of-line or an end-of-file. > In particular, I can pass "\"\"\";" - and it will be a valid CSV file > containing a double-quote in the first column. For 2.8 the guard combinator has been added, so you could say: '"' ~> rep(chars) <~ '"' <~ guard("\r\n" | EOF) but in 2.7 you can achieve the same effect with double negation: '"' ~> rep(chars) <~ '"' <~ not(not("\r\n" | EOF)) This is pseudocode but you should be able to get it going. Neither guard nor not consume any input, they only place constraints on a match. -- Paul Phillips | Eschew mastication. Future Perfect | Empiricist | ha! spill, pupil |----------* http://www.improving.org/paulp/ *---------- |
|
|
Re: parser combinators vs. regex questionPaul Phillips <paulp@...> writes:
> For 2.8 the guard combinator has been added, so you could say: > > '"' ~> rep(chars) <~ '"' <~ guard("\r\n" | EOF) > > but in 2.7 you can achieve the same effect with double negation: > > '"' ~> rep(chars) <~ '"' <~ not(not("\r\n" | EOF)) > > This is pseudocode but you should be able to get it going. Neither > guard nor not consume any input, they only place constraints on a match. Thanks for the hint. I think the guard is a look-ahead construct, it doesn't answer the backtracking question in any way. Intuitively, since there is a disjunction which effectively does a backtracking (e.g. if the first Parser didn't work, disj. should return to the same position and try the second Parser), I think replacing def stringInQuotes = """(?xs) ".*?" |""".r ^^ { case qstr => if (qstr.length != 0) qstr.substring (1, qstr.length - 1) else ""} def line = stringInQuotes ~ ';' ~ stringInQuotes ~ (CRLF | EOF) ^^ { case col1 ~ _ ~ col2 ~ _ => col1 :: col2 :: Nil} with def chars1 = rep ("(?s).".r) ^^ {case chars_ => chars_ mkString ""} def chars2 = rep (elem("value", (c: Char) => true)) ^^ { case chars_ => chars_ mkString ""} def col1 = ('"' ~ chars1 ~ "\";" ^^ {case _ ~ value ~ _ => value} | ";" ^^ {case _ => ""}) def col2 = ('"' ~ chars1 ~ ("\"" ~ (CRLF | EOF)) ^^ { case _ ~ value ~ _ => value} | (CRLF | EOF) ^^ {case _ => ""}) def line = col1 ~ col2 ^^ {case v1 ~ v2 => v1 :: v2 :: Nil} should just work. However, like I said earlier, with chars1 it gives wrong results, e.g. [1.1] failure: `";' expected but `' found with input "\"qq\nqq\";" and with chars2 it goes into an infinite loop... |
|
|
Re: Re: parser combinators vs. regex questionOn Thu, May 21, 2009 at 08:01:28PM +0000, ArtemGr wrote:
> [snip] The degree to which you are overcomplicating this by trying to apply regexp backtracking defies description. Your code is too difficult to read, but what you're trying to do can be done in one or two lines. > I think the guard is a look-ahead construct, it doesn't answer the > backtracking question in any way. What do you think happens when the guard fails? Backtracking is what the combinators do if you use the backtracking ops (that's | among others.) Attempting to reimplement backtracking with regexps inside the combinator framework is like throwing away your sword so you can fight off the marauders with a lampshade. -- Paul Phillips | The most dangerous man to any government is the man who Apatheist | is able to think things out [...] Almost inevitably he Empiricist | comes to the conclusion that the government he lives under pp: i haul pills | is dishonest, insane, intolerable. -- H. L. Mencken |
|
|
Re: parser combinators vs. regex questionPaul Phillips <paulp@...> writes:
> On Thu, May 21, 2009 at 08:01:28PM +0000, ArtemGr wrote: > > [snip] > > The degree to which you are overcomplicating this by trying to apply > regexp backtracking defies description. I'm not "trying to apply" regex backtracking, i've already said it isn't very convenient without access to subgroups. > Your code is too difficult to > read, but what you're trying to do can be done in one or two lines. > > I think the guard is a look-ahead construct, it doesn't answer the > > backtracking question in any way. > > What do you think happens when the guard fails? Backtracking is what the > combinators do if you use the backtracking ops (that's | among others.) That's what I implied in the previous post, by saying that disjunction should use backtracking. Thanks for clarifying. That answers the theoretical half of my question. > Attempting to reimplement backtracking with regexps inside the > combinator framework is like throwing away your sword so you can fight > off the marauders with a lampshade. I think the comparison would be rather of throwing away the spare parts of a calculator in order to fight the maradeurs with the old good regex sword. After all, the simple and intuitive regex """(?xs) ("(.*?)"|) ; ("(.*?)"|) (?: \r?\n | \z ) """ which took me one minute to write - works, and parser combinators, after 12 to 16 hours of tweaking are either fail unexpectedly or go skyrocket into infinite loop. |
|
|
Re: parser combinators vs. regex questionArtemGr <artemciy@...> writes:
> Is it possible to do that kind of backtracking with the current (2.7.4) > release of the combinators and if so, then what i'm doing wrong? I have found an interesting comment in the javadoc of disjunction method "Parser.|": <p>`p | q' succeeds if `p' succeeds or `q' succeeds Note that `q' is only tried if `p's failure is non-fatal (i.e., back-tracking is allowed).</p> It implies that backtracking is not always allowed. I wonder what kind of parsers might produce a "fatal failure" and why? |
|
|
Re: Re: parser combinators vs. regex questionA not-matching branch of an alternative is non-fatal.
For better error handling you might anticipate and introduce error branches (alternatives) to give better error messages. These are fatal, since you don't want parsing to continue in these cases. On Fri, May 22, 2009 at 10:49 AM, ArtemGr <artemciy@...> wrote: > ArtemGr <artemciy@...> writes: >> Is it possible to do that kind of backtracking with the current (2.7.4) >> release of the combinators and if so, then what i'm doing wrong? > > I have found an interesting comment > in the javadoc of disjunction method "Parser.|": > > <p>`p | q' succeeds if `p' succeeds or `q' succeeds > Note that `q' is only tried if `p's failure is non-fatal > (i.e., back-tracking is allowed).</p> > > It implies that backtracking is not always allowed. > > I wonder what kind of parsers might produce a "fatal failure" and why? > > -- Johannes ----------------------------------------------- Johannes Rudolph http://virtual-void.net |
|
|
Re: Re: parser combinators vs. regex questionOn Friday May 22 2009, Johannes Rudolph wrote:
> A not-matching branch of an alternative is non-fatal. > For better error handling you might anticipate and introduce error > branches (alternatives) to give better error messages. These are > fatal, since you don't want parsing to continue in these cases. Unless, of course, you do. I tend to think that nothing is worse than a prematurely terminated parse. It's at least as bad as poor error messages. There are typically some (often many) errors that don't render the rest of the parse impossible or meaningless. When I write parsers, I always try to include as many error productions as possible. Randall Schulz |
|
|
Re: Re: parser combinators vs. regex questionOn Thu, May 21, 2009 at 10:44 AM, ArtemGr <artemciy@...> wrote:
I have also found the need for this, and have come up with the following solution as a makeshift change, as you can define a parser that does this for you : def regexMatch(r : Regex) : Parser[Match] = Parser { in => regex(r)(in) match { case Success(aString, theRest) => Success(r.findFirstMatchIn(aString).get, theRest) case f@Failure(_,_) => f case e@Error(_,_) => e }} The other solution is to change the implicit regex method in RegexParsers itself. I was also wondering if there is a better solution to parsing different groups. Thanks, Manohar |
| Free embeddable forum powered by Nabble | Forum Help |