Certain non-zero-length non-matching regexes run forever on Saxon

View: New views
5 Messages — Rating Filter:   Alert me  

Certain non-zero-length non-matching regexes run forever on Saxon

by Abel Braaksma (online) :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi Saxon users,

I don't usually do any cross-posts, but this time I think I have to, my
apologies beforehand. I abridged the text a bit, sorry if it is still
rather wieldy. Original subject line: "Backtracking and eternal loops
caused by regular expressions matching: what to expect from
implementations?"

[original, altered message]

I happen to run into an XSLT 2 behavior that I am uncertain of if it is
desired or expected behavior of implementations, if it is undesired, but
still expected, or if it should be prohibited.

Now, suppose this smallest possible match is zero length. In this
situation, with classic regex parsers, the engine will try forever on a
non-matching string (or at least very long). In XSLT, this behavior
seems to happen even when the "smallest possible match" is of non-zero
length. Note that this applies to subexpressions only, because
zero-length matching regex (in its entirety) is not allowed in XSLT.

In XSLT this is shown as follows, where I use the example of matching a
CSV quoted string which may escape the quote by doubling it:
Good csv string examples:
"well quoted csv string"
"well ""quoted"" csv string

Bad csv string example:
not well "quoted csv string

The regular expression to match a quoted CSV string is (include quotes):
"([^"]+|"")*"

This gives trouble in the following XSLT (showing a non-matching
string), which runs for a very long time (exponential performance chart)

<xsl:variable name="ltr">
   before " after and some more text after
</xsl:variable>
<xsl:variable name="regex">"([^"]+|"")*"</xsl:variable>

<xsl:analyze-string select="$ltr" regex="{$regex}">
   <xsl:matching-substring>
      <found><xsl:value-of select="." /></found>
   </xsl:matching-substring>
   <xsl:non-matching-substring>
       <not-found><xsl:value-of select="." /></not-found>
   </xsl:non-matching-substring>
</xsl:analyze-string>


The above example ran for > 140 minutes and still runs (it does not give
any problems with matching strings, but does give problems with strings
that have a large non-matching part with a unequal quote count in it). A
simpler regex (but with less practicability) that shows the same
behavior (quotes are again part of the regex): "([^"]+)*"

I am under the impression that such behavior is not desirable, but I am
unsure if there is anything in the specs that says something about how
implementations should/must deal with this. As a comparison, I tried the
example with Perl, which gave no noticeable performance troubles. I am
aware of the fact that Saxon relies havily on Sun's Java regex engine,
but I also know that it does a translation beforehand.

Note that repeating a possible-empty match is dangerous and should be
avoided. But in the example above, I use a subexpression that never
matches an empty string (the + quantifier means 1 or more) and as such
should not happen to fall into the eternal-loop problem.

I use Saxon 8.8 with Java 1.5. I did not yet try 8.8.0.4

Note that it may or may not be possible to optimize the regex in such a
way that it fails earlier on the given string. But my problem is with
the (imo) unpredictability of the performance (or even lockup) with any
less-then-very-trivial regex.


Cheers,
-- Abel Braaksma
  http://www.nuntia.nl

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
saxon-help mailing list
saxon-help@...
https://lists.sourceforge.net/lists/listinfo/saxon-help

Re: Certain non-zero-length non-matching regexes run forever on Saxon

by Owen Rees :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

--On Tuesday, January 23, 2007 17:18:56 +0100 Abel Braaksma wrote:

> I am
> aware of the fact that Saxon relies havily on Sun's Java regex engine,
> but I also know that it does a translation beforehand.

In this case, the problem is in the Java regex engine but according to
<http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6393051> and the
related entries in the Java bug database some matches have exponential time
complexity, at least with the algorithms Java uses so this is not a bug
they intend to fix. (I don't know if there are other algorithms that can
handle such cases.)

An alternative for the particular case that seems not to cause the problem
is:

<xsl:variable name="regex">("[^"]*")+</xsl:variable>

--
Owen Rees
Hewlett Packard Laboratories, Bristol, UK


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
saxon-help mailing list
saxon-help@...
https://lists.sourceforge.net/lists/listinfo/saxon-help

Re: Certain non-zero-length non-matching regexes run forever on Saxon

by Abel Braaksma (online) :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Owen Rees wrote:
> --On Tuesday, January 23, 2007 17:18:56 +0100 Abel Braaksma wrote:
>  
> In this case, the problem is in the Java regex engine but according to
> <http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6393051> and the
> related entries in the Java bug database some matches have exponential time
> complexity, at least with the algorithms Java uses so this is not a bug
> they intend to fix. (I don't know if there are other algorithms that can
> handle such cases.)
>  

There are, plenty. Try Ruby, Perl, Python etc. None of these seem to
have the same problem.

> An alternative for the particular case that seems not to cause the problem
> is:
>
> <xsl:variable name="regex">("[^"]*")+</xsl:variable>
>  

Thanks, Owen. That surely solves this particular problem. I didn't want
to go into too much detail, but I run into this problem more often than
I could hope for. I always make sure that the total regex is never
zero-length, but I cannot be sure of the sub matches (not always, at least).

The problem gets more complex in light of bidi text, where this
originated. However, since this seems to be a Java problem, I'll just
have to make sure that I thoroughly test every regex for production
quality...

-- Abel

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
saxon-help mailing list
saxon-help@...
https://lists.sourceforge.net/lists/listinfo/saxon-help

Re: Certain non-zero-length non-matching regexes run forever onSaxon

by Michael Kay :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

The specs have nothing to say about performance.

It's useful to have and to share information about badly-performing
constructs, but on the evidence of one example I don't think I plan to do
anything about it. Saxon uses the underlying regex engine in Java or .NET,
and these seems to perform reasonably well most of the time. I could raise
examples of poor performance with the JDK developers, but I'm not sure
whether this would lead to anything: they presumably have a lot more
information about regex performance than I do.

Michael Kay
http://www.saxonica.com/ 

> -----Original Message-----
> From: saxon-help-bounces@...
> [mailto:saxon-help-bounces@...] On Behalf
> Of Abel Braaksma
> Sent: 23 January 2007 16:19
> To: Mailing list for SAXON XSLT queries
> Subject: [saxon] Certain non-zero-length non-matching regexes
> run forever onSaxon
>
> Hi Saxon users,
>
> I don't usually do any cross-posts, but this time I think I
> have to, my apologies beforehand. I abridged the text a bit,
> sorry if it is still rather wieldy. Original subject line:
> "Backtracking and eternal loops caused by regular expressions
> matching: what to expect from implementations?"
>
> [original, altered message]
>
> I happen to run into an XSLT 2 behavior that I am uncertain
> of if it is desired or expected behavior of implementations,
> if it is undesired, but still expected, or if it should be prohibited.
>
> Now, suppose this smallest possible match is zero length. In
> this situation, with classic regex parsers, the engine will
> try forever on a non-matching string (or at least very long).
> In XSLT, this behavior seems to happen even when the
> "smallest possible match" is of non-zero length. Note that
> this applies to subexpressions only, because zero-length
> matching regex (in its entirety) is not allowed in XSLT.
>
> In XSLT this is shown as follows, where I use the example of
> matching a CSV quoted string which may escape the quote by
> doubling it:
> Good csv string examples:
> "well quoted csv string"
> "well ""quoted"" csv string
>
> Bad csv string example:
> not well "quoted csv string
>
> The regular expression to match a quoted CSV string is
> (include quotes):
> "([^"]+|"")*"
>
> This gives trouble in the following XSLT (showing a
> non-matching string), which runs for a very long time
> (exponential performance chart)
>
> <xsl:variable name="ltr">
>    before " after and some more text after </xsl:variable>
> <xsl:variable name="regex">"([^"]+|"")*"</xsl:variable>
>
> <xsl:analyze-string select="$ltr" regex="{$regex}">
>    <xsl:matching-substring>
>       <found><xsl:value-of select="." /></found>
>    </xsl:matching-substring>
>    <xsl:non-matching-substring>
>        <not-found><xsl:value-of select="." /></not-found>
>    </xsl:non-matching-substring>
> </xsl:analyze-string>
>
>
> The above example ran for > 140 minutes and still runs (it
> does not give any problems with matching strings, but does
> give problems with strings that have a large non-matching
> part with a unequal quote count in it). A simpler regex (but
> with less practicability) that shows the same behavior
> (quotes are again part of the regex): "([^"]+)*"
>
> I am under the impression that such behavior is not
> desirable, but I am unsure if there is anything in the specs
> that says something about how implementations should/must
> deal with this. As a comparison, I tried the example with
> Perl, which gave no noticeable performance troubles. I am
> aware of the fact that Saxon relies havily on Sun's Java
> regex engine, but I also know that it does a translation beforehand.
>
> Note that repeating a possible-empty match is dangerous and
> should be avoided. But in the example above, I use a
> subexpression that never matches an empty string (the +
> quantifier means 1 or more) and as such should not happen to
> fall into the eternal-loop problem.
>
> I use Saxon 8.8 with Java 1.5. I did not yet try 8.8.0.4
>
> Note that it may or may not be possible to optimize the regex
> in such a way that it fails earlier on the given string. But
> my problem is with the (imo) unpredictability of the
> performance (or even lockup) with any less-then-very-trivial regex.
>
>
> Cheers,
> -- Abel Braaksma
>   http://www.nuntia.nl
>
> --------------------------------------------------------------
> -----------
> Take Surveys. Earn Cash. Influence the Future of IT Join
> SourceForge.net's Techsay panel and you'll get the chance to
> share your opinions on IT & business topics through brief
> surveys - and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge
> &CID=DEVDEV
> _______________________________________________
> saxon-help mailing list
> saxon-help@...
> https://lists.sourceforge.net/lists/listinfo/saxon-help


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
saxon-help mailing list
saxon-help@...
https://lists.sourceforge.net/lists/listinfo/saxon-help

Re: Certain non-zero-length non-matching regexes run forever onSaxon

by Abel Braaksma (online) :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Michael Kay wrote:
> The specs have nothing to say about performance.
>
> It's useful to have and to share information about badly-performing
> constructs, but on the evidence of one example I don't think I plan to do
> anything about it. Saxon uses the underlying regex engine in Java or .NET,
> and these seems to perform reasonably well most of the time. I could raise
> examples of poor performance with the JDK developers, but I'm not sure
> whether this would lead to anything: they presumably have a lot more
> information about regex performance than I do.

Hi Michael,

I used "one example" for the sake of discussion, I don't want to flood
you with all kinds of regexes just proofing my point. But here's perhaps
the simplest I could think of that makes Saxon (or Java for that matter)
run forever. It is a classical example of an exponentially (or
super-linearly) performing regular expression.

However, most engines nowadays short-circuit this (almost) eternal
backtracking, because it is fairly easy to detect (i.e., in general,
when all positions of a string are tested, it is no use to test them
again, in which case the engine can stop and report 'no match', but Java
does (string-length)! tests, which grows pretty quickly).

In Java, it looks like this:

            Pattern ptrn = Pattern.compile("#(.+)+#");
            Matcher matcher = ptrn.matcher("# this text does not match
and runs very long");
           
            while(matcher.find()) {
                System.out.println(matcher.group());
            }
            System.out.println("end");



The regex is designed to find strings anchored by '#' and '#'. However,
the (.+)+ construct is used to illustrate exponential looping. The above
code runs pretty long, maybe hours or days (I stopped it after 5
minutes). Same for XSLT, of course.

Many regexes have a hidden multiple quantifier selection, causing a
lockup of the engine. Page 247 in "Mastering Regular Expressions 2nd Ed"
by J. Friedl, he explains what can be done about it. In Chapter 6, he
explains in depth how backtracking works. Page 226 is about "Exponential
matches".

I understand that this is a Java problem. If you are in need of several
real world examples illustrating this behavior, I can give you plenty.
Often, it is a sign of badly crafted regexes (Owen Rees showed me a
better one) but sometimes there is little avoidance possible. And even
if the regex is badly crafted, should the engine lockup?

-- Abel Braaksma
   http://www.nuntia.nl

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
saxon-help mailing list
saxon-help@...
https://lists.sourceforge.net/lists/listinfo/saxon-help