segfault after regexp

by G..

Using the file string.txt I get

octave:1> load "string.txt"
octave:2> regexp(s, '^(\s*-*\d+[.]*\d*\s*)+$', "lineanchors") ;
Segmentation fault

This happens for the following two systems (using octave-3.0.2):
Operating System: Linux 2.6.18-6-amd64 #1 SMP Thu May 8 06:49:39 UTC 2008 x86_64
Operating System: Linux 2.6.27-rc6 #10 SMP Mon Sep 15 18:46:53 CEST 2008 i686

It doesn't happen for
Operating System: Linux 2.4.18-nec3.4p1.045 #1 SMP Mon Apr 9 16:57:17 JST 2007 ia64

G.

Re: segfault after regexp

by Thomas Weber-8

On Thursday, 18.09.2008 at 04:26 -0700, G.. wrote:

> Using the file  http://www.nabble.com/file/p19550666/string.txt string.txt  I
> get
>
> octave:1> load "string.txt"
> octave:2> regexp(s, '^(\s*-*\d+[.]*\d*\s*)+$', "lineanchors") ;
> Segmentation fault
>
> This happens for the following two systems (using octave-3.0.2):
> Operating System: Linux 2.6.18-6-amd64 #1 SMP Thu May 8 06:49:39 UTC 2008
> x86_64
> Operating System: Linux 2.6.27-rc6 #10 SMP Mon Sep 15 18:46:53 CEST 2008
> i686
>
> It doesn't happen for
> Operating System: Linux 2.4.18-nec3.4p1.045 #1 SMP Mon Apr 9 16:57:17 JST
> 2007 ia64

Which version of pcre? On all systems, please.

        Thomas

_______________________________________________
Bug-octave mailing list
Bug-octave@...
https://www-old.cae.wisc.edu/mailman/listinfo/bug-octave

segfault after regexp

by John W. Eaton

On 18-Sep-2008, G.. wrote:

| Using the file  http://www.nabble.com/file/p19550666/string.txt string.txt  I
| get
|
| octave:1> load "string.txt"
| octave:2> regexp(s, '^(\s*-*\d+[.]*\d*\s*)+$', "lineanchors") ;
| Segmentation fault
|
| This happens for the following two systems (using octave-3.0.2):
| Operating System: Linux 2.6.18-6-amd64 #1 SMP Thu May 8 06:49:39 UTC 2008
| x86_64
| Operating System: Linux 2.6.27-rc6 #10 SMP Mon Sep 15 18:46:53 CEST 2008
| i686
|
| It doesn't happen for
| Operating System: Linux 2.4.18-nec3.4p1.045 #1 SMP Mon Apr 9 16:57:17 JST
| 2007 ia64

Running Octave under gdb and trying this example, I see:

  octave:1> load string.txt
  octave:2> regexp(s, '^(\s*-*\d+[.]*\d*\s*)+$', "lineanchors") ;

  Program received signal SIGSEGV, Segmentation fault.
  [Switching to Thread 0x7f4a34bd76f0 (LWP 7894)]
  0x00007f4a2bfa51f8 in ?? () from /usr/lib/libpcre.so.3
  (gdb) bt
  #0  0x00007f4a2bfa51f8 in ?? () from /usr/lib/libpcre.so.3
  #1  0x00007f4a2bfa5214 in ?? () from /usr/lib/libpcre.so.3
  #2  0x00007f4a2bf9f998 in ?? () from /usr/lib/libpcre.so.3
  #3  0x00007f4a2bfa39ec in ?? () from /usr/lib/libpcre.so.3
  #4  0x00007f4a2bfa5214 in ?? () from /usr/lib/libpcre.so.3
  #5  0x00007f4a2bfa5214 in ?? () from /usr/lib/libpcre.so.3
  #6  0x00007f4a2bfa758c in ?? () from /usr/lib/libpcre.so.3
  #7  0x00007f4a2bfa5214 in ?? () from /usr/lib/libpcre.so.3
  #8  0x00007f4a2bfa5214 in ?? () from /usr/lib/libpcre.so.3
  #9  0x00007f4a2bf9f998 in ?? () from /usr/lib/libpcre.so.3
  #10 0x00007f4a2bfa39ec in ?? () from /usr/lib/libpcre.so.3
  [...]
  #2174 0x00007f4a2bfa5214 in ?? () from /usr/lib/libpcre.so.3
  #2175 0x00007f4a2bfa5214 in ?? () from /usr/lib/libpcre.so.3
  #2176 0x00007f4a2bfa758c in ?? () from /usr/lib/libpcre.so.3
  [...]
  [etc.]
  [etc.]
  [etc.]

So this appears to be an infinite recursion bug in the PCRE library.

jwe

Re: segfault after regexp

by John W. Eaton

On 18-Sep-2008, Thomas Weber wrote:

| On Thursday, 18.09.2008 at 04:26 -0700, G.. wrote:
| > Using the file  http://www.nabble.com/file/p19550666/string.txt string.txt  I
| > get
| >
| > octave:1> load "string.txt"
| > octave:2> regexp(s, '^(\s*-*\d+[.]*\d*\s*)+$', "lineanchors") ;
| > Segmentation fault
| >
| > This happens for the following two systems (using octave-3.0.2):
| > Operating System: Linux 2.6.18-6-amd64 #1 SMP Thu May 8 06:49:39 UTC 2008
| > x86_64
| > Operating System: Linux 2.6.27-rc6 #10 SMP Mon Sep 15 18:46:53 CEST 2008
| > i686
| >
| > It doesn't happen for
| > Operating System: Linux 2.4.18-nec3.4p1.045 #1 SMP Mon Apr 9 16:57:17 JST
| > 2007 ia64
|
| Which version of pcre? On all systems, please.

For me, it is

segfault:379> dpkg -l *pcre*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Cfg-files/Unpacked/Failed-cfg/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Hold/Reinst-required/X=both-problems (Status,Err: uppercase=bad)
||/ Name           Version        Description
+++-==============-==============-============================================
ii  libpcre3       7.6-2.1        Perl 5 Compatible Regular Expression Library
ii  libpcre3-dev   7.6-2.1        Perl 5 Compatible Regular Expression Library
ii  libpcrecpp0    7.6-2.1        Perl 5 Compatible Regular Expression Library

on a Debian AMD64 system.

jwe

Re: segfault after regexp

by G..

Thomas Weber-8 wrote:
> This happens for the following two systems (using octave-3.0.2):
> Operating System: Linux 2.6.18-6-amd64 #1 SMP Thu May 8 06:49:39 UTC 2008
> x86_64
> Operating System: Linux 2.6.27-rc6 #10 SMP Mon Sep 15 18:46:53 CEST 2008
> i686
>
> It doesn't happen for
> Operating System: Linux 2.4.18-nec3.4p1.045 #1 SMP Mon Apr 9 16:57:17 JST
> 2007 ia64

> Which version of pcre? On all systems, please.

pcre 7.8 for a) and c), and 7.4 for b) (where it was actually octave-3.0.0 from Ubuntu Hardy).

Re: segfault after regexp

by Sergei Steshenko-2

--- On Thu, 9/18/08, G.. <gail@...> wrote:

> From: G.. <gail@...>
> Subject: segfault after regexp
> To: bug-octave@...
> Date: Thursday, September 18, 2008, 4:26 AM
> Using the file  http://www.nabble.com/file/p19550666/string.txt string.txt  I
> get
>
> octave:1> load "string.txt"
> octave:2> regexp(s, '^(\s*-*\d+[.]*\d*\s*)+$', "lineanchors") ;
> Segmentation fault
>
> This happens for the following two systems (using octave-3.0.2):
> Operating System: Linux 2.6.18-6-amd64 #1 SMP Thu May 8 06:49:39 UTC 2008
> x86_64
> Operating System: Linux 2.6.27-rc6 #10 SMP Mon Sep 15 18:46:53 CEST 2008
> i686
>
> It doesn't happen for
> Operating System: Linux 2.4.18-nec3.4p1.045 #1 SMP Mon Apr 9 16:57:17 JST
> 2007 ia64
>
> G.
> --

I can confirm this: it happens on both 3.0.1 and 3.0.2, with a self-built
Octave and self-built dependencies.

Regards,
  Sergei.


     
Re: segfault after regexp

by Thomas Weber-8

On Thu, Sep 18, 2008 at 04:26:27AM -0700, G.. wrote:
>
> Using the file  http://www.nabble.com/file/p19550666/string.txt string.txt  I
> get
>
> octave:1> load "string.txt"
> octave:2> regexp(s, '^(\s*-*\d+[.]*\d*\s*)+$', "lineanchors") ;
> Segmentation fault

I suspect a stack overflow here.

G., you did read Philip's comments about nested unlimited repeats,
didn't you?

http://lists.exim.org/lurker/message/20080918.084230.037a0008.en.html

This works on my system if I increase the stack size limit (w.m contains
your commands):

$ ulimit -s
8192
$ ./run-octave -q
octave:1> w
Segmentation fault

$ ulimit -s 16000
$ ./run-octave -q
octave:1> w
ans =  1


 
> This happens for the following two systems (using octave-3.0.2):
> Operating System: Linux 2.6.18-6-amd64 #1 SMP Thu May 8 06:49:39 UTC 2008
> x86_64
> Operating System: Linux 2.6.27-rc6 #10 SMP Mon Sep 15 18:46:53 CEST 2008
> i686
>
> It doesn't happen for
> Operating System: Linux 2.4.18-nec3.4p1.045 #1 SMP Mon Apr 9 16:57:17 JST
> 2007 ia64

I suspect the Itanium has a bigger default stack size.
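
For comparing systems, the value that `ulimit -s` prints can also be read from inside a process. The following is only a minimal C++ sketch, assuming a POSIX system that provides getrlimit(); the function name `stack_soft_limit_kib` is made up for illustration and is not part of Octave.

```cpp
#include <sys/resource.h>

// Hypothetical helper (not an Octave function).
// Returns the soft stack limit in KiB, the unit that `ulimit -s` reports,
// -1 if the limit is unlimited, or -2 if the query itself fails.
long long stack_soft_limit_kib()
{
  struct rlimit rlim;

  if (getrlimit(RLIMIT_STACK, &rlim) != 0)
    return -2;

  if (rlim.rlim_cur == RLIM_INFINITY)
    return -1;

  return static_cast<long long>(rlim.rlim_cur / 1024);
}
```

On a shell where `ulimit -s` prints 8192, this function would be expected to return 8192 as well, since child processes inherit the limit.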

        Thomas

Re: segfault after regexp

by G..

Thomas Weber-8 wrote:
> $ ulimit -s 16000
> $ ./run-octave -q
> octave:1> w
> ans =  1

That works. However, for larger data (e.g. by simply doubling the input string) it doesn't.

G.

Re: segfault after regexp

by Thomas Weber-8

On Sun, Sep 28, 2008 at 02:15:47AM -0700, G.. wrote:

>
>
> Thomas Weber-8 wrote:
> >
> > $ ulimit -s 16000
> > $ ./run-octave -q
> > octave:1> w
> > ans =  1
> >
>
> That works. However, for larger data (e.g. by simply doubling the input
> string) it doesn't.

You are running into a constraint/protection by your operating system.

Increase the size even more or (better): check if you can't come up with
better regexps.

        Thomas

Re: segfault after regexp

by G..

Thomas Weber-8 wrote:
> > $ ulimit -s 16000
> > $ ./run-octave -q
> > octave:1> w
> > ans =  1
> >
>
> That works. However, for larger data (e.g. by simply doubling the input
> string) it doesn't.

> You are running into a constraint/protection by your operating system.
>
> Increase the size even more or (better): check if you can't come up with
> better regexps.

a) this may be ignorant, but summarizing this means that - on the same machine with ulimit 16000 - I get:

octave:> regexp(s, '^(\s*-*\d+[.]*\d*\s*)+$', 'lineanchors')
Segmentation fault

matlab:> regexp(repmat(s,1,1000), '^(\s*-*\d+[.]*\d*\s*)+$')

ans =

     1

After enlarging the input string to its limits, matlab finally gives a message that the array is too large, but never segfaults.

b) w.r.t. the regexp, I really don't see how to describe a sequence of spaces and numbers more simply. And the data (the input string) is pretty tame.
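
One way to sidestep the question, not proposed by anyone in this thread, is to check the format with a single pass over the characters instead of a backtracking regexp. The C++ sketch below is hypothetical (the name `is_number_sequence` is invented here). It tests whether a whole string matches, so it ignores the "lineanchors" subtleties, and as far as I can tell it accepts the same strings as the pattern above. It uses constant stack space regardless of input length.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper, not part of Octave or PCRE.
// Accepts a string made of one or more tokens of the form
//   -* digits+ .* digits*
// with optional whitespace around them, which is what the regexp
// '^(\s*-*\d+[.]*\d*\s*)+$' is meant to describe.
bool is_number_sequence(const std::string& text)
{
  enum class Prev { Space, Minus, Digit, Dot };
  Prev prev = Prev::Space;   // the start of the string behaves like whitespace
  bool seen_digit = false;

  for (unsigned char ch : text)
    {
      if (std::isdigit(ch))
        {
          seen_digit = true;           // digits are allowed after anything
          prev = Prev::Digit;
        }
      else if (ch == '-')
        prev = Prev::Minus;            // starts (or extends) the sign of a token
      else if (prev == Prev::Minus)
        return false;                  // a sign must be followed by '-' or a digit
      else if (ch == '.')
        {
          if (prev != Prev::Digit && prev != Prev::Dot)
            return false;              // dots may only follow the integer part
          prev = Prev::Dot;
        }
      else if (std::isspace(ch))
        prev = Prev::Space;
      else
        return false;                  // any other character is rejected
    }

  // At least one token, and no dangling sign at the end.
  return seen_digit && prev != Prev::Minus;
}
```

For example, `is_number_sequence("1 -2.5 3.")` is true, while `is_number_sequence(".5")` and `is_number_sequence("1 -")` are false.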

G.

Re: segfault after regexp

by John W. Eaton

On 29-Sep-2008, G.. wrote:

| a) this may be ignorant, but summarizing this means that - on the same
| machine with ulimit 16000 - I get:
|
| octave:> regexp(s, '^(\s*-*\d+[.]*\d*\s*)+$', 'lineanchors')
| Segmentation fault
|
| matlab:> regexp(repmat(s,1,1000), '^(\s*-*\d+[.]*\d*\s*)+$')
|
| ans =
|
|      1
|
| After enlarging the input string to its limits, matlab finally gives a
| message that the array is too large, but never segfaults.

I think it has been mentioned before that Matlab uses its own regexp
library that has its own set of bugs.

Since Octave just sets up a call to the PCRE library, I think the bug
is probably in the PCRE library, and the right place to fix it is
there.  Unless maybe there is a way now to have PCRE detect this
problem and return an error code instead of infinitely recursing and
causing a segfault.

Hmm.  I expected to be able to also generate a segfault with the
pcretest program, but I couldn't make that happen.  So maybe Octave is
calling the PCRE functions incorrectly?  I'm not sure, and not an
expert here, so it would be helpful if someone could help debug the
problem, or verify that there is a bug in the PCRE library that should
be fixed.

Thanks,

jwe

Re: segfault after regexp

by Thomas Weber-8

On Mon, Sep 29, 2008 at 02:19:41PM -0400, John W. Eaton wrote:

> On 29-Sep-2008, G.. wrote:
>
> | a) this may be ignorant, but summarizing this means that - on the same
> | machine with ulimit 16000 - I get:
> |
> | octave:> regexp(s, '^(\s*-*\d+[.]*\d*\s*)+$', 'lineanchors')
> | Segmentation fault
> |
> | matlab:> regexp(repmat(s,1,1000), '^(\s*-*\d+[.]*\d*\s*)+$')
> |
> | ans =
> |
> |      1
> |
> | After enlarging the input string to its limits, matlab finally gives a
> | message that the array is too large, but never segfaults.
>
> I think it has been mentioned before that Matlab uses its own regexp
> library that has its own set of bugs.
>
> Since Octave just sets up a call to the PCRE library, I think the bug
> is probably in the PCRE library, and the right place to fix it is
> there.  Unless maybe there is a way now to have PCRE detect this
> problem and return an error code instead of infinitely recursing and
> causing a segfault.

Actually it's a SIGSEGV.

It might be possible to change regexp.cc to set the soft limit for stack
recursion (the equivalent of the above 'ulimit -s' command) to the hard
limit.

I don't know however what kind of consequences this has for the system
in question.

        Thomas

Re: segfault after regexp

by John W. Eaton

On 29-Sep-2008, Thomas Weber wrote:

| On Mon, Sep 29, 2008 at 02:19:41PM -0400, John W. Eaton wrote:
| > On 29-Sep-2008, G.. wrote:
| >
| > | a) this may be ignorant, but summarizing this means that - on the same
| > | machine with ulimit 16000 - I get:
| > |
| > | octave:> regexp(s, '^(\s*-*\d+[.]*\d*\s*)+$', 'lineanchors')
| > | Segmentation fault
| > |
| > | matlab:> regexp(repmat(s,1,1000), '^(\s*-*\d+[.]*\d*\s*)+$')
| > |
| > | ans =
| > |
| > |      1
| > |
| > | After enlarging the input string to its limits, matlab finally gives a
| > | message that the array is too large, but never segfaults.
| >
| > I think it has been mentioned before that Matlab uses its own regexp
| > library that has its own set of bugs.
| >
| > Since Octave just sets up a call to the PCRE library, I think the bug
| > is probably in the PCRE library, and the right place to fix it is
| > there.  Unless maybe there is a way now to have PCRE detect this
| > problem and return an error code instead of infinitely recursing and
| > causing a segfault.
|
| Actually it's a SIGSEGV.

There's a difference?

| It might be possible to change regexp.cc to set the soft limit for stack
| recursion (the equivalent of the above 'ulimit -s' command) to the hard
| limit.
|
| I don't know however what kind of consequences this has for the system
| in question.

I think we should first find out why this is going into an apparently
infinite recursion.  If it is an error in the way that we are using
the PCRE functions, then maybe we can fix it.  Otherwise, I think the
bug should be fixed in PCRE.

jwe

Re: segfault after regexp

by Thomas Weber-8

On Monday, 29.09.2008 at 17:07 -0400, John W. Eaton wrote:

> On 29-Sep-2008, Thomas Weber wrote:
>
> | On Mon, Sep 29, 2008 at 02:19:41PM -0400, John W. Eaton wrote:
> | > On 29-Sep-2008, G.. wrote:
> | >
> | > | a) this may be ignorant, but summarizing this means that - on the same
> | > | machine with ulimit 16000 - I get:
> | > |
> | > | octave:> regexp(s, '^(\s*-*\d+[.]*\d*\s*)+$', 'lineanchors')
> | > | Segmentation fault
> | > |
> | > | matlab:> regexp(repmat(s,1,1000), '^(\s*-*\d+[.]*\d*\s*)+$')
> | > |
> | > | ans =
> | > |
> | > |      1
> | > |
> | > | After enlarging the input string to its limits, matlab finally gives a
> | > | message that the array is too large, but never segfaults.
> | >
> | > I think it has been mentioned before that Matlab uses its own regexp
> | > library that has its own set of bugs.
> | >
> | > Since Octave just sets up a call to the PCRE library, I think the bug
> | > is probably in the PCRE library, and the right place to fix it is
> | > there.  Unless maybe there is a way now to have PCRE detect this
> | > problem and return an error code instead of infinitely recursing and
> | > causing a segfault.
> |
> | Actually it's a SIGSEGV.
>
> There's a difference?

I thought, but I was wrong. Okay, forget that.

Anyway, we could catch the SIGSEGV from the pcre library and proceed
accordingly.

>
> | It might be possible to change regexp.cc to set the soft limit for stack
> | recursion (the equivalent of the above 'ulimit -s' command) to the hard
> | limit.
> |
> | I don't know however what kind of consequences this has for the system
> | in question.
>
> I think we should first find out why this is going into an apparently
> infinite recursion.  If it is an error in the way that we are using
> the PCRE functions, then maybe we can fix it.  Otherwise, I think the
> bug should be fixed in PCRE.

I don't think it's an infinite recursion (it works when given enough
space, so it's definitely finite).

There are already several options in the PCRE library, including a
different implementation for regexps like this.

        Thomas

Re: segfault after regexp

by John W. Eaton

On 30-Sep-2008, Thomas Weber wrote:

| On Monday, 29.09.2008 at 17:07 -0400, John W. Eaton wrote:
| > On 29-Sep-2008, Thomas Weber wrote:
| >
| > | On Mon, Sep 29, 2008 at 02:19:41PM -0400, John W. Eaton wrote:
| > | > On 29-Sep-2008, G.. wrote:
| > | >
| > | > | a) this may be ignorant, but summarizing this means that - on the same
| > | > | machine with ulimit 16000 - I get:
| > | > |
| > | > | octave:> regexp(s, '^(\s*-*\d+[.]*\d*\s*)+$', 'lineanchors')
| > | > | Segmentation fault
| > | > |
| > | > | matlab:> regexp(repmat(s,1,1000), '^(\s*-*\d+[.]*\d*\s*)+$')
| > | > |
| > | > | ans =
| > | > |
| > | > |      1
| > | > |
| > | > | After enlarging the input string to its limits, matlab finally gives a
| > | > | message that the array is too large, but never segfaults.
| > | >
| > | > I think it has been mentioned before that Matlab uses its own regexp
| > | > library that has its own set of bugs.
| > | >
| > | > Since Octave just sets up a call to the PCRE library, I think the bug
| > | > is probably in the PCRE library, and the right place to fix it is
| > | > there.  Unless maybe there is a way now to have PCRE detect this
| > | > problem and return an error code instead of infinitely recursing and
| > | > causing a segfault.
| > |
| > | Actually it's a SIGSEGV.
| >
| > There's a difference?
|
| I thought, but I was wrong. Okay, forget that.
|
| Anyway, we could catch the SIGSEGV from the pcre library and proceed
| accordingly.

There is already a handler installed for SIGSEGV, but I think it fails
in this instance.  I'm not certain why that happens, but my guess is
that calling the signal handler fails if there is no more stack space.

| > | It might be possible to change regexp.cc to set the soft limit for stack
| > | recursion (the equivalent of the above 'ulimit -s' command) to the hard
| > | limit.
| > |
| > | I don't know however what kind of consequences this has for the system
| > | in question.
| >
| > I think we should first find out why this is going into an apparently
| > infinite recursion.  If it is an error in the way that we are using
| > the PCRE functions, then maybe we can fix it.  Otherwise, I think the
| > bug should be fixed in PCRE.
|
| I don't think it's an infinite recursion (it works when given enough
| space, so it's definitely finite).

OK, but it seems like a very large number of recursive calls for what
seems to be a relatively simple regexp operating on what also seems to
be a small amount of data.

| There are already several options in the PCRE library, including a
| different implementation for regexps like this.

So should we be trying to recognize the characteristics of the regexp
and set some options before calling PCRE?  It seems to me that job
should be handled by PCRE itself.  We are just users of the library.
How are we supposed to know what kinds of regexps will cause trouble?

jwe

Re: segfault after regexp

by Thomas Weber-8

On Tue, Sep 30, 2008 at 04:16:29PM -0400, John W. Eaton wrote:
> On 30-Sep-2008, Thomas Weber wrote:
> There is already a handler installed for SIGSEGV, but I think it fails
> in this instance.  I'm not certain why that happens, but my guess is
> that, calling the signal handler fails if there is no more stack space.

Yes, according to sigaltstack(2), that's the problem. (I'm mentioning
sigaltstack here mostly for reference, so I don't have to search the net
again).

I fear however that this will turn into a very system-specific solution.
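
For reference, this is roughly what the sigaltstack(2) approach looks like. It is a minimal C++ sketch assuming a POSIX system, not what Octave's signal code does. The names `on_segv` and `install_segv_handler_on_alt_stack` and the 64 KiB buffer size are choices made up for this illustration.

```cpp
#include <signal.h>
#include <unistd.h>
#include <cstdlib>

// Handler that runs on the alternate stack. It only uses
// async-signal-safe calls and then terminates the process.
static void on_segv(int)
{
  static const char msg[] = "caught SIGSEGV on the alternate stack\n";
  ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
  (void) ignored;
  _exit(EXIT_FAILURE);
}

// Hypothetical helper: register a separate stack for signal handlers and
// install a SIGSEGV handler that uses it (SA_ONSTACK), so the handler can
// still run when the normal stack is exhausted.
bool install_segv_handler_on_alt_stack()
{
  static char alt_stack[1 << 16];   // 64 KiB, an arbitrary size for this sketch

  stack_t ss;
  ss.ss_sp = alt_stack;
  ss.ss_size = sizeof(alt_stack);
  ss.ss_flags = 0;
  if (sigaltstack(&ss, nullptr) != 0)
    return false;

  struct sigaction sa;
  sa.sa_handler = on_segv;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_ONSTACK;
  return sigaction(SIGSEGV, &sa, nullptr) == 0;
}
```

This only lets the handler run and report the overflow. Returning safely into the interpreter after such a fault is a separate and harder problem that the sketch does not attempt.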

>
> | > | It might be possible to change regexp.cc to set the soft limit for stack
> | > | recursion (the equivalent of the above 'ulimit -s' command) to the hard
> | > | limit.
> | > |
> | > | I don't know however what kind of consequences this has for the system
> | > | in question.
> | >
> | > I think we should first find out why this is going into an apparently
> | > infinite recursion.  If it is an error in the way that we are using
> | > the PCRE functions, then maybe we can fix it.  Otherwise, I think the
> | > bug should be fixed in PCRE.
> |
> | I don't think it's an infinite recursion (it works when given enough
> | space, so it's definitely finite).
>
> OK, but it seems like a very large number of recursive calls for what
> seems to be a relatively simple regexp operating on what also seems to
> be a small amount of data.

According to pcrestack(3), that's a problem that might happen with
nested, unlimited regexps.
 
> | There are already several options in the PCRE library, including a
> | different implementation for regexps like this.
>
> So should we be trying to recognize the characteristics of the regexp
> and set some options before calling PCRE?  

I don't think we will have much luck in recognizing the characteristics
of a regexp. If the data is trivial, even the most complicated regexp
will work; conversely, with enough data, even simple regexps might run
into this.


> It seems to me that job should be handled by PCRE itself.  We are just
> users of the library.  How are we supposed to know what kinds of regexps
> will cause trouble?

Well, quoting pcrestack's man page:
"As a very rough rule of thumb, you should reckon on about 500 bytes per
recursion. Thus, if you want to limit your stack usage to 8Mb, you
should set the limit at 16000 recursions. A 64Mb stack, on the  other
hand,  can support around 128000 recursions. The pcretest test program
has a command line option (-S) that can be used to increase the size of
its stack."

So, we have some estimates; with a safety factor of (say) 2, we should
be alright.

This doesn't address the important question though: what kind of memory
limit do we impose on the stack?
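
The arithmetic behind that estimate can be written down directly. This is just a sketch of the rule of thumb quoted from pcrestack(3): the function name and its default arguments are invented here, and the 500 bytes per recursion is the man page's rough figure, not a measured value.

```cpp
#include <cassert>

// Hypothetical helper: estimate how many PCRE recursion levels fit into
// stack_bytes, assuming bytes_per_recursion per level (pcrestack(3)
// suggests roughly 500) and dividing by an additional safety factor.
unsigned long recursion_limit_for_stack(unsigned long stack_bytes,
                                        unsigned long bytes_per_recursion = 500,
                                        unsigned long safety_factor = 2)
{
  assert(bytes_per_recursion > 0 && safety_factor > 0);
  return stack_bytes / (bytes_per_recursion * safety_factor);
}
```

With an 8 MiB stack and a safety factor of 2, this yields 8388 recursions, roughly half of the man page's round figure of 16000.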

        Thomas

Re: segfault after regexp

by Thomas Weber-8

On Sat, Oct 04, 2008 at 11:40:04AM +0200, Thomas Weber wrote:

> Well, quoting pcrestack's man page:
> "As a very rough rule of thumb, you should reckon on about 500 bytes per
> recursion. Thus, if you want to limit your stack usage to 8Mb, you
> should set the limit at 16000 recursions. A 64Mb stack, on the  other
> hand,  can support around 128000 recursions. The pcretest test program
> has a command line option (-S) that can be used to increase the size of
> its stack."
>
> So, we have some estimates, with a security factor of (say) 2, we should
> be alright.
>
> This doesn't address the important question though: what kind of memory
> limit do we pose on the stack?

Patch attached. I assume a maximum of 500MB on the stack (if there's no
hard limit), with a safety factor of 2.

        Thomas


# HG changeset patch
# User Thomas Weber <thomas.weber.mail@...>
# Date 1223729321 -7200
# Node ID f89e3a3bf4d106ee3d297243e42768d1b5213703
# Parent  a10397d26114998bca6c7c5570eb2feb40f77b91
Set a sensible limit on stack usage

diff --git a/src/DLD-FUNCTIONS/regexp.cc b/src/DLD-FUNCTIONS/regexp.cc
--- a/src/DLD-FUNCTIONS/regexp.cc
+++ b/src/DLD-FUNCTIONS/regexp.cc
@@ -52,9 +52,10 @@
 #include <regex.h>
 #endif
 
-// Define the maximum number of retries for a pattern that
-// possibly results in an infinite recursion.
-#define PCRE_MATCHLIMIT_MAX 10
+// Define a safety factor for PCRE's estimated stack usage
+// Used to protect against stack overflow and to estimate the maximum
+// number of recursions.
+#define PCRE_STACK_SAFETY_FACTOR 2
 
 // The regexp is constructed as a linked list to avoid resizing the
 // return values in arrays at each new match.
@@ -384,31 +385,58 @@
  {
   OCTAVE_QUIT;
 
-  int matches = pcre_exec(re, 0, buffer.c_str(),
+  // pcre_exec uses recursion aggressively, therefore we may
+  // run out of stack in the call to pcre_exec() if we use the
+  // platform's default values.
+
+  // If the hard limit from getrlimit() is unlimited, we set
+  // the stack limit to 500MB and set the number of recursions
+  // accordingly: one recursion needs approximately 500 bytes
+  // according to pcrestack(3) and we use a safety factor of
+  // PCRE_STACK_SAFETY_FACTOR.
+
+  // If the hard limit from getrlimit() is lower, we set the
+  // limit on the number of function executions to a suitable
+  // value.
+
+  // query the limits
+  pcre_extra pe;
+  pcre_config(PCRE_CONFIG_MATCH_LIMIT, static_cast <void *> (&pe.match_limit));
+  pe.flags = PCRE_EXTRA_MATCH_LIMIT;
+
+  struct rlimit rlim;
+  getrlimit(RLIMIT_STACK, &rlim);
+
+  if (rlim.rlim_max == RLIM_INFINITY)
+    // no hard limit, so we limit ourselves to 500 MB
+    rlim.rlim_cur = static_cast <rlim_t> (500 * 1024 * 1024);
+  else
+    // set soft limit to hard limit
+    rlim.rlim_cur = rlim.rlim_max;
+  
+  pe.match_limit = rlim.rlim_cur / (500 * PCRE_STACK_SAFETY_FACTOR);
+  
+  if (setrlimit(RLIMIT_STACK, &rlim) != 0)
+    {
+      error ("%s: increasing stack limit for PCRE usage failed", nm.c_str());
+      pcre_free(re);
+      return 0;
+    }
+  
+  int matches = pcre_exec(re, &pe, buffer.c_str(),
   buffer.length(), idx,
   (idx ? PCRE_NOTBOL : 0),
   ovector, (subpatterns+1)*3);
+  
 
   if (matches == PCRE_ERROR_MATCHLIMIT)
     {
-      // try harder; start with default value for MATCH_LIMIT and increase it
-      warning("Your pattern caused PCRE to hit its MATCH_LIMIT.\nTrying harder now, but this will be slow.");
-      pcre_extra pe;
-      pcre_config(PCRE_CONFIG_MATCH_LIMIT, static_cast <void *> (&pe.match_limit));
-      pe.flags = PCRE_EXTRA_MATCH_LIMIT;
-
-      int i = 0;
-      while (matches == PCRE_ERROR_MATCHLIMIT &&
-     i++ < PCRE_MATCHLIMIT_MAX)
- {
-  OCTAVE_QUIT;
-
-  pe.match_limit *= 10;
-  matches = pcre_exec(re, &pe, buffer.c_str(),
-      buffer.length(), idx,
-      (idx ? PCRE_NOTBOL : 0),
-      ovector, (subpatterns+1)*3);
- }
+      // we've hit the match limit, despite setting it to a
+      // big value above. Inform the user and fail gracefully
+      error("%s: Your pattern caused PCRE to hit its MATCH_LIMIT. "
+                    "If you have a hard limit on stack usage set, try to set it higher.", nm.c_str());
+      pcre_free(re);
+      return 0;
     }
 
   if (matches < 0 && matches != PCRE_ERROR_NOMATCH)
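
The resource-limit part of the patch can be tried in isolation. The following standalone C++ sketch follows the same idea (raise the soft stack limit to the hard limit, or to a 500 MB cap if there is no hard limit) but is not the patch itself. The function name `raise_stack_soft_limit` and its return convention are assumptions made for this example, and unlike the patch it never lowers an existing soft limit.

```cpp
#include <sys/resource.h>

// Hypothetical helper: raise the soft stack limit toward the hard limit,
// or toward cap_bytes when there is no hard limit. Returns the stack
// budget in bytes to plan for, or 0 if a system call fails.
rlim_t raise_stack_soft_limit(rlim_t cap_bytes
                              = static_cast<rlim_t>(500) * 1024 * 1024)
{
  struct rlimit rlim;
  if (getrlimit(RLIMIT_STACK, &rlim) != 0)
    return 0;

  // No hard limit: restrict ourselves to the cap, as the patch does.
  rlim_t target = (rlim.rlim_max == RLIM_INFINITY) ? cap_bytes : rlim.rlim_max;

  // Only ever raise the soft limit; leave it alone if it is already larger.
  if (rlim.rlim_cur != RLIM_INFINITY && rlim.rlim_cur < target)
    {
      rlim.rlim_cur = target;
      if (setrlimit(RLIMIT_STACK, &rlim) != 0)
        return 0;
    }

  return target;
}
```

The returned budget could then be divided by 500 bytes times the safety factor to get a limit for PCRE. Note that pcreapi(3) also documents a separate `match_limit_recursion` field (with the `PCRE_EXTRA_MATCH_LIMIT_RECURSION` flag), which limits recursion depth rather than the total number of match() calls; whether that is the better knob for this patch is left open here.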


Re: segfault after regexp

by John W. Eaton

On 11-Oct-2008, Thomas Weber wrote:

| On Sat, Oct 04, 2008 at 11:40:04AM +0200, Thomas Weber wrote:
| > Well, quoting pcrestack's man page:
| > "As a very rough rule of thumb, you should reckon on about 500 bytes per
| > recursion. Thus, if you want to limit your stack usage to 8Mb, you
| > should set the limit at 16000 recursions. A 64Mb stack, on the  other
| > hand,  can support around 128000 recursions. The pcretest test program
| > has a command line option (-S) that can be used to increase the size of
| > its stack."
| >
| > So, we have some estimates, with a security factor of (say) 2, we should
| > be alright.
| >
| > This doesn't address the important question though: what kind of memory
| > limit do we pose on the stack?
|
| Patch attached. I assume a maximum of 500MB on the stack (if there's no
| hard limit), with a safety factor of 2.

I don't think getrlimit and setrlimit are portable, so at a minimum,
you'll need a configure check and only use this method if these
functions are available.

But is this really the right place for the fix?  Or even the right
approach to take?  Octave is not the only program using PCRE that
might run into this problem.  It seems to me that it would be better
to fix it in PCRE itself, preferably by using a different algorithm
that doesn't suffer from these problems.  Modifying the stack limit
does not seem like a real fix to the actual problem.  Instead, you are
just hiding it.  The problem still exists, and will still bite for
larger problems or more complex data.

jwe

Re: segfault after regexp

by Thomas Weber-8

On Sat, Oct 11, 2008 at 09:09:24AM -0400, John W. Eaton wrote:

> On 11-Oct-2008, Thomas Weber wrote:
>
> | On Sat, Oct 04, 2008 at 11:40:04AM +0200, Thomas Weber wrote:
> | > Well, quoting pcrestack's man page:
> | > "As a very rough rule of thumb, you should reckon on about 500 bytes per
> | > recursion. Thus, if you want to limit your stack usage to 8Mb, you
> | > should set the limit at 16000 recursions. A 64Mb stack, on the  other
> | > hand,  can support around 128000 recursions. The pcretest test program
> | > has a command line option (-S) that can be used to increase the size of
> | > its stack."
> | >
> | > So, we have some estimates, with a security factor of (say) 2, we should
> | > be alright.
> | >
> | > This doesn't address the important question though: what kind of memory
> | > limit do we pose on the stack?
> |
> | Patch attached. I assume a maximum of 500MB on the stack (if there's no
> | hard limit), with a safety factor of 2.
>
> I don't think getrlimit and setrlimit are portable, so at a minimum,
> you'll need a configure check and only use this method if these
> functions are available.

I actually thought they are in POSIX. Can people with different systems
comment? For that matter, does the original crash happen on Windows or
Mac?
 
> But is this really the right place for the fix?  Or even the right
> approach to take?  Octave is not the only program using PCRE that
> might run into this problem.  

Eh, yes:
PHP:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=476419

PCRE itself:
http://bugs.exim.org/show_bug.cgi?id=704


> It seems to me that it would be better to fix it in PCRE itself,
> preferably by using a different algorithm that doesn't suffer from
> these problems.  

There is a different algorithm already implemented, pcre_dfa_exec().
It's not Perl compatible, though.

Reading through its documentation, we will just hit a different problem,
though: it needs a workspace for saving the number of different possible
matches. So we would need to choose how many partial matches we would
like to track (man pcreapi for details).

> Modifying the stack limit does not seem like a real fix to the actual
> problem.  Instead, you are just hiding it.  The problem still exists,
> and will still bite for larger problems or more complex data.

Sorry, but with enough data, your RAM won't handle that, either. There's
a limit on how much we can cater for:
1) When compiling PCRE, the user has chosen a far too large recursion
limit.
2) The soft limit in his shell on stack usage is too low for the value
from 1).
3) Then along comes Octave, simply using what it is told to use, and
fails. But now Octave should overcome 1) and 2)? I'd say if it
were trivial to overcome, PCRE would handle it itself.

PCRE's default usage means aggressive recursion, how should we change
that?

        Thomas