Nondeterministic segmentation fault in find_in_catalogue (sgml2pl.so)

by Roberto Bagnara

We are observing a nondeterministic segmentation fault in the function
find_in_catalogue of module sgml2pl.so.

We have observed that this function has a very large stack frame
and the segmentation fault happens just after a stack pointer
decrease of more than 80 kB.

We suspect that the stack expander gets confused and that this is
the cause of the segmentation fault, which may be nondeterministic
because Linux randomizes the start address of the stack.

Cheers,

     Abramo and Roberto Bagnara

--
Prof. Roberto Bagnara
Computer Science Group
Department of Mathematics, University of Parma, Italy
http://www.cs.unipr.it/~bagnara/
mailto:bagnara@...
_______________________________________________
SWI-Prolog mailing list
SWI-Prolog@...
https://mailbox.iai.uni-bonn.de/mailman/listinfo.cgi/swi-prolog

Re: Nondeterministic segmentation fault in find_in_catalogue (sgml2pl.so)

by Jan Wielemaker-3

On Tue, 2009-09-29 at 21:01 +0200, Roberto Bagnara wrote:

> We are observing a nondeterministic segmentation fault in the function
> find_in_catalogue of module sgml2pl.so.
>
> We have observed that this function has a very large stack frame
> and the segmentation fault happens just after a stack pointer
> decrease of more than 80 kB.
>
> We suspect that the stack expander gets confused and this is
> the reason of the segmentation fault, which may be nondeterministic
> due to the randomization of the begin-of-stack in Linux.

Try compiling with -g to get the line number and further information
about the crash.  Recent changes in 5.7.x fix at least two concurrency
issues in the parser's error recovery, but they seem unrelated.

        Cheers --- Jan


Re: Nondeterministic segmentation fault in find_in_catalogue (sgml2pl.so)

by Roberto Bagnara

Jan Wielemaker wrote:

> On Tue, 2009-09-29 at 21:01 +0200, Roberto Bagnara wrote:
>> We are observing a nondeterministic segmentation fault in the function
>> find_in_catalogue of module sgml2pl.so.
>>
>> We have observed that this function has a very large stack frame
>> and the segmentation fault happens just after a stack pointer
>> decrease of more than 80 kB.
>>
>> We suspect that the stack expander gets confused and this is
>> the reason of the segmentation fault, which may be nondeterministic
>> due to the randomization of the begin-of-stack in Linux.
>
> Try to compile with -g and get line-number and further info about the
> crash.  Recent changes in 5.7.x fix at least two concurrency  issues
> in error-recovery of the parser, but this seems unrelated.

Hi Jan,

I have done what you suggested.  Here are the results:

Program terminated with signal 11, Segmentation fault.
[New process 30680]
#0  find_in_catalogue (kind=3, name=0x1c05cc0, pubid=0x0, sysid=0x0, ci=1)
     at catalog.c:548
548 { ichar penname[FILENAME_MAX];
(gdb) frame 0
#0  find_in_catalogue (kind=3, name=0x1c05cc0, pubid=0x0, sysid=0x0, ci=1)
     at catalog.c:548
548 { ichar penname[FILENAME_MAX];
Current language:  auto; currently c
(gdb) list
543
544 ichar const *
545 find_in_catalogue(int kind,
546  ichar const *name,
547  ichar const *pubid, ichar const *sysid, int ci)
548 { ichar penname[FILENAME_MAX];
549  const size_t penlen = sizeof(penname)/sizeof(ichar);
550  catalogue_item_ptr item;
551  ichar const *result;
552  catalog_file *catfile;
(gdb) x/i $pc
0x7fffacc9eed9 <find_in_catalogue+25>: mov    %rsi,0x20(%rsp)
(gdb)

And here is the assembly code for find_in_catalogue:

0000000000010ec0 <find_in_catalogue>:
    10ec0:       41 57                   push   %r15
    10ec2:       31 c0                   xor    %eax,%eax
    10ec4:       41 89 ff                mov    %edi,%r15d
    10ec7:       41 56                   push   %r14
    10ec9:       41 55                   push   %r13
    10ecb:       41 54                   push   %r12
    10ecd:       49 89 cc                mov    %rcx,%r12
    10ed0:       55                      push   %rbp
    10ed1:       53                      push   %rbx
    10ed2:       48 81 ec 48 40 01 00    sub    $0x14048,%rsp     (1)
    10ed9:       48 89 74 24 20          mov    %rsi,0x20(%rsp)   (2)

As you can see, instruction (1) extends the stack by 81992 bytes
(hex 14048).  The segmentation fault occurs at the next
instruction (2), i.e., at the first access via the new stack pointer.

I wonder if something is not working in the automatic stack extension
mechanism.

Cheers,

     Roberto

--
Prof. Roberto Bagnara
Computer Science Group
Department of Mathematics, University of Parma, Italy
http://www.cs.unipr.it/~bagnara/
mailto:bagnara@...

Re: Nondeterministic segmentation fault in find_in_catalogue (sgml2pl.so)

by Roberto Bagnara

Jan Wielemaker wrote:

> Hi Roberto,
>
> On Wednesday 30 September 2009 03:32:08 pm you wrote:
>> [...]
>> I wonder if something is not working in the automatic stack extension
>> mechanism.
>
> Wouldn't be my first guess. Could it be that we have a simple stack
> overflow? What is your limit (ulimit -s)? Do the problems disappear if
> you enlarge that?

My default ulimit -s is 10240.  I can double or quadruple it and
the problem persists.  However, if I turn off randomization of the
virtual address space (with setarch -R) the problem disappears,
even if I reduce ulimit -s to half or a quarter.
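
For context, setarch -R works by setting the ADDR_NO_RANDOMIZE personality flag and then executing the target program. The sketch below does the same from inside a C program, which may be handy when wrapping every invocation in setarch is impractical. The helper names without_aslr and reexec_without_aslr are hypothetical, and re-executing through /proc/self/exe is an assumption that holds only on Linux.

```c
#include <sys/personality.h>
#include <unistd.h>

/* Return the given persona value with address-space randomization
 * switched off (hypothetical helper, not part of any library). */
unsigned long without_aslr(unsigned long persona) {
  return persona | ADDR_NO_RANDOMIZE;
}

/* Re-execute the running program with randomization disabled.
 * Returns 0 if randomization was already off, -1 on error;
 * does not return when the re-exec succeeds. */
int reexec_without_aslr(char **argv) {
  int persona = personality(0xffffffff);   /* 0xffffffff only queries */
  if (persona == -1)
    return -1;
  if (persona & ADDR_NO_RANDOMIZE)
    return 0;                              /* nothing to do */
  if (personality(without_aslr((unsigned long)persona)) == -1)
    return -1;
  /* The flag takes effect for the address space built by the next exec. */
  execv("/proc/self/exe", argv);
  return -1;                               /* reached only if execv failed */
}
```

A program would call reexec_without_aslr(argv) as the first statement of main and continue normally when it returns 0. Some sandboxes filter the personality system call, in which case the function returns -1 and the program keeps running with a randomized layout.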

> Then there are two issues: what makes the stack so
> big? (complete backtrace?)

Indeed, the stack is not big: it is only about 18 kB deep just before it is
extended by the 80 kB mentioned above.

Here is the complete stack trace:

(gdb) info stack
#0  find_in_catalogue (kind=3, name=0x1c05cc0, pubid=0x0, sysid=0x0, ci=1)
     at catalog.c:548
#1  0x00007fffacc9c53a in open_element (p=0x1a75390, e=0x1c05ce0, warn=1)
     at parser.c:2868
#2  0x00007fffacc99df2 in process_begin_element (p=0x1a75390, decl=0x1c05a20)
     at parser.c:3421
#3  0x00007fffacc9a5d3 in process_declaration (p=0x1a75390, decl=0x1c05a20)
     at parser.c:3832
#4  0x00007fffacc9bb51 in putchar_dtd_parser (p=0x1a75390, chr=62)
     at parser.c:5013
#5  0x00007fffacca5cdf in pl_sgml_parse (parser=<value optimized out>,
     options=128) at sgml2pl.c:1793
#6  0x00007fffb55489d2 in PL_next_solution ()
    from /usr/local/lib/pl-5.7.15/lib/x86_64-linux/libpl.so.5.7.15
#7  0x00007fffb554d64e in PL_call_predicate ()
    from /usr/local/lib/pl-5.7.15/lib/x86_64-linux/libpl.so.5.7.15
#8  0x00000000006bc2c5 in main (argc=33, argv=0x7fffa01e4d28) at main.cc:197
Current language:  auto; currently c
(gdb)

Here is the stack size computation:

(gdb) info frame 0
Stack frame at 0x7fffa01e0440:
  rip = 0x7fffacc9eed9 in find_in_catalogue (catalog.c:548);
     saved rip 0x7fffacc9c53a
  called by frame at 0x7fffa01e2590
  source language c.
  Arglist at 0x7fffa01cc3b8, args: kind=3, name=0x1c05cc0, pubid=0x0,
     sysid=0x0, ci=1
  Locals at 0x7fffa01cc3b8, Previous frame's sp is 0x7fffa01e0440
  Saved registers:
   rbx at 0x7fffa01e0408, rbp at 0x7fffa01e0410, r12 at 0x7fffa01e0418,
   r13 at 0x7fffa01e0420, r14 at 0x7fffa01e0428, r15 at 0x7fffa01e0430,
   rip at 0x7fffa01e0438

(gdb) info frame 8
Stack frame at 0x7fffa01e4c60:
  rip = 0x6bc2c5 in main (main.cc:197); saved rip 0x3e33c1e576
  caller of frame at 0x7fffa01e4bd0
  source language c++.
  Arglist at 0x7fffa01e4c50, args: argc=33, argv=0x7fffa01e4d28
  Locals at 0x7fffa01e4c50, Previous frame's sp is 0x7fffa01e4c60
  Saved registers:
   rbx at 0x7fffa01e4c40, rbp at 0x7fffa01e4c50, r12 at 0x7fffa01e4c48,
   rip at 0x7fffa01e4c58

(gdb) print 0x7fffa01e0440LL-0x7fffa01e4c60LL
$3 = -18464

> and why is this frame so big? AFAIK, Linux
> FILENAME_MAX is 4K (could you print it?). ichar is wchar_t (4 bytes), so
> that makes 16 kB. I see no other arrays there and the rest should be
> peanuts. What am I missing?

I don't know. I have checked different versions of gcc on different
architectures and the results do not change.

$ gcc -c -O2 -Wall -fpic -DHAVE_CONFIG_H -pthread -fPIC -D_REENTRANT -D__SWI_PROLOG__ -D__SWI_EMBEDDED__ -I/usr/local/distrib/pl-5.7.15/include -I. -I/usr/local/distrib/pl-5.7.15/include -o catalog.o catalog.c
$ objdump --disassemble catalog.o | grep -A 11 -m 1 find_in_catalogue
00000000000004b0 <find_in_catalogue>:
  4b0: 41 57                 push   %r15
  4b2: 31 c0                 xor    %eax,%eax
  4b4: 41 89 ff             mov    %edi,%r15d
  4b7: 41 56                 push   %r14
  4b9: 41 55                 push   %r13
  4bb: 41 54                 push   %r12
  4bd: 49 89 cc             mov    %rcx,%r12
  4c0: 55                   push   %rbp
  4c1: 53                   push   %rbx
  4c2: 48 81 ec 48 40 01 00 sub    $0x14048,%rsp      <-- ~80 kB
  4c9: 48 89 74 24 20       mov    %rsi,0x20(%rsp)

Notice that, if optimizations are turned off,
much less space is allocated:

$ gcc -c -Wall -fpic -DHAVE_CONFIG_H -pthread -fPIC -D_REENTRANT -D__SWI_PROLOG__ -D__SWI_EMBEDDED__ -I/usr/local/distrib/pl-5.7.15/include -I. -I/usr/local/distrib/pl-5.7.15/include -o catalog.o catalog.c
$ objdump --disassemble catalog.o | grep -A 11 -m 1 find_in_catalogue
0000000000000b74 <find_in_catalogue>:
      b74: 55                   push   %rbp
      b75: 48 89 e5             mov    %rsp,%rbp
      b78: 48 81 ec 70 40 00 00 sub    $0x4070,%rsp <-- ~ 16 kB
      b7f: 89 bd dc bf ff ff     mov    %edi,-0x4024(%rbp)
      b85: 48 89 b5 d0 bf ff ff mov    %rsi,-0x4030(%rbp)
      b8c: 48 89 95 c8 bf ff ff mov    %rdx,-0x4038(%rbp)
      b93: 48 89 8d c0 bf ff ff mov    %rcx,-0x4040(%rbp)
      b9a: 44 89 85 bc bf ff ff mov    %r8d,-0x4044(%rbp)
      ba1: 48 c7 45 e0 00 10 00 movq   $0x1000,-0x20(%rbp)
      ba8: 00
      ba9: b8 00 00 00 00       mov    $0x0,%eax

Do you observe the same numbers?
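
The frame sizes above can be set against the size of a single buffer with a few lines of C. This is only a sketch: it assumes, as stated in the thread, that ichar is wchar_t, and the function names penname_bytes and whole_buffers are hypothetical. With FILENAME_MAX equal to 4096 and a 4-byte wchar_t, one buffer is 16384 bytes, so the unoptimized frame (0x4070) holds one buffer and the optimized frame (0x14048) is five buffers plus 72 bytes. That would be consistent with several such buffers sharing one frame after optimization, but the thread does not confirm this reading.

```c
#include <stddef.h>
#include <stdio.h>

/* Bytes taken by one buffer declared as `ichar buf[FILENAME_MAX]`,
 * assuming ichar is wchar_t (an assumption taken from the thread). */
size_t penname_bytes(void) {
  return (size_t)FILENAME_MAX * sizeof(wchar_t);
}

/* Number of whole buffers of buf_bytes that fit in frame_bytes. */
size_t whole_buffers(size_t frame_bytes, size_t buf_bytes) {
  return frame_bytes / buf_bytes;
}
```

Printing FILENAME_MAX, sizeof(wchar_t) and whole_buffers(0x14048, penname_bytes()) on the affected machine would answer the question about FILENAME_MAX directly.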

Just to make sure this is not a bug in gcc, I tried compiling
the same file with the same options using icc, the Intel C compiler.
With icc the stack allocation is ~84 kB, and there is no
difference between optimized and non-optimized compilation:

$ icc -c -Wall -fpic -DHAVE_CONFIG_H -pthread -fPIC -D_REENTRANT -D__SWI_PROLOG__ -D__SWI_EMBEDDED__ -I/usr/local/distrib/pl-5.7.15/include -I. -I/usr/local/distrib/pl-5.7.15/include -o catalog.o catalog.c
catalog.c(159): remark #1418: external function definition with no prior declaration
   register_catalog_file_unlocked(const ichar *file, catalog_location where)
   ^

catalog.c(312): remark #869: parameter "buflen" was never referenced
   scan_overflow(size_t buflen)
                        ^

$ objdump --disassemble catalog.o | grep -A 11 -m 1 find_in_catalogue
0000000000000350 <find_in_catalogue>:
      350: 41 57                 push   %r15
      352: 41 56                 push   %r14
      354: 41 55                 push   %r13
      356: 41 54                 push   %r12
      358: 55                   push   %rbp
      359: 53                   push   %rbx
      35a: 48 81 ec 48 50 01 00 sub    $0x15048,%rsp  <-- ~84 kB
      361: 44 89 84 24 18 50 01 mov    %r8d,0x15018(%rsp)
      368: 00
      369: 48 89 94 24 20 50 01 mov    %rdx,0x15020(%rsp)
      370: 00

What can I do to help narrow down the problem further?

Cheers,

    Roberto

--
Prof. Roberto Bagnara
Computer Science Group
Department of Mathematics, University of Parma, Italy
http://www.cs.unipr.it/~bagnara/
mailto:bagnara@...

Re: Nondeterministic segmentation fault in find_in_catalogue (sgml2pl.so)

by Nicolas Pelletier-3


Re: Nondeterministic segmentation fault in find_in_catalogue (sgml2pl.so)

by Roberto Bagnara

Nicolas Pelletier wrote:
> Hello,
>
> The following links may be useful:
>
> http://kernelnewbies.org/Linux_2_6_25#head-fc71477174376b69ac7c3cd153ed513d25bf2cae
> http://lkml.indiana.edu/hypermail/linux/kernel/0605.2/1294.html
> http://search.luky.org/linux-kernel.2005/msg10700.html
>
> Maybe you should also try with a different kernel.

Hi Nicolas.

Yes, yesterday Abramo was able to show, using the attached program,
that the problem is due to a bug in the Linux kernel.  Please note that
this is on-topic for SWI-Prolog/Linux users because the sgml package
(and possibly others) easily triggers the bug.  To check
whether your system is vulnerable, you can do the following:

$ gcc -O2 -o detect_kernel_bug detect_kernel_bug.c
$ while : ; do ./detect_kernel_bug 1000 1000 || break ; done
82516
Segmentation fault
$ while : ; do ./detect_kernel_bug 1000 1000 || break ; done
79892
Segmentation fault
$ while : ; do ./detect_kernel_bug 1000 1000 || break ; done
81428
Segmentation fault
$ while : ; do ./detect_kernel_bug 1000 1000 || break ; done
80628
Segmentation fault

If your system is immune, the while loop will run until you hit control-C.
Otherwise the program will print the distance between the end of the
memory area it mmapped and the address of argc, which approximately
corresponds to the stack pointer value just before the allocation
performed by the program, and then crash with a segmentation fault.

Luckily, the bug was fixed in the Linux kernel on September 8th,
2009:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=80938332d8cf652f6b16e0788cf0ca136befe0b5

It may take some time, though, before the fix propagates.
Meanwhile, a workaround is to disable randomization of the
virtual address space using setarch -R.

Cheers,

    Roberto

--
Prof. Roberto Bagnara
Computer Science Group
Department of Mathematics, University of Parma, Italy
http://www.cs.unipr.it/~bagnara/
mailto:bagnara@...

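#include <sys/types.h>
#include <sys/mman.h>
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

// Run it with ./a.out mmap_MB alloca_KB [verbose]
int main(int argc, char **argv) {
  if (argc < 3) {
    fprintf(stderr, "usage: %s mmap_MB alloca_KB [verbose]\n", argv[0]);
    return 2;
  }
  size_t mmap_sz = atoi(argv[1]);
  size_t alloca_sz = atoi(argv[2]);
  mmap_sz *= 1024 * 1024;
  alloca_sz *= 1024;
  // Map a large inaccessible region.  MAP_ANON needs no backing file, so fd is -1.
  char* mmapped = mmap(0, mmap_sz, PROT_NONE, MAP_ANON|MAP_PRIVATE|MAP_NORESERVE, -1, 0L);
  // mmap reports failure with MAP_FAILED, not with a null pointer.
  assert(mmapped != MAP_FAILED);
  // Distance from the end of the mapping up to (roughly) the current stack pointer.
  long d = (char*)&argc - (mmapped+mmap_sz);
  // Report when the stack allocation below would run into the mapping.
  if (argc > 3 || (d >= 0 && d < (long)alloca_sz)) {
    printf("%ld\n", d);
    fflush(stdout);
  }
  // Grow the stack by alloca_sz and touch its lowest byte; this is the
  // access that faults when the mapping lies too close to the stack.
  char* stacked = __builtin_alloca(alloca_sz);
  *stacked = 0;
  return 0;
}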

Re: Nondeterministic segmentation fault in find_in_catalogue (sgml2pl.so)

by Jan Wielemaker-3

On Friday 02 October 2009 10:07:16 am Roberto Bagnara wrote:
> while : ; do ./detect_kernel_bug 1000 1000 || break ; done

Thanks.  My machine is also affected :-)  I propose to ignore it
in general.  Typically, I'd expect this to be fixed in 1 or 2
months.  If you have problems with it, decrease FILENAME_MAX
as a workaround.

        --- Jan

