zfs crash on amd64


zfs crash on amd64

by Adam Hamsik-2

Hi folks,

I would like to ask for help with a strange issue with zfs on amd64.
On amd64, zfs panics during zpool create. The backtrace and panic
message can be found here [1]. I got several other panics (with
different backtraces), but most of them were somehow related to
SHA256Transform.

I have added some printfs to zio_checksum_SHA256 and everything looks
fine, but when the panic happens H is NULL at the access to H[0]. I'm
really not sure how this is possible, but I have checked it twice. The
panic happens here [2]. The faulting instruction is:

movl 0(%rax), %eax

I checked the rax register from ddb and it is set to 0, which is strange
because the second address printed by the debug printfs is the address
of H passed to SHA256Transform. I talked with joerg@ and he suggested
that it could be a stack overflow problem on amd64. This problem can be
seen on i386 on machines with >3 GB of memory. I would like to enlarge
the default kernel stack size on amd64 to try to fix this issue. Does
anyone have an idea how I can do that?

I'm using DIAGNOSTIC modules built with -O0 -g.

[1] http://yfrog.com/e9screenshot20091102at120p
[2] http://nxr.netbsd.org/xref/src/external/cddl/osnet/dist/uts/common/fs/zfs/sha256.c#97

Regards

Adam.


Re: zfs crash on amd64

by Adam Hamsik-2

Hi,
On Monday, Nov 2 2009, at 9:10 AM, Adam Hamsik wrote:

> Hi folks,
>
> I would like to ask you for a help with one strange issue on a amd64  
> with a zfs. On amd64 zfs panics during zpool create. Backtrace +  
> panic message can be found here[1]. I got several other panics
> (different backtrace) but most of them was somewhat related with  
> SHA256Transform.
>
> I have added some printfs to the zio_checksum_SHA256 and everything  
> looks fine, but when panic happens H[0] is NULL. I'm really not sure  
> how is this possible, but I have checked it twice. PAnic happens  
> here [2]. Panic instruction is this

I have tried increasing USPACE to 8 and it didn't help. I wonder whether
the problem could be the stack size of the kernel thread for a taskq,
because the panic happens in a taskq kthread. Where is the default
kthread stack size defined, or does a kthread have its own stack?

Regards

Adam.


Re: zfs crash on amd64

by Mindaugas Rasiukevicius

Adam Hamsik <haaaad@...> wrote:
> I have tried to increase USPACE to 8 and it doesn't helped. I wonder  
> if problem can be in a stack size of a kernel thread for a taskq.  
> Because panic happens in taskq kthread. What/where is default kthread  
> stack size defined or does it have it's own stack ?

Sounds like the wrong approach to the problem.  If your code exhausts the
kernel stack, then it likely needs to reduce its stack usage.  However, why
do you think it is a stack-related problem?

--
Mindaugas

Re: zfs crash on amd64

by David Laight

On Mon, Nov 02, 2009 at 09:10:44AM +0100, Adam Hamsik wrote:
>
> I talked with joerg@ and he  
> suggested that it can be some problem with stack overflow on amd64.  
> This problem can be seen on i386 on machines with >3Gb.

NetBSD i386 won't use memory that gets mapped above 4GB - so I don't
suppose you meant i386 did you??

If the problem doesn't happen when the system has < 3GB memory
it is more likely to do with some code failing to handle the physical
address.

> I'm using DIAGNOSTIC modules build with -O0 -g.

You really don't want to use -O0, it will cause the code to be
much, much larger, run much, much slower and use more stack.
If it hides any bugs, when you change to -O2 for production use
you'll have to debug the code again.

It should be trivial to look at the code for the active functions
to find the size of their stack frames. Easiest using objdump and
reading the asm.
You also need to check any called functions.

If you are exploding the stack by a small amount - then calling
printf might be that last straw!

I'd look at the entire data areas involved (in ddb) to see how
far back the corruption goes.  It may be that something
has been overwritten - if you can find the bounds of the
overwrite, and maybe recognise the contents, you stand half a chance.
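
The "find the bounds of the overwrite" idea can also be sketched in code,
as a complement to inspecting memory by hand in ddb. This is only an
illustration under assumptions: the poison byte, the struct and the
function names are made up for the example and are not part of zfs or
the NetBSD kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POISON 0xA5	/* arbitrary fill byte chosen for this sketch */

/* Extent of the damaged region; valid only when found is non-zero. */
struct damage {
	size_t first;	/* index of the first byte that lost the pattern */
	size_t last;	/* index of the last byte that lost the pattern */
	int found;
};

/* Fill a region with the known pattern before the suspect code runs. */
void
poison_fill(unsigned char *buf, size_t len)
{
	assert(buf != NULL);
	memset(buf, POISON, len);
}

/* Afterwards, scan the region and report where the pattern was lost. */
struct damage
find_overwrite(const unsigned char *buf, size_t len)
{
	struct damage d = { 0, 0, 0 };

	assert(buf != NULL);
	for (size_t i = 0; i < len; i++) {
		if (buf[i] != POISON) {
			if (!d.found) {
				d.first = i;
				d.found = 1;
			}
			d.last = i;
		}
	}
	return d;
}
```

Knowing the first and last damaged offsets gives the size of the
overwrite, which can then be compared against known structures such as
an interrupt frame.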

        David

--
David Laight: david@...

Re: zfs crash on amd64

by Adam Hamsik-2


On Saturday, Nov 7 2009, at 5:38 AM, Mindaugas Rasiukevicius wrote:

> Adam Hamsik <haaaad@...> wrote:
>> I have tried to increase USPACE to 8 and it doesn't helped. I wonder
>> if problem can be in a stack size of a kernel thread for a taskq.
>> Because panic happens in taskq kthread. What/where is default kthread
>> stack size defined or does it have it's own stack ?
>
> Sounds like a wrong approach to the problem.  If your code exhausts  
> kernel
> stack, then it likely need to reduce the usage of it.  However, why  
> do you
> think it is a stack related problem?

Because SHA256Transform is called from zio_checksum_SHA256 with the H
parameter pointing to an array that is local to zio_checksum_SHA256
(defined on the stack). I have a printf before the SHA256Transform call
and it shows that the passed H is not NULL. Later, in SHA256Transform, I
get a crash because H == NULL.

Place from which SHA256Transform is called:
http://nxr.aydogan.net/xref/src/external/cddl/osnet/dist/uts/common/fs/zfs/sha256.c#110

Place where the panic happens:
http://nxr.aydogan.net/xref/src/external/cddl/osnet/dist/uts/common/fs/zfs/sha256.c#97
The panic happens on H[0] += a; H is also referenced in the same function
on lines 87/88.

In ddb, the panic instruction is:

movl 0(%rax), %eax

and show registers reports that %rax is 0.
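
For readers without the source at hand, here is a minimal sketch of the
call pattern described above. It is not the real sha256.c: the function
names ending in _sketch are hypothetical and the 64 rounds are replaced
by a placeholder. Only the shape matters: the caller owns H on its
stack, and the callee reads H early and writes H[0] at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for SHA256Transform. */
void
transform_sketch(uint32_t *H, const uint8_t *block)
{
	uint32_t a;

	assert(H != NULL);	/* plays the role of the debug printf */
	a = H[0];		/* early read of H, like lines 87/88 */
	a += block[0];		/* placeholder for the 64 rounds */
	H[0] += a;		/* the statement reported as faulting */
}

/* Hypothetical stand-in for zio_checksum_SHA256. */
uint32_t
checksum_sketch(const uint8_t *block)
{
	/* Standard SHA-256 initial hash values, held in a stack array. */
	uint32_t H[8] = {
		0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
		0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
	};

	transform_sketch(H, block);
	return H[0];
}
```

In this shape there is no C-level path by which H could become NULL
between the early read and the final write, which is why the report
points at something outside the function changing its saved state.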

Regards

Adam.


Re: zfs crash on amd64

by Adam Hamsik-2


On Saturday, Nov 7 2009, at 10:05 AM, David Laight wrote:

> On Mon, Nov 02, 2009 at 09:10:44AM +0100, Adam Hamsik wrote:
>>
>> I talked with joerg@ and he
>> suggested that it can be some problem with stack overflow on amd64.
>> This problem can be seen on i386 on machines with >3Gb.
>
> NetBSD i386 won't use memory that gets mapped above 4GB - so I don't
> suppose you meant i386 did you??
>
> If the problem doesn't happen when the system has < 3GB memory
> it is more likely to do with some code failing to handle the phyaddr.
>
>> I'm using DIAGNOSTIC modules build with -O0 -g.
>
> You really don't want to use -O0, it will cause the code to be
> much, much larger, run much, much slower and use more stack.
> If it hides any bugs, when you change to -O2 for production use
> you'll have to debug the code again.

Otherwise gcc will inline functions, which makes it almost impossible
to find out what is going on.

>
> It should be trivial to look at the code for the active functions
> to find the size of their stack frames. Easiest using objdump and
> reading the asm.
> You also need to check any called functions.

Yes, I saw the results of something like this and these functions were
not there.

> If you are exploding the stack by a small amount - then calling
> printf might be that last straw!
>
> I'd look at the entire data areas involved (in ddb) to see how
> far back the corruption goes.  It may be that the somthing
> has got overwritten - if you can find the bounds of the
> overwrite, and maybe recognise the contents, you stand 1/2 a chance.

Ten lines of code above the panic line it worked just fine [1]; the
panic is at [2]. What I don't understand is why this code works on i386
and doesn't on amd64.

[1] http://nxr.aydogan.net/xref/src/external/cddl/osnet/dist/uts/common/fs/zfs/sha256.c#87
[2] http://nxr.aydogan.net/xref/src/external/cddl/osnet/dist/uts/common/fs/zfs/sha256.c#97
Regards

Adam.


Re: zfs crash on amd64

by Thor Lancelot Simon-2

On Sat, Nov 07, 2009 at 11:29:42AM +0100, Adam Hamsik wrote:
>
>> You really don't want to use -O0, it will cause the code to be
>> much, much larger, run much, much slower and use more stack.
>> If it hides any bugs, when you change to -O2 for production use
>> you'll have to debug the code again.
>
> Otherwise gcc will inline functions for me makes almost impossible to  
> find out what is going on.

Do not drive screws with a hammer.  Use a screwdriver.

Just add -fno-inline to the compilation options.
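
Where changing the build flags is inconvenient, GCC also offers a
per-function noinline attribute. The sketch below only illustrates that
attribute; the helper names are made up for the example (sigma0 follows
the rotation amounts of the SHA-256 big-sigma-0 function) and nothing
here is taken from the zfs sources.

```c
#include <assert.h>
#include <stdint.h>

/*
 * The attribute asks GCC not to inline this one function even at -O2,
 * so it keeps its own symbol and its own frame in a backtrace.
 */
__attribute__((noinline))
uint32_t
rotr32(uint32_t x, unsigned n)
{
	assert(n > 0 && n < 32);	/* shifting by 32 would be undefined */
	return (x >> n) | (x << (32 - n));
}

/* Caller: each rotr32() call below stays a real call instruction. */
uint32_t
sigma0(uint32_t x)
{
	return rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22);
}
```

This keeps the rest of the file optimized as usual, at the cost of
having to annotate each function of interest by hand.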

Thor

Re: zfs crash on amd64

by Adam Hamsik-2


On Saturday, Nov 7 2009, at 5:45 PM, Thor Lancelot Simon wrote:

> On Sat, Nov 07, 2009 at 11:29:42AM +0100, Adam Hamsik wrote:
>>
>>> You really don't want to use -O0, it will cause the code to be
>>> much, much larger, run much, much slower and use more stack.
>>> If it hides any bugs, when you change to -O2 for production use
>>> you'll have to debug the code again.
>>
>> Otherwise gcc will inline functions for me makes almost impossible to
>> find out what is going on.

Thanks to the help of many developers, especially cube@, we have found
that this panic is caused by a bug in the gcc that we use.

Regards

Adam.


Re: zfs crash on amd64

by Quentin Garnier

On Sun, Nov 08, 2009 at 04:14:18AM +0100, Adam Hamsik wrote:

>
> On Nov,Saturday 7 2009, at 5:45 PM, Thor Lancelot Simon wrote:
>
> >On Sat, Nov 07, 2009 at 11:29:42AM +0100, Adam Hamsik wrote:
> >>
> >>>You really don't want to use -O0, it will cause the code to be
> >>>much, much larger, run much, much slower and use more stack.
> >>>If it hides any bugs, when you change to -O2 for production use
> >>>you'll have to debug the code again.
> >>
> >>Otherwise gcc will inline functions for me makes almost impossible to
> >>find out what is going on.
>
> Thanks to help of many developers especially cube@ we have found
> that this panic is caused by bug in a gcc which we use.

To those who would raise an eyebrow at that somewhat bold assertion:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41990

--
Quentin Garnier - cube@... - cube@...
"See the look on my face from staying too long in one place
[...] every time the morning breaks I know I'm closer to falling"
KT Tunstall, Saving My Face, Drastic Fantastic, 2007.



Re: zfs crash on amd64

by David Laight

On Sun, Nov 08, 2009 at 08:32:42PM +0000, Quentin Garnier wrote:
>
> To those who would raise an eyebrow to that somewhat bold assertion:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41990

IIRC there is a compile option that you need for kernel code.

        David

--
David Laight: david@...

Re: zfs crash on amd64

by Rhialto

On Sun 08 Nov 2009 at 20:44:57 +0000, David Laight wrote:
> On Sun, Nov 08, 2009 at 08:32:42PM +0000, Quentin Garnier wrote:
> >
> > To those who would raise an eyebrow to that somewhat bold assertion:
> >
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41990
>
> IIRC there is a compile option that you need to kernel code.

You mean -ffreestanding? I tried that but it makes no difference. And
anyway, I don't think the C standard requires interrupts to be taken on
a different stack than the user program uses. (and if it did, why bother
allocating stackspace then at all, in a leaf function?) So that can't be
an explanation, IMHO.

> David
-Olaf.
--
___ Olaf 'Rhialto' Seibert    -- You author it, and I'll reader it.
\X/ rhialto/at/xs4all.nl      -- Cetero censeo "authored" delendum esse.

Re: zfs crash on amd64

by Quentin Garnier

On Sun, Nov 08, 2009 at 08:44:57PM +0000, David Laight wrote:
> On Sun, Nov 08, 2009 at 08:32:42PM +0000, Quentin Garnier wrote:
> >
> > To those who would raise an eyebrow to that somewhat bold assertion:
> >
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41990
>
> IIRC there is a compile option that you need to kernel code.

Ok, I understand now.  So it's not a gcc bug, but rather that we don't
compile the kernel and the modules properly on amd64.  We need to
compile them with -mno-red-zone.

--
Quentin Garnier - cube@... - cube@...
"See the look on my face from staying too long in one place
[...] every time the morning breaks I know I'm closer to falling"
KT Tunstall, Saving My Face, Drastic Fantastic, 2007.



Re: zfs crash on amd64

by David Laight

On Sun, Nov 08, 2009 at 09:48:29PM +0100, Rhialto wrote:

> On Sun 08 Nov 2009 at 20:44:57 +0000, David Laight wrote:
> > On Sun, Nov 08, 2009 at 08:32:42PM +0000, Quentin Garnier wrote:
> > >
> > > To those who would raise an eyebrow to that somewhat bold assertion:
> > >
> > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41990
> >
> > IIRC there is a compile option that you need to kernel code.
>
> You mean -ffreestanding? I tried that but it makes no difference. And
> anyway, I don't think the C standard requires interrupts to be taken on
> a different stack than the user program uses. (and if it did, why bother
> allocating stackspace then at all, in a leaf function?) So that can't be
> an explanation, IMHO.

no, -mno-red-zone

        David

--
David Laight: david@...

Re: zfs crash on amd64

by Joerg Sonnenberger

On Sun, Nov 08, 2009 at 08:53:25PM +0000, Quentin Garnier wrote:
> Ok, I understand now.  So it's not a gcc bug, but rather that we don't
> compile the kernel and the modules properly on amd64.  We need to
> compile them with -mno-red-zone.

Makefile.amd64 certainly sets -mcmodel=kernel -mno-red-zone.

Joerg

Re: zfs crash on amd64

by Rhialto

On Sun 08 Nov 2009 at 21:03:09 +0000, David Laight wrote:
> no, -mno-red-zone

Aha. Interesting. For context I'll quote what the manpage has to say
about it:

       -mno-red-zone
           Do not use a so called red zone for x86-64 code.  The red zone is
           mandated by the x86-64 ABI, it is a 128-byte area beyond the
           location of the stack pointer that will not be modified by signal
           or interrupt handlers and therefore can be used for temporary data
           without adjusting the stack pointer.  The flag -mno-red-zone
           disables this red zone.
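
To make the manpage text concrete, here is a small sketch of the kind
of function that can benefit: a leaf function whose locals fit in 128
bytes. The function names are made up for illustration, and whether a
given compiler version actually places the array in the red zone is not
guaranteed. The compiler can decide this per function, since a function
that makes no calls knows its own frame size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Leaf function: it calls nothing, so under the x86-64 ABI the compiler
 * may keep w[] in the 128 bytes below %rsp without moving %rsp first.
 */
__attribute__((noinline))
uint32_t
leaf_mix(const uint32_t *in)
{
	volatile uint32_t w[8];	/* 32 bytes; volatile keeps it in memory */
	uint32_t acc = 0;

	for (int i = 0; i < 8; i++)
		w[i] = in[i] ^ (uint32_t)i;
	for (int i = 0; i < 8; i++)
		acc += w[i];
	return acc;
}

/* A caller that exercises the leaf function. */
void
check_leaf_mix(void)
{
	const uint32_t in[8] = { 1, 1, 1, 1, 1, 1, 1, 1 };

	assert(leaf_mix(in) == 28);
}
```

One way to see the difference is to compile this with gcc -O2 -S, once
with and once without -mno-red-zone, and compare whether the prologue of
leaf_mix subtracts from %rsp. The exact assembly depends on the compiler
version.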

Wasn't the name red zone already used for the bottom of the stack,
unmapped so it would trap and detect stack overflows? Or was that some
different colour? In any case, the name is confusing in this regard.

It seems to be an attempt to save on stack pointer adjustments when
calling functions. However, the compiler needs to do a full static
analysis of the call graph of the program to show that it can take
advantage of it. Otherwise, you'll never know if the local variables of
any particular function still fit in the magic 128 allotted bytes. And
since gcc compiles programs in modules, and the linker also isn't clever
enough to adjust for this sort of thing at link time, it seems rather
useless to me. Maybe only for hand-written assembly. But who wants to
write in that crap sort of assembly language anyway? It's not VAX or
PDP-11 or even 680x0...

The example we've seen here also seems rather useless. It was a leaf
function with a big stack frame, which was allocated 128 bytes short,
using the red zone as a promise that it won't get clobbered. However,
such a construction absolutely requires interrupts to go on a separate
stack (unless hardware interrupt processing always first deducts 128
from the stack pointer before storing its interrupt frame??? No,
apparently it doesn't, since it fails inside the kernel).
And if this is done, there is no need to use the user stack anyway. So
the zone goes to waste. Furthermore, once you're adjusting a stack
pointer, there seems to be no point in not adjusting it the full size,
since the cost of adjusting it has already been taken.
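
The failure mode discussed here can be imitated with a toy model. This
is a simulation under stated assumptions, not kernel code: the stack is
an array of 64-bit words, the "interrupt" pushes five words (in 64-bit
mode the CPU pushes SS, RSP, RFLAGS, CS and RIP) onto the same stack,
and zero is just a stand-in for the pushed values. All names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STACK_WORDS 64
#define FRAME_WORDS 5	/* SS, RSP, RFLAGS, CS, RIP */

struct toy_cpu {
	uint64_t stack[STACK_WORDS];
	size_t sp;	/* index of the stack top; grows toward index 0 */
};

/* Leaf function: stores a local just below sp without moving sp. */
void
leaf_store_in_red_zone(struct toy_cpu *cpu, uint64_t value)
{
	cpu->stack[cpu->sp - 1] = value;
}

/* Interrupt taken on the same stack: the frame starts right below sp. */
void
interrupt_on_same_stack(struct toy_cpu *cpu)
{
	size_t saved = cpu->sp;

	assert(cpu->sp >= FRAME_WORDS);
	for (int i = 0; i < FRAME_WORDS; i++)
		cpu->stack[--cpu->sp] = 0;	/* stand-in for pushed words */
	cpu->sp = saved;			/* return pops the frame */
}

/* Returns what the leaf function finds in its slot after the interrupt. */
uint64_t
red_zone_after_interrupt(uint64_t pointer_value)
{
	struct toy_cpu cpu;

	memset(cpu.stack, 0xff, sizeof cpu.stack);
	cpu.sp = STACK_WORDS / 2;
	leaf_store_in_red_zone(&cpu, pointer_value);
	interrupt_on_same_stack(&cpu);
	return cpu.stack[cpu.sp - 1];
}
```

In the model, a pointer saved in the red zone no longer holds its
original value after the interrupt, which matches the shape of the
symptom reported earlier in the thread: a pointer that was valid at
function entry reads back as something else later in the same function.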

> David
-Olaf.
--
___ Olaf 'Rhialto' Seibert    -- You author it, and I'll reader it.
\X/ rhialto/at/xs4all.nl      -- Cetero censeo "authored" delendum esse.

Re: zfs crash on amd64

by Michael L. Hitch

On Sun, 8 Nov 2009, Joerg Sonnenberger wrote:

> On Sun, Nov 08, 2009 at 08:53:25PM +0000, Quentin Garnier wrote:
>> Ok, I understand now.  So it's not a gcc bug, but rather that we don't
>> compile the kernel and the modules properly on amd64.  We need to
>> compile them with -mno-red-zone.
>
> Makefile.amd64 certainly sets -mcmodel=kernel -mno-red-zone.

   The modules only use -mcmodel=kernel.  I think share/mk/bsd.klinks.mk
needs to also add -mno-red-zone to CFLAGS.

--
Michael L. Hitch mhitch@...
Computer Consultant
Information Technology Center
Montana State University Bozeman, MT USA

Re: zfs crash on amd64

by Quentin Garnier

On Sun, Nov 08, 2009 at 11:49:59PM +0100, Rhialto wrote:
[...]

You're welcome to discuss the pros and cons of the x86_64 ABI, but in
this thread it is just meaningless noise.

--
Quentin Garnier - cube@... - cube@...
"See the look on my face from staying too long in one place
[...] every time the morning breaks I know I'm closer to falling"
KT Tunstall, Saving My Face, Drastic Fantastic, 2007.



Re: zfs crash on amd64

by Quentin Garnier

On Sun, Nov 08, 2009 at 05:17:52PM -0700, Michael L. Hitch wrote:

> On Sun, 8 Nov 2009, Joerg Sonnenberger wrote:
>
> >On Sun, Nov 08, 2009 at 08:53:25PM +0000, Quentin Garnier wrote:
> >>Ok, I understand now.  So it's not a gcc bug, but rather that we don't
> >>compile the kernel and the modules properly on amd64.  We need to
> >>compile them with -mno-red-zone.
> >
> >Makefile.amd64 certainly sets -mcmodel=kernel -mno-red-zone.
>
>   The modules only use -mcmodel=kernel.  I think share/mk/bsd.klinks.mk
> needs to also add -mno-red-zone to CFLAGS.

Actually I think the right way to do that would be to create
sys/arch/amd64/include/Makefile.inc and put the CFLAGS addition there.

--
Quentin Garnier - cube@... - cube@...
"See the look on my face from staying too long in one place
[...] every time the morning breaks I know I'm closer to falling"
KT Tunstall, Saving My Face, Drastic Fantastic, 2007.

