hammer errors

by Sascha Wildner

Hi,

ever since the last time I had CRC problems on my router box, I've
developed the habit of doing a daily 'hammer -f /dev/ad4s1d show |& grep
"^B"' to see if any new errors crept up, and today I found:

yoyodyne# hammer -f /dev/ad4s1d show |& grep "^B"
B                dataoff=a00000714d120000/65536 crc=7e4f7545
B                dataoff=a000007171380000/65536 crc=616b1cc1
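
The flagged lines have a regular shape, so a daily check could pull the fields out instead of only grepping for them. A minimal sketch in Python, assuming the line format is exactly what is shown above (a leading B, then dataoff=<hex>/<length> and crc=<hex>). The names parse_bad_line and scan are made up for illustration and are not part of hammer:

```python
import re

# Matches lines such as:
#   B                dataoff=a00000714d120000/65536 crc=7e4f7545
# The format is assumed from the output quoted in this thread.
BAD_LINE = re.compile(r"^B\s+dataoff=([0-9a-f]+)/(\d+)\s+crc=([0-9a-f]+)")


def parse_bad_line(line):
    """Return (dataoff, length, crc) for a flagged line, else None."""
    m = BAD_LINE.match(line)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), m.group(3)


def scan(lines):
    """Collect the parsed fields of every flagged line in an iterable of lines."""
    records = []
    for line in lines:
        rec = parse_bad_line(line)
        if rec is not None:
            records.append(rec)
    return records
```

To use it, pipe the 'hammer -f /dev/ad4s1d show' output into a small script that calls scan(sys.stdin) and prints or mails the result when the list is not empty.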

Console log for the recent days is:

Nov  7 03:15:19 <kern.crit> yoyodyne kernel: HAMMER: Warning: rebalance caught race against propagate
Nov  7 03:15:19 <kern.crit> yoyodyne last message repeated 2 times
Nov  8 03:05:33 <kern.crit> yoyodyne kernel: bio_page_alloc: WARNING emergency page allocation
Nov  8 03:19:41 <kern.info> yoyodyne kernel: nfs send error 32 for server 192.168.0.10:/backup
Nov  8 03:19:41 <kern.info> yoyodyne kernel: receive error 54 from nfs server 192.168.0.10:/backup
Nov  9 03:56:32 <kern.crit> yoyodyne kernel: Warning: vfsync_bp skipping dirty buffer 0xc2706098
Nov  9 03:57:03 <kern.crit> yoyodyne kernel: Warning: vfsync_bp skipping dirty buffer 0xc26eb26c

smartctl -a /dev/ad4 doesn't report any problems.

The box is running 2.4.1 (v2.4.1.8.g93de5-RELEASE, to be specific).

So my question is: What are my next steps in order to help resolve this
issue? Is there any way to get, e.g., the names of the files affected
by this problem from the output of 'hammer show'?

So far the only thing I've done is to disable nightly hammer cleanup
because DragonFly, upon encountering a CRC error, will unfortunately
simply drop to the debugger without panicking, so this doesn't get caught
by DDB_UNATTENDED as far as I can tell (Matt, are there any plans to
change this unpleasant behavior?). And I won't be near that box until
next weekend.

Regards,
Sascha

--
http://yoyodyne.ath.cx

Re: hammer errors

by Matthew Dillon

:Hi,
:
:ever since the last time I had CRC problems on my router box, I've
:developed the habit of doing a daily 'hammer -f /dev/ad4s1d show |& grep
:"^B"' to see if any new errors crept up, and today I found:
:
:yoyodyne# hammer -f /dev/ad4s1d show |& grep "^B"
:B                dataoff=a00000714d120000/65536 crc=7e4f7545
:B                dataoff=a000007171380000/65536 crc=616b1cc1

    The question is whether it is real or not.  If the filesystem is
    mounted live then the show command could be catching things in
    odd states.

:Console log for the recent days is:
:
:Nov  7 03:15:19 <kern.crit> yoyodyne kernel: HAMMER: Warning: rebalance
:caught race against propagate
:...

    None of those are serious.  Basically just debug messages that will
    be removed soon.  The emergency page allocation for BIO is unrelated
    to the filesystem code.  It's also actually just a warning (telling me
    that something is eating too many free VM pages).

:So my question is: What are my next steps in order to help resolve this
:issue? Is there any way to get, e.g., the names of the files affected
:by this problem from the output of 'hammer show'?
:
:So far the only thing I've done is to disable nightly hammer cleanup
:because DragonFly, upon encountering a CRC error, will unfortunately
:simply drop to the debugger without panicking, so this doesn't get caught
:by DDB_UNATTENDED as far as I can tell (Matt, are there any plans to
:change this unpleasant behavior?). And I won't be near that box until
:next weekend.
:
:Regards,
:Sascha

    I fixed the behavior in current.  There is now a sysctl which
    controls whether it drops into the debugger or not (and it does not
    by default).  Though it doesn't panic... maybe the sysctl should be
    modified to give it the ability to panic instead of propagating an
    error code up the call chain.  The filesystem still drops into
    read-only mode if an error is encountered.

    What you want to do now is run 'hammer -f ... show | less -B' and
    search for B, as in '/^B'.  less -B uses a fixed buffer so if you
    scroll down you basically cannot scroll back up (by much), which allows
    you to pipe gigabytes and gigabytes of text through it without it
    malloc()ing itself into oblivion.  You want to try to find the problem
    area and get more context out of it, such as the object id.  And also
    to determine whether the problem area is real or not.

    Again the filesystem has to be idle and it would be even better if it
    were offline entirely.
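
The same idea (keep only a small, fixed window of lines in memory while gigabytes of text stream past) can also be scripted. A rough sketch in Python, assuming, as in the grep above, that flagged records are exactly the lines starting with B. The function name flagged_with_context and its parameters are hypothetical, not anything hammer or less provides:

```python
from collections import deque


def flagged_with_context(lines, before=3, after=1):
    """Yield a list of lines around each line that starts with 'B'.

    Only `before` earlier lines are remembered at any time, so the input
    can be arbitrarily large without memory growing.
    """
    recent = deque(maxlen=before)  # bounded history of preceding lines
    pending = []                   # [block, trailing lines still wanted]
    for raw in lines:
        line = raw.rstrip("\n")
        # Hand this line to blocks that are still waiting for trailing context.
        for entry in pending:
            entry[0].append(line)
            entry[1] -= 1
        while pending and pending[0][1] == 0:
            yield pending.pop(0)[0]
        if line.startswith("B"):
            block = list(recent) + [line]
            if after == 0:
                yield block
            else:
                pending.append([block, after])
        recent.append(line)
    # Input ended before some blocks received all their trailing lines.
    for block, _ in pending:
        yield block
```

Feeding it the piped 'hammer -f ... show' output line by line (for example from sys.stdin) prints each flagged record together with the element line above it, which is where the object id appears.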

                                        -Matt
                                        Matthew Dillon
                                        <dillon@...>

Re: hammer errors

by Sascha Wildner

Matthew Dillon wrote:

> :So my question is: What are my next steps in order to help resolve this
> :issue? Is there any way to get, e.g., the names of the files affected
> :by this problem from the output of 'hammer show'?
> :
> :So far the only thing I've done is to disable nightly hammer cleanup
> :because DragonFly, upon encountering a CRC error, will unfortunately
> :simply drop to the debugger without panicking, so this doesn't get caught
> :by DDB_UNATTENDED as far as I can tell (Matt, are there any plans to
> :change this unpleasant behavior?). And I won't be near that box until
> :next weekend.
> :
> :Regards,
> :Sascha
>
>     I fixed the behavior in current.  There is now a sysctl which
>     controls whether it drops into the debugger or not (and it does not
>     by default).  Though it doesn't panic... maybe the sysctl should be
>     modified to give it the ability to panic instead of propagating an
>     error code up the call chain.  The filesystem still drops into
>     read-only mode if an error is encountered.
>
>     What you want to do now is run 'hammer -f ... show | less -B' and
>     search for B, as in '/^B'.  less -B uses a fixed buffer so if you
>     scroll down you basically cannot scroll back up (by much), which allows
>     you to pipe gigabytes and gigabytes of text through it without it
>     malloc()ing itself into oblivion.  You want to try to find the problem
>     area and get more context out of it, such as the object id.  And also
>     to determine whether the problem area is real or not.

OK, here's some more context for the errors. Is that enough? I fear I'm
not experienced enough at reading hammer show output. I will re-check the
filesystem in an unmounted state on the weekend.

G------ ELM 24 R obj=000000011164da5c key=000000000c250000 lo=00040002 rt=10 ot=02
                  tids 0000000111655c50:0000000000000000
B                dataoff=a00000714d120000/65536 crc=7e4f7545
                  fills=z10:58010=100%

G------ ELM  0 R obj=000000011164da5c key=00000000302e0000 lo=00040002 rt=10 ot=02
                  tids 0000000111656eb0:0000000000000000
B                dataoff=a000007171380000/65536 crc=616b1cc1
                  fills=z10:58082=100%

obj is the same for both even though they are in different parts of the
hammer show output.
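
Spotting that by eye gets tedious with more than a couple of hits, so here is a small Python sketch that groups flagged records by object id. It assumes, based only on the excerpts above, that each B line belongs to the most recent preceding line carrying an obj= field. The function name group_flagged_by_obj is invented for this example:

```python
import re
from collections import defaultdict

# Field layouts are assumed from the hammer show excerpts in this thread.
OBJ = re.compile(r"\bobj=([0-9a-f]+)")
DATAOFF = re.compile(r"^B\s+dataoff=([0-9a-f]+)")


def group_flagged_by_obj(lines):
    """Map each obj id to the dataoff values of its flagged records."""
    groups = defaultdict(list)
    current_obj = None  # obj id of the most recent element line seen
    for line in lines:
        m = OBJ.search(line)
        if m:
            current_obj = m.group(1)
            continue
        m = DATAOFF.match(line)
        if m:
            groups[current_obj].append(m.group(1))
    return dict(groups)
```

For the two records quoted above this returns {'000000011164da5c': ['a00000714d120000', 'a000007171380000']}, i.e. one object with two flagged data blocks.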

Sascha

--
http://yoyodyne.ath.cx