SATA disk error and hang until "atacontrol reinit" ?

by Nathaniel W Filardo

I have a FreeBSD/SPARC machine
  FreeBSD hydra.priv.oc.ietfng.org 9.0-CURRENT FreeBSD 9.0-CURRENT
  #11: Mon Oct 19 22:08:50 EDT 2009
  root@...:/systank/obj/systank/src/sys/NWFKERN
  sparc64
with a
  atapci1: <Marvell 88SX6081 SATA300 controller> port 0x300-0x3ff
    mem 0x600000-0x6fffff,0x800000-0xbfffff at device 1.0 on pci3
and eight SATA2 disks:
  ad0: 305245MB <Seagate ST3320620AS 3.AAJ> at ata4-master SATA300
  ad1: 305245MB <Seagate ST3320620AS 3.AAE> at ata5-master SATA300
  ad2: 305245MB <Seagate ST3320620AS 3.AAE> at ata6-master SATA300
  ad3: 305245MB <Seagate ST3320620AS 3.AAJ> at ata7-master SATA300
  ad4: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata8-master SATA300
  ad5: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata9-master SATA300
  ad6: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata10-master SATA300
  ad7: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata11-master SATA300

The two sets of four disks are each RAIDZ'd together, and the two RAIDZs are
in one storage pool.

I've been stress-testing the disks by scrubbing and find that after a few
days of uptime, I will get
  ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=0 LBA=103200892
(It's always ad0 that fails) and all I/O directed at this storage pool through
ZFS hangs. (I have not yet tested with dd from the raw disks; didn't think
to do it, sorry.)  During this period, zpool status reports 1 checksum error
from ad0, though I don't know if this occurs before, after, or in
synchrony with the ad0 READ_DMA FAILURE.
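
For reference, the status=51 value in the READ_DMA line above is a hexadecimal bit field, and the names in angle brackets are the bits that are set. A minimal sketch of that decoding follows, assuming the standard ATA status register layout; the table and the decode_ata_status helper are illustrative names, not part of any FreeBSD tool, and the label for the 0x20 bit in particular varies between sources.

```python
# Assumed bit layout of the ATA status register (standard ATA meanings).
# The 0x20 label is an assumption; sources name this bit differently.
ATA_STATUS_BITS = [
    (0x80, "BUSY"),
    (0x40, "READY"),
    (0x20, "DEVICE_FAULT"),
    (0x10, "DSC"),
    (0x08, "DRQ"),
    (0x04, "CORRECTABLE"),
    (0x02, "INDEX"),
    (0x01, "ERROR"),
]


def decode_ata_status(hex_status):
    """Return the names of the bits set in a hex status string such as '51'."""
    value = int(hex_status, 16)
    return [name for mask, name in ATA_STATUS_BITS if value & mask]


if __name__ == "__main__":
    # 0x51 = 0x40 | 0x10 | 0x01
    print(decode_ata_status("51"))
```

Under these assumptions, 0x51 decodes to READY, DSC, and ERROR, which matches what the kernel printed: the drive reported itself ready and flagged an error, rather than staying busy.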

Previously, I just rebooted, but this time I thought to run "atacontrol
reinit ata4" (which is the channel holding ad0).  That caused the kernel to
say
  ad0: WARNING - WRITE_DMA48 requeued due to channel reset LBA=625104384
  ad0: FAILURE - already active DMA on this device
  ad0: setting up DMA failed
zpool status now indicates that the scrub is proceeding again, and that ad0
has suffered 3 read, 1 write, and 1 checksum error.  I/O directed at the
storage tank works again.
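
The kernel messages quoted so far share a common shape (device, severity, text, and sometimes an LBA), so they can be pulled apart mechanically when comparing failures across several days of logs. The sketch below is one way to do that; parse_ata_line and the regular expression are hypothetical helpers written against only the lines shown in this message, and the 512-byte sector size is an assumption about these disks.

```python
import re

# Assumption: 512-byte logical sectors.
SECTOR_BYTES = 512

# Matches lines such as "ad0: FAILURE - READ_DMA ... LBA=103200892".
# The trailing LBA field is optional because not every message carries one.
LINE_RE = re.compile(
    r"^(?P<dev>ad\d+): (?P<level>FAILURE|WARNING) - (?P<msg>.*?)"
    r"(?: LBA=(?P<lba>\d+))?$"
)


def parse_ata_line(line):
    """Parse one ata(4)-style kernel message; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    lba = int(m.group("lba")) if m.group("lba") else None
    return {
        "device": m.group("dev"),
        "level": m.group("level"),
        "message": m.group("msg"),
        "lba": lba,
        # Byte offset on the raw disk, under the sector-size assumption above.
        "byte_offset": lba * SECTOR_BYTES if lba is not None else None,
    }


if __name__ == "__main__":
    sample = [
        "ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=0 LBA=103200892",
        "ad0: WARNING - WRITE_DMA48 requeued due to channel reset LBA=625104384",
        "ad0: FAILURE - already active DMA on this device",
        "ad0: setting up DMA failed",
    ]
    for line in sample:
        print(parse_ata_line(line))
```

Under the same sector-size assumption, the LBA could be used directly as a block offset for a raw read test of the failing region, which is the dd test mentioned earlier but not yet tried.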

Is my disk going bad or is there something more funny here?  Even if the
disk is going bad, shouldn't the controller time out the request eventually?

Thanks much in advance.
--nwf;



Re: SATA disk error and hang until "atacontrol reinit" ?

by Nathaniel W Filardo

As a follow-up, this happened again, without my having rebooted since running
"atacontrol reinit ata4".  Reinitializing worked a few times, but the interval
between failed requests was very short.  At some point, reinitializing the
channel simply said "no devices attached" and I went and bought a new disk, fearing
the worst.  Placing that on the channel, I attempted "atacontrol reinit"
again and it still refused to see the new disk.  I tested both the new and
old disks with a different controller and they worked fine.  On the
suggestion of ##freebsd, I ran "atacontrol detach ata4" and "atacontrol
attach ata4"; the former completed, the latter produced this (via serial
console):

hydra# atacontrol attach ata4
ata4: [ITHREAD]
 60201-4533^[[2;2~Master:      no device present
Slave:       no device present
ast data access mmu miss   t                ra                p:
f
cpuid = 0

which I assume was really meant to be "trap: fast data access mmu miss".  I
am not sure what that means, but I have this lovely multi-gig core file
now... which is potentially of little utility since I get told "GDB can't
read core files on this machine." but if somebody wants something out of it,
just ask.

In any case, I powered the machine off, put the old disk back, and rebooted.
I am now scrubbing the RAIDZ, and it sees the old disk just fine.

Suggestions?
--nwf;

