SATA disk error and hang until "atacontrol reinit" ?

I have a FreeBSD/SPARC machine

FreeBSD hydra.priv.oc.ietfng.org 9.0-CURRENT FreeBSD 9.0-CURRENT #11: Mon Oct 19 22:08:50 EDT 2009 root@...:/systank/obj/systank/src/sys/NWFKERN sparc64

with this controller

atapci1: <Marvell 88SX6081 SATA300 controller> port 0x300-0x3ff mem 0x600000-0x6fffff,0x800000-0xbfffff at device 1.0 on pci3

and eight SATA2 disks:

ad0: 305245MB <Seagate ST3320620AS 3.AAJ> at ata4-master SATA300
ad1: 305245MB <Seagate ST3320620AS 3.AAE> at ata5-master SATA300
ad2: 305245MB <Seagate ST3320620AS 3.AAE> at ata6-master SATA300
ad3: 305245MB <Seagate ST3320620AS 3.AAJ> at ata7-master SATA300
ad4: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata8-master SATA300
ad5: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata9-master SATA300
ad6: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata10-master SATA300
ad7: 715404MB <WDC WD7500AADS-00L5B1 01.01A01> at ata11-master SATA300

The two sets of four disks are each RAIDZ'd together, and the two RAIDZs are in one storage pool. I've been stress-testing the disks by scrubbing and find that after a few days of uptime, I will get

ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=0 LBA=103200892

(it's always ad0 that fails) and all I/O directed at this storage pool through ZFS hangs. (I have not yet tested with dd from the raw disks; didn't think to do it, sorry.) During this period, zpool status reports 1 checksum error from ad0, though I don't know whether this occurs before, after, or in synchrony with the ad0 READ_DMA FAILURE.

Previously, I just rebooted, but this time I thought to run "atacontrol reinit ata4" (ata4 being the channel holding ad0). That caused the kernel to say

ad0: WARNING - WRITE_DMA48 requeued due to channel reset LBA=625104384
ad0: FAILURE - already active DMA on this device
ad0: setting up DMA failed

zpool status now indicates that the scrub is proceeding again, and that ad0 has suffered 3 read, 1 write, and 1 checksum error. I/O directed at the storage pool works again.

Is my disk going bad, or is there something stranger going on here? Even if the disk is going bad, shouldn't the controller time out the request eventually?

Thanks much in advance.
--nwf;
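For anyone reading the READ_DMA line above: the status=51 value appears to be a hexadecimal bit mask of the ATA status register, and the names in angle brackets are the bits that are set. A minimal sketch of that decoding follows. Only READY, DSC, and ERROR come from the log itself; the other labels in the table are assumptions based on the conventional ATA status-register layout and may not match the strings the kernel prints.

```python
# Decode an ATA status byte into the names of the bits that are set.
# READY, DSC and ERROR are the labels seen in the kernel message above;
# the remaining labels are assumed names for the other register bits.
STATUS_BITS = [
    (0x80, "BUSY"),          # assumed label
    (0x40, "READY"),
    (0x20, "DEVICE_FAULT"),  # assumed label
    (0x10, "DSC"),
    (0x08, "DRQ"),           # assumed label
    (0x04, "CORRECTED"),     # assumed label
    (0x02, "INDEX"),         # assumed label
    (0x01, "ERROR"),
]


def decode_status(value):
    """Return the names of all bits set in the status byte, highest bit first."""
    return [name for mask, name in STATUS_BITS if value & mask]


if __name__ == "__main__":
    # 0x51 = 0x40 | 0x10 | 0x01
    print(decode_status(0x51))
```

Read this way, 0x51 says the drive reported itself ready and flagged an error for the command, which matches the <READY,DSC,ERROR> text in the log. It does not by itself say whether the disk, the cable, or the controller is at fault.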
Re: SATA disk error and hang until "atacontrol reinit" ?

As a follow-up, I had this happen again, without rebooting after running "atacontrol reinit ata4". That worked a few times, but the interval between failed requests was very small... At some point, reinitializing the channel simply said "no devices attached" and I went and bought a new disk, fearing the worst. Placing that on the channel, I attempted "atacontrol reinit" again and it still refused to see the new disk. I tested both the new and old disks with a different controller and they worked fine.

On the suggestion of ##freebsd, I ran "atacontrol detach ata4" and "atacontrol attach ata4"; the former completed, the latter produced this (via serial console):

hydra# atacontrol attach ata4
ata4: [ITHREAD]
60201-4533^[[2;2~Master: no device present
Slave: no device present
ast data access mmu miss t ra p: f cpuid = 0

which I assume was really meant to be "trap: fast data access mmu miss". I'm not sure what that means, but I have this lovely multi-gig core file now... which is potentially of little utility, since I get told "GDB can't read core files on this machine.", but if somebody wants something out of it, just ask.

In any case, I powered the machine off, put the old disk back, and rebooted. I am now scrubbing the RAIDZ, and it sees the old disk just fine.

Suggestions?
--nwf;