SATA DMA errors on second ICH10 bus

by Dylan Alex Simon

I have four identical SATA300 disks on a Supermicro C2SEA which has 6 total
SATA ports on an ICH10.  This is from 20090114 CURRENT sources with a custom
kernel (no PREEMPTION, INVARIANTS, WITNESS, and many drivers removed, but
otherwise same as GENERIC).

The first two disks (ad6 ad7) seem to work fine, and I've done buildworld,
zfs, and nfs tests on both of them together and separately.  The second two
disks (ad8 ad9) work okay for some things (labeling, zpool creation, gmirror
creation, dd slice), but as soon as I start doing anything more complicated
involving at least one of them (gmirror write access, cp to ufs partition, cp
to zfs over nfs, etc.) I get the following errors on all involved disks
(including the first two):

Jan 15 17:35:07 lust kernel: ad8: FAILURE - load data
Jan 15 17:35:07 lust kernel: ad8: setting up DMA failed
Jan 15 17:35:07 lust kernel: ad8: FAILURE - load data
Jan 15 17:35:07 lust kernel: ad8: setting up DMA failed
Jan 15 17:35:07 lust kernel: g_vfs_done():ad8s1e[WRITE(offset=1881014272, length=131072)]error = 5
Jan 15 17:35:07 lust kernel: ad6: FAILURE - load data
Jan 15 17:35:07 lust kernel: ad6: setting up DMA failed
Jan 15 17:35:07 lust kernel: g_vfs_done():ad6s1e[READ(offset=4117364736, length=32768)]error = 5
Jan 15 17:35:07 lust kernel: vnode_pager_getpages: I/O read error
Jan 15 17:35:07 lust kernel: vm_fault: pager read error, pid 985 (cp)
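Each g_vfs_done() line gives the failing request as a byte offset and length within the named GEOM provider (the partition, e.g. ad8s1e, not the whole disk), followed by an errno value. Error 5 is EIO. The sketch below converts those numbers into sector positions. It assumes 512-byte sectors, and bio_to_sectors is a hypothetical helper, not a FreeBSD tool:

```sh
#!/bin/sh
# Hypothetical helper: convert the offset/length pair from a g_vfs_done()
# message into 512-byte sectors, relative to the start of the provider
# (e.g. ad8s1e), not the start of the disk. Assumes 512-byte sectors.
SECTOR=512

bio_to_sectors() {
    offset=$1
    length=$2
    first=$((offset / SECTOR))
    count=$((length / SECTOR))
    echo "first_sector=$first sectors=$count"
}

bio_to_sectors 1881014272 131072   # ad8s1e WRITE from the log
bio_to_sectors 4117364736 32768    # ad6s1e READ from the log
```

The two calls print first_sector=3673856 sectors=256 and first_sector=8041728 sectors=64. These are the write to ad8s1e and the read from ad6s1e that failed in the log above.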

This continues for a while, and then with UFS the system panics pretty soon.
With ZFS, most processes start hanging after a while.  (7.1 found the disks
but failed to finish booting, freezing right after probing.  The only related
issue I could find is kern/125859.)

I'm happy to provide any information needed or to try patches.  I would also
like to know if there's a way to turn off DMA on just these two disks
(atacontrol mode doesn't seem to set anything but SATA300).

Thanks,
:-Dylan


FreeBSD lust.cns.nyu.edu 8.0-CURRENT FreeBSD 8.0-CURRENT #0: Wed Jan 14 19:58:58 EST 2009     dylan@...:/usr/obj/usr/src/sys/SIN  amd64

dmesg (partial):
lust kernel: pcib3: <ACPI PCI-PCI bridge> at device 30.0 on pci0
lust kernel: pci3: <ACPI PCI bus> on pcib3
lust kernel: atapci0: <ITE IT8213F UDMA133 controller> port 0xec00-0xec07,0xe880-0xe883,0xe800-0xe807,0xe480-0xe483,0xe400-0xe40f irq 22 at device 4.0 on pci3
lust kernel: atapci0: [ITHREAD]
lust kernel: ata2: <ATA channel 0> on atapci0
lust kernel: ata2: [ITHREAD]
lust kernel: pci3: <serial bus, FireWire> at device 8.0 (no driver attached)
lust kernel: isab0: <PCI-ISA bridge> at device 31.0 on pci0
lust kernel: isa0: <ISA bus> on isab0
lust kernel: atapci1: <Intel ICH10 SATA300 controller> port 0xc400-0xc407,0xc080-0xc083,0xc000-0xc007,0xbc00-0xbc03,0xb880-0xb88f,0xb800-0xb80f irq 19 at device 31.2 on pci0
lust kernel: atapci1: [ITHREAD]
lust kernel: ata3: <ATA channel 0> on atapci1
lust kernel: ata3: [ITHREAD]
lust kernel: ata4: <ATA channel 1> on atapci1
lust kernel: ata4: [ITHREAD]
lust kernel: pci0: <serial bus, SMBus> at device 31.3 (no driver attached)
lust kernel: atapci2: <Intel ICH10 SATA300 controller> port 0xb400-0xb407,0xb080-0xb083,0xb000-0xb007,0xac00-0xac03,0xa880-0xa88f,0xa800-0xa80f irq 19 at device 31.5 on pci0
lust kernel: atapci2: [ITHREAD]
lust kernel: ata5: <ATA channel 0> on atapci2
lust kernel: ata5: [ITHREAD]
lust kernel: ata6: <ATA channel 1> on atapci2
lust kernel: ata6: [ITHREAD]
lust kernel: est: CPU supports Enhanced Speedstep, but is not recognized.
lust kernel: est: cpu_vendor GenuineIntel, msr 61a0a2006000a20
lust kernel: acd0: DVDROM <ATAPI DVD D DH16D3P/1P52> at ata2-master UDMA33
lust kernel: ad6: 953869MB <Seagate ST31000333AS CC1F> at ata3-master SATA300
lust kernel: ad7: 953869MB <Seagate ST31000333AS CC1F> at ata3-slave SATA300
lust kernel: ad8: 953869MB <Seagate ST31000333AS CC1F> at ata4-master SATA300
lust kernel: ad9: 953869MB <Seagate ST31000333AS CC1F> at ata4-slave SATA300

atacontrol list:
ATA channel 2:
    Master: acd0 <ATAPI DVD D DH16D3P/1P52> ATA/ATAPI revision 7
    Slave:       no device present
ATA channel 3:
    Master:  ad6 <ST31000333AS/CC1F> Serial ATA II
    Slave:   ad7 <ST31000333AS/CC1F> Serial ATA II
ATA channel 4:
    Master:  ad8 <ST31000333AS/CC1F> Serial ATA II
    Slave:   ad9 <ST31000333AS/CC1F> Serial ATA II
ATA channel 5:
    Master:      no device present
    Slave:       no device present
ATA channel 6:
    Master:      no device present
    Slave:       no device present

pciconf -lv (partial):
pcib3@pci0:0:30:0:      class=0x060401 card=0xb88015d9 chip=0x244e8086 rev=0x90 hdr=0x01
    vendor     = 'Intel Corporation'
    device     = '82801 Family (ICH2/3/4/4/5/5/6/7/8/9,63xxESB) Hub Interface to PCI Bridge'
    class      = bridge
    subclass   = PCI-PCI
isab0@pci0:0:31:0:      class=0x060100 card=0xb88015d9 chip=0x3a188086 rev=0x00 hdr=0x00
    vendor     = 'Intel Corporation'
    class      = bridge
    subclass   = PCI-ISA
atapci1@pci0:0:31:2:    class=0x01018f card=0xb88015d9 chip=0x3a208086 rev=0x00 hdr=0x00
    vendor     = 'Intel Corporation'
    class      = mass storage
    subclass   = ATA
atapci2@pci0:0:31:5:    class=0x010185 card=0xb88015d9 chip=0x3a268086 rev=0x00 hdr=0x00
    vendor     = 'Intel Corporation'
    class      = mass storage
    subclass   = ATA
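In pciconf -l output, chip packs the device ID into the upper 16 bits and the vendor ID into the lower 16 bits (0x8086 is Intel's vendor ID). card does the same for the subsystem IDs. class packs the class, subclass and programming-interface bytes. The sketch below splits those fields, assuming that layout. The helper names are hypothetical:

```sh
#!/bin/sh
# Hypothetical helpers to split the packed fields printed by pciconf -l.
# chip/card: high 16 bits = (sub)device ID, low 16 bits = (sub)vendor ID.
# class: class << 16 | subclass << 8 | programming interface.
split_id() {
    printf 'vendor=0x%04x device=0x%04x\n' \
        $(( $1 & 0xffff )) $(( ($1 >> 16) & 0xffff ))
}

split_class() {
    printf 'class=0x%02x subclass=0x%02x progif=0x%02x\n' \
        $(( ($1 >> 16) & 0xff )) $(( ($1 >> 8) & 0xff )) $(( $1 & 0xff ))
}

split_id 0x3a208086      # atapci1 chip
split_id 0x3a268086      # atapci2 chip
split_class 0x01018f     # atapci1 class
split_class 0x010185     # atapci2 class
```

Both SATA functions decode to class 0x01 (mass storage) and subclass 0x01 (IDE), which matches pciconf's "mass storage / ATA". They differ in the device ID (0x3a20 vs 0x3a26) and in the programming-interface byte (0x8f vs 0x85).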
_______________________________________________
freebsd-current@... mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscribe@..."

Re: SATA DMA errors on second ICH10 bus

by Dylan Alex Simon

> FreeBSD lust.cns.nyu.edu 8.0-CURRENT FreeBSD 8.0-CURRENT #0: Wed Jan 14 19:58:58 EST 2009     dylan@...:/usr/obj/usr/src/sys/SIN  amd64

Sorry, I'd meant to enable verbose messages before:

verbose dmesg (partial):
lust kernel: atapci0: <ITE IT8213F UDMA133 controller> port 0xec00-0xec07,0xe880-0xe883,0xe800-0xe807,0xe480-0xe483,0xe400-0xe40f irq 22 at device 4.0 on pci3
lust kernel: atapci0: Reserved 0x10 bytes for rid 0x20 type 4 at 0xe400
lust kernel: ioapic0: routing intpin 22 (PCI IRQ 22) to vector 53
lust kernel: atapci0: [MPSAFE]
lust kernel: atapci0: [ITHREAD]
lust kernel: ata2: <ATA channel 0> on atapci0
lust kernel: atapci0: Reserved 0x8 bytes for rid 0x10 type 4 at 0xec00
lust kernel: atapci0: Reserved 0x4 bytes for rid 0x14 type 4 at 0xe880
lust kernel: ata2: reset tp1 mask=03 ostat0=50 ostat1=00
lust kernel: ata2: stat0=0x00 err=0x01 lsb=0x14 msb=0xeb
lust kernel: ata2: stat1=0x00 err=0x00 lsb=0x00 msb=0x00
lust kernel: ata2: reset tp2 stat0=00 stat1=00 devices=0x10000
lust kernel: ata2: [MPSAFE]
lust kernel: ata2: [ITHREAD]
lust kernel: pci3: <serial bus, FireWire> at device 8.0 (no driver attached)
lust kernel: isab0: <PCI-ISA bridge> at device 31.0 on pci0
lust kernel: isa0: <ISA bus> on isab0
lust kernel: atapci1: <Intel ICH10 SATA300 controller> port 0xc400-0xc407,0xc080-0xc083,0xc000-0xc007,0xbc00-0xbc03,0xb880-0xb88f,0xb800-0xb80f irq 19 at device 31.2 on pci0
lust kernel: atapci1: Reserved 0x10 bytes for rid 0x20 type 4 at 0xb880
lust kernel: atapci1: [MPSAFE]
lust kernel: atapci1: [ITHREAD]
lust kernel: atapci1: Reserved 0x10 bytes for rid 0x24 type 4 at 0xb800
lust kernel: ata3: <ATA channel 0> on atapci1
lust kernel: atapci1: Reserved 0x8 bytes for rid 0x10 type 4 at 0xc400
lust kernel: atapci1: Reserved 0x4 bytes for rid 0x14 type 4 at 0xc080
lust kernel: ata3: reset tp1 mask=03 ostat0=50 ostat1=50
lust kernel: ata3: stat0=0x50 err=0x01 lsb=0x00 msb=0x00
lust kernel: ata3: stat1=0x50 err=0x01 lsb=0x00 msb=0x00
lust kernel: ata3: reset tp2 stat0=50 stat1=50 devices=0x3
lust kernel: ata3: [MPSAFE]
lust kernel: ata3: [ITHREAD]
lust kernel: ata4: <ATA channel 1> on atapci1
lust kernel: atapci1: Reserved 0x8 bytes for rid 0x18 type 4 at 0xc000
lust kernel: atapci1: Reserved 0x4 bytes for rid 0x1c type 4 at 0xbc00
lust kernel: ata4: reset tp1 mask=03 ostat0=50 ostat1=50
lust kernel: ata4: stat0=0x50 err=0x01 lsb=0x00 msb=0x00
lust kernel: ata4: stat1=0x50 err=0x01 lsb=0x00 msb=0x00
lust kernel: ata4: reset tp2 stat0=50 stat1=50 devices=0x3
lust kernel: ata4: [MPSAFE]
lust kernel: ata4: [ITHREAD]
lust kernel: pci0: <serial bus, SMBus> at device 31.3 (no driver attached)
lust kernel: atapci2: <Intel ICH10 SATA300 controller> port 0xb400-0xb407,0xb080-0xb083,0xb000-0xb007,0xac00-0xac03,0xa880-0xa88f,0xa800-0xa80f irq 19 at device 31.5 on pci0
lust kernel: atapci2: Reserved 0x10 bytes for rid 0x20 type 4 at 0xa880
lust kernel: atapci2: [MPSAFE]
lust kernel: atapci2: [ITHREAD]
lust kernel: atapci2: Reserved 0x10 bytes for rid 0x24 type 4 at 0xa800
lust kernel: ata5: <ATA channel 0> on atapci2
lust kernel: atapci2: Reserved 0x8 bytes for rid 0x10 type 4 at 0xb400
lust kernel: atapci2: Reserved 0x4 bytes for rid 0x14 type 4 at 0xb080
lust kernel: ata5: reset tp1 mask=03 ostat0=7f ostat1=7f
lust kernel: ata5: stat0=0x7f err=0xff lsb=0xff msb=0xff
lust kernel: ata5: stat1=0x7f err=0xff lsb=0xff msb=0xff
lust kernel: ata5: reset tp2 stat0=ff stat1=ff devices=0x0
lust kernel: ata5: [MPSAFE]
lust kernel: ata5: [ITHREAD]
lust kernel: ata6: <ATA channel 1> on atapci2
lust kernel: atapci2: Reserved 0x8 bytes for rid 0x18 type 4 at 0xb000
lust kernel: atapci2: Reserved 0x4 bytes for rid 0x1c type 4 at 0xac00
lust kernel: ata6: reset tp1 mask=03 ostat0=7f ostat1=7f
lust kernel: ata6: stat0=0x7f err=0xff lsb=0xff msb=0xff
lust kernel: ata6: stat1=0x7f err=0xff lsb=0xff msb=0xff
lust kernel: ata6: reset tp2 stat0=ff stat1=ff devices=0x0
lust kernel: ata6: [MPSAFE]
lust kernel: ata6: [ITHREAD]
lust kernel: ata2: identify ch->devices=00010000
lust kernel: ata2-master: pio=PIO4 wdma=WDMA2 udma=UDMA33 cable=40 wire
lust kernel: acd0: setting PIO4 on IT8213F chip
lust kernel: acd0: setting UDMA33 on IT8213F chip
lust kernel: acd0: <ATAPI DVD D DH16D3P/1P52> DVDROM drive at ata2 as master
lust kernel: acd0: read 8268KB/s (8268KB/s), 198KB buffer, UDMA33
lust kernel: acd0: Reads: CDR, CDRW, CDDA stream, DVDROM, DVDR, DVDRAM, packet
lust kernel: acd0: Writes:
lust kernel: acd0: Audio: play, 256 volume levels
lust kernel: acd0: Mechanism: ejectable tray, unlocked
lust kernel: acd0: Medium: no/blank disc
lust kernel: ata3: identify ch->devices=00000003
lust kernel: ata3-master: pio=PIO4 wdma=WDMA2 udma=UDMA133 cable=40 wire
lust kernel: ata3-slave: pio=PIO4 wdma=WDMA2 udma=UDMA133 cable=40 wire
lust kernel: ad6: 953869MB <Seagate ST31000333AS CC1F> at ata3-master SATA300
lust kernel: ad6: 1953525168 sectors [1938021C/16H/63S] 16 sectors/interrupt 1 depth queue
lust kernel: GEOM: new disk ad6
lust kernel: ad7: 953869MB <Seagate ST31000333AS CC1F> at ata3-slave SATA300
lust kernel: ad7: 1953525168 sectors [1938021C/16H/63S] 16 sectors/interrupt 1 depth queue
lust kernel: ata4: identify ch->devices=00000003
lust kernel: ata4-master: pio=PIO4 wdma=WDMA2 udma=UDMA133 cable=40 wire
lust kernel: GEOM: new disk ad7
lust kernel: ata4-slave: pio=PIO4 wdma=WDMA2 udma=UDMA133 cable=40 wire
lust kernel: ad8: 953869MB <Seagate ST31000333AS CC1F> at ata4-master SATA300
lust kernel: ad8: 1953525168 sectors [1938021C/16H/63S] 16 sectors/interrupt 1 depth queue
lust kernel: GEOM: new disk ad8
lust kernel: ad9: 953869MB <Seagate ST31000333AS CC1F> at ata4-slave SATA300
lust kernel: ad9: 1953525168 sectors [1938021C/16H/63S] 16 sectors/interrupt 1 depth queue
lust kernel: ata5: identify ch->devices=00000000
lust kernel: ata6: identify ch->devices=00000000
lust kernel: ioapic0: Assigning ISA IRQ 1 to local APIC 0
lust kernel: ioapic0: Assigning ISA IRQ 9 to local APIC 1
lust kernel: ioapic0: Assigning PCI IRQ 17 to local APIC 0
lust kernel: ioapic0: Assigning PCI IRQ 18 to local APIC 1
lust kernel: ioapic0: Assigning PCI IRQ 19 to local APIC 0
lust kernel: ioapic0: Assigning PCI IRQ 22 to local APIC 1
lust kernel: ioapic0: Assigning PCI IRQ 23 to local APIC 0
lust kernel: GEOM: new disk ad9
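The verbose probe lines report which devices each channel found as a bitmask. The reset lines print it with a 0x prefix (devices=0x3), and the identify lines print it as eight bare hex digits (ch->devices=00000003). The sketch below decodes that mask. The bit layout is an assumption inferred only from this log: 0x3 appears on the two channels with two disks each, 0x10000 on the channel with the ATAPI DVD master, and 0x0 on the empty channels. The ATAPI-slave bit is a guess by analogy:

```sh
#!/bin/sh
# Hypothetical decoder for the "devices" bitmask in verbose ata(4) output.
# ASSUMED bits, inferred from this log: 0x1 ATA master, 0x2 ATA slave,
# 0x10000 ATAPI master; 0x20000 ATAPI slave is a guess by analogy.
decode_devices() {
    hex=${1#0x}            # accept "0x3" (reset lines) or "00000003" (identify lines)
    mask=$(( 0x$hex ))
    out=""
    [ $(( mask & 0x1 )) -ne 0 ] && out="$out ata-master"
    [ $(( mask & 0x2 )) -ne 0 ] && out="$out ata-slave"
    [ $(( mask & 0x10000 )) -ne 0 ] && out="$out atapi-master"
    [ $(( mask & 0x20000 )) -ne 0 ] && out="$out atapi-slave"
    if [ -z "$out" ]; then
        echo none
    else
        echo "${out# }"    # drop the leading separator space
    fi
}

decode_devices 0x3        # ata3, ata4: ad6/ad7 and ad8/ad9
decode_devices 00010000   # ata2: acd0
decode_devices 0x0        # ata5, ata6: nothing attached
```

Stripping an optional 0x prefix and re-adding it avoids a trap: without it, the shell would read the identify lines' zero-padded "00010000" as an octal number.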


Re: SATA DMA errors on second ICH10 bus

by Nenhum_de_Nos-2

This may be related; I found it via vr-zone.com:
http://www.theinquirer.net/inquirer/news/374/1050374/seagate-barracudas-7200-11-failing

matheus

--
We will call you cygnus,
The God of balance you shall be


Re: SATA DMA errors on second ICH10 bus

by jkc120

On Sun, Jan 18, 2009 at 8:29 PM, Nenhum_de_Nos <matheus@...> wrote:
> may be related to this, I found on vr-zone.com:
> http://www.theinquirer.net/inquirer/news/374/1050374/seagate-barracudas-7200-11-failing

Quite possible. I have two ST31000340AS drives with the bad SD15
firmware and it threw a bunch of DMA errors last night, and SMART
shows:

Error 2 occurred at disk power-on lifetime: 366 hours (15 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 42 53 d7 0d  Error: UNC at LBA = 0x0dd75342 = 232215362

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 ff 52 d7 4d 00   5d+06:34:09.759  READ DMA
  35 00 00 ff ff ff 4f 00   5d+06:34:09.741  WRITE DMA EXT
  c8 00 00 ff 51 d7 4d 00   5d+06:34:09.687  READ DMA
  35 00 00 ff ff ff 4f 00   5d+06:34:09.633  WRITE DMA EXT
  c8 00 00 ff 50 d7 4d 00   5d+06:34:09.605  READ DMA

Error 1 occurred at disk power-on lifetime: 231 hours (9 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff 4f 00   8d+19:11:08.269  READ DMA EXT
  25 00 00 ff ff ff 4f 00   8d+19:11:08.267  READ DMA EXT
  25 00 00 ff ff ff 4f 00   8d+19:11:08.261  READ DMA EXT
  25 00 00 ff ff ff 4f 00   8d+19:11:08.260  READ DMA EXT
  25 00 00 ff ff ff 4f 00   8d+19:11:08.257  READ DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%       368         232215362
# 2  Short offline       Completed: read failure       90%       368         232215362
# 3  Short offline       Completed: read failure       90%       367         232215362
# 4  Short offline       Completed without error       00%       224         -
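The "UNC at LBA" values in the error log come from the register bytes printed just before them. For a 28-bit address, SN holds bits 0-7, CL bits 8-15, CH bits 16-23, and the low nibble of DH bits 24-27. The upper nibble of DH is not part of the address. The sketch below rebuilds the LBA from those four bytes. lba28 is a hypothetical helper, and it takes the bytes in the order smartctl prints them:

```sh
#!/bin/sh
# Hypothetical helper: rebuild a 28-bit LBA from the SN, CL, CH and DH
# register bytes of a smartctl error-log entry (hex, without 0x).
# Only the low nibble of DH carries address bits.
lba28() {
    sn=$(( 0x$1 )); cl=$(( 0x$2 )); ch=$(( 0x$3 )); dh=$(( 0x$4 ))
    echo $(( (dh & 0x0f) << 24 | ch << 16 | cl << 8 | sn ))
}

lba28 42 53 d7 0d   # Error 2: prints 232215362
lba28 ff ff ff 0f   # Error 1: prints 268435455
```

Error 2 decodes to the same LBA that the self-test log reports as the first read failure. Error 1's 0x0fffffff is the largest value 28 bits can hold. Because the commands around it were 48-bit READ DMA EXT requests, that entry probably does not show the true failing address and is best treated with caution.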

I've got 2 WD black drives on order to replace these two ST31000340AS,
and originally my intention was to use them for separate filesystems,
but I think I'll gmirror them now, to be safe(er).

Josh

Re: SATA DMA errors on second ICH10 bus

by Dylan Alex Simon

From Josh Carroll <josh.carroll@...>, Tue, Jan 20, 2009 at 09:41:06AM -0500:
> On Sun, Jan 18, 2009 at 8:29 PM, Nenhum_de_Nos <matheus@...> wrote:
> > may be related to this, I found on vr-zone.com:
> > http://www.theinquirer.net/inquirer/news/374/1050374/seagate-barracudas-7200-11-failing
>
> Quite possible. I have two ST31000340AS drives with the bad SD15
> firmware and it threw a bunch of DMA errors last night, and smart
> shows:

In this case I don't think that's the cause.  They do have the bad firmware
(which I'll update and test again as soon as Seagate provides a working fix),
but SMART reports no errors on any disk, and all self-tests passed.  Linux has
been running on this machine for a while now under similar conditions with no
errors at all.  Also, there's a similar report with WD disks on ICH7.  (I've
opened kern/130726 for this.)

:-Dylan

Re: SATA DMA errors on second ICH10 bus

by Oliver Fromme

Josh Carroll wrote:
 > I've got 2 WD black drives on order to replace these two ST31000340AS,
 > and originally my intention was to use them for separate filesystems,
 > but I think I'll gmirror them now, to be safe(er).

Some people recommend using drives from different vendors
for the components of a disk mirror, to reduce the
likelihood that both drives will fail at about the same
time due to a firmware bug or similar.

That advice seems to be particularly valuable given the
current firmware problems that particular Seagate disks
are exhibiting.

For example, I've got these in a server:

# atacontrol list | grep ad
    Master:  ad0 <SAMSUNG HD160JJ/WU100-41> Serial ATA II
    Master:  ad1 <ST3160811AS/3.AAE> Serial ATA v1.0
# diskinfo ad0 ad1
ad0     512     160041885696    312581808       310101  16      63
ad1     512     160041885696    312581808       310101  16      63
# gmirror status
      Name    Status  Components
mirror/gm0  COMPLETE  ad0
                      ad1
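This sketch reads diskinfo's terse output as name, sector size, media size in bytes, media size in sectors, and then firmware cylinders, heads and sectors per track. That reading is my interpretation of diskinfo(8), and the manual page for the installed version should be checked. If it is right, the columns should agree with each other. The same check applies to the 1 TB disks in the verbose dmesg earlier in the thread (1953525168 sectors, reported as 1938021C/16H/63S and 953869MB). check_geometry is a hypothetical helper:

```sh
#!/bin/sh
# Hypothetical helper: cross-check disk size fields.
# bytes = sectors * sector size; MiB = bytes / 2^20;
# cylinders * heads * sectors-per-track should cover the sector count.
check_geometry() {
    secsize=$1; sectors=$2; cyl=$3; heads=$4; spt=$5
    bytes=$(( sectors * secsize ))
    mib=$(( bytes / 1048576 ))
    chs=$(( cyl * heads * spt ))
    echo "bytes=$bytes MiB=$mib chs_sectors=$chs"
}

check_geometry 512 312581808 310101 16 63     # ad0/ad1 from diskinfo
check_geometry 512 1953525168 1938021 16 63   # ad6-ad9 from verbose dmesg
```

For both disk models the C/H/S product equals the sector count exactly. The second line reproduces the 953869MB that the ata driver printed, so its "MB" means 2^20 bytes. Oliver's two disks come from different vendors but have identical sector counts, so neither mirror component wastes space.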

Best regards
   Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Commercial register: Munich registry court, HRA 74606,  management:
secnetix Verwaltungsgesellsch. mbH, commercial register: Munich registry
court, HRB 125758,  managing directors: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD services, products and more:  http://www.secnetix.de/bsd

"anyone new to programming should be kept as far from C++ as
possible;  actually showing the stuff should be considered a
criminal offence" -- Jacek Generowicz

Re: SATA DMA errors on second ICH10 bus

by Dylan Alex Simon

> That advice seems to be particularly valuable given the
> current firmware problems that particular Seagate disks
> are exhibiting.

I've confirmed with Seagate and others that the firmware these disks already
have (CC1F) is not affected by the firmware problems.  The instability (as
described in kern/130726) continues with a kernel from today.  I've traced it
down to simultaneous access to disks on multiple channels: that triggers it
reliably, and nothing else does (access to any pair of disks on the same
channel works fine).  If anyone has any suggestions or any other data I should
collect, let me know, as I will have to put these machines into production
shortly (without FreeBSD, unfortunately).
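The trigger described above is simultaneous access to disks on two different channels. One way to exercise it is to read from one disk on each channel at the same time. The sketch below does that and is read-only, so it does not modify the disks. The device names (ad6 on ata3, ad8 on ata4) come from this machine. The helper name, block size and count are arbitrary assumptions:

```sh
#!/bin/sh
# Hypothetical helper: read COUNT 64 KiB blocks from two devices at the
# same time, so both channels are busy concurrently. Read-only: output
# goes to /dev/null. Returns nonzero if either dd fails.
concurrent_read() {
    dev_a=$1
    dev_b=$2
    count=$3
    dd if="$dev_a" of=/dev/null bs=65536 count="$count" 2>/dev/null &
    pid_a=$!
    dd if="$dev_b" of=/dev/null bs=65536 count="$count" 2>/dev/null &
    pid_b=$!
    status=0
    wait "$pid_a" || status=1
    wait "$pid_b" || status=1
    return "$status"
}

# Example (as root on the affected machine, about 1 GiB per disk):
#   concurrent_read /dev/ad6 /dev/ad8 16384 && echo ok || echo failed
```

Running the same helper on two disks of one channel (e.g. ad6 and ad7) gives the same-channel comparison case.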

:-Dylan

Re: SATA DMA errors on second ICH10 bus

by Christoph Mallon

Dylan Alex Simon wrote:

>> That advice seems to be particularly valuable given the
>> current firmware problems that particular Seagate disks
>> are exhibiting.
>
> I've confirmed with Seagate and others that the firmware these disks already
> have (CC1F) is not affected by the firmware problems.  The instability (as
> described in kern/130726) continues with a kernel from today.  I've traced it
> down to exclusively and reliably being caused by access to disks on multiple
> channels simultaneously (access to any pair of disks on the same channel works
> fine).  If anyone has any suggestions or any other data I should collect let
> me know as I will have to put these machines into production shortly (without
> freebsd unfortunately).

I suspect I see the same problem with an NVIDIA SATA controller. If
there is high load on both channels of one controller, I get exactly
the errors you showed.
Your kernel does not use INVARIANTS, is this correct? Otherwise you
should see a very specific panic caused by a KASSERT(). I analysed the
problem a bit. You can see my findings in the thread "Question about
panic in brelse()".
I suspect a hardware bug plus incorrect error handling in the driver in
FreeBSD. As a workaround, I suggest connecting each disk to a separate
controller, provided you don't have more disks than controllers.

Re: SATA DMA errors on second ICH10 bus

by Dylan Alex Simon

> I suspect I see the same problem with some nvidia SATA controller. If  
> there is high load on both channels of one controller, there are exactly  
> the errors you showed.
> Your kernel does not use INVARIANTS, is this correct? Otherwise you  
> should see a very specific panic caused by a KASSERT(). I analysed the  
> problem a bit. You can see my findings in the thread "Question about  
> panic in brelse()".
> I suspect a hardware bug plus incorrect error handling in the driver in  
> FreeBSD. As a workaround, I suggest you connect each disk to a separate  
> controller - if you have not more disks than controllers.

When I do turn INVARIANTS on I ultimately get a number of different failures,
depending on what sort of operation I'm doing.  I think I've seen the brelse
panic you mentioned but not recently.  Here's one from today doing cp on ufs:

ad0: FAILURE - load data
ad0: setting up DMA failed
g_vfs_done():ad0s1e[READ(offset=1843986432, length=65536)]error = 5
vnode_pager_getpages: I/O read error
vm_fault: pager read error, pid 819 (cp)
kernel trap 9 with interrupts disabled

Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer     = 0x8:0xffffffff802ae9fe
stack pointer           = 0x10:0xfffffffeb61bfae0
frame pointer           = 0x10:0xfffffffeb61bfb00
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = resume, IOPL = 0
current process         = 12 (irq14: ata0)
lock order reversal: (Giant after non-sleepable)
 1st 0xffffffff80628750 bio queue (bio queue) @ /usr/src/sys/geom/geom_io.c:68
 2nd 0xffffffff8062b8c0 Giant (Giant) @ /usr/src/sys/dev/kbdmux/kbdmux.c:1044
KDB: stack backtrace:
panic: mutex Giant not owned at /usr/src/sys/kern/tty_ttydisc.c:1127
cpuid = 0

I certainly agree that there are some problems in error handling, but I'm more
concerned about the underlying problem causing the errors.

:-Dylan