Strange network hang on Poweredge 860

Hello all,
I've been experiencing a very strange mode of failure which has me scratching my head, so I figured I'd ask here to see if anybody had seen something like this before.

I have installed NetBSD 3.1 on a brand new Dell PowerEdge 860 system (dual core P4 Xeon, 4GB RAM, 2 SATA drives in software RAID using raidframe raid1).

This system is in line to (once stable) replace an aging and slow box and take over POP, SMTP, DHCP, and secure login services for a decent sized pool of users. I cloned the old system from backups (using restore), put the GENERIC.MP kernel in place, and changed its hostname and IP. I also turned off dhcpd (so as not to stomp on the live server), and let it run for a few weeks (logging in and using it from time to time, testing out patches and doing general system stuff). It was rock solid and very stable.

So, we replaced the old system with our fancy new one, and four hours into operation, things get weird. The system is still running, everything seems okay, nothing unexpected or unpleasant in syslog, but the NIC is kaput. It sees link, seems to be okay, but it won't accept or make connections, pings, or any other network traffic.

On speculation, we tried again with the non-MP kernel (just the i386 GENERIC) and it did it again, four hours into operation. We added another NIC (a RealTek NIC, re0) and tried again using re0 as our primary NIC, figuring different card, different driver, maybe it'll work. Nope. Not only did it hang up, but after the network hung up, I tried to bring bge0 up to see if _it_ could talk, and it seemed to be stuck too. (It's worth noting that they share an IRQ. Not sure if this has anything to do with it.)

So we put the old system back up, and pulled the new one (the 860) back into testing, but so far I have not been able to duplicate the failure. My first shot was to run stress and keep the system busy, but it passed that test with flying colors.
Last night I ran it all night answering pointless login sessions. I made a script to SSH in and execute a bunch of representative user activities on several test user accounts (stuff like reading the mail spool, sleeping, copying files, grepping logs, forwarding ports via SSH, etc.) and let it run under about the same load, number of users, etc. as our crash condition, and it still has not crashed. There are a couple of things I am not simulating at the moment: dhcpd, sendmail, and I'm not NFS mounting home directories with amd, but aside from that it is pretty darn close to the real running configuration.

Has anybody seen this before, or does anybody have a good hunch about what I can do to duplicate the failure? Once I can duplicate it "in captivity" it will be easier to debug, and easier to correct, but I would love to be able to duplicate it without putting it up live and letting it crash, because that is not only a lot of work, but it inconveniences users who need to use the system.

Thanks for any insights, I'm tearing my hair out =:-/

-Lars Friend

PS: I have included the output of dmesg in case that sheds any light:

NetBSD 3.1 (GENERIC) #0: Tue Oct 31 04:27:07 UTC 2006
builds@...:/home/builds/ab/netbsd-3-1-RELEASE/i386/200610302053Z-obj/home/builds/ab/netbsd-3-1-RELEASE/src/sys/arch/i386/compile/GENERIC
total memory = 3583 MB
avail memory = 3498 MB
BIOS32 rev. 0 found at 0xffe90
mainbus0 (root)
cpu0 at mainbus0: (uniprocessor)
cpu0: Intel Pentium Pro, II or III (686-class), 2400.18 MHz, id 0x6f6
cpu0: features bfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features bfebfbff<PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX>
cpu0: features bfebfbff<FXSR,SSE,SSE2,SS,HTT,TM,SBF>
cpu0: features2 e3bd<SSE3,MONITOR,DS-CPL,VMX,EST,TM2,xTPR>
cpu0: "Intel(R) Xeon(R) CPU 3060 @ 2.40GHz"
cpu0: I-cache 32 KB 64B/line 8-way, D-cache 32 KB 64B/line 8-way
cpu0: running without thermal monitor!
cpu0: Enhanced SpeedStep disabled by BIOS
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0
pchb0: Intel product 0x2778 (rev. 0x00)
ppb0 at pci0 dev 1 function 0: Intel product 0x2779 (rev. 0x00)
pci1 at ppb0 bus 1
pci1: i/o space, memory space enabled, rd/line, wr/inv ok
ppb1 at pci0 dev 28 function 0: Intel 82801GB/GR PCI Express Port #1 (rev. 0x01)
pci2 at ppb1 bus 2
pci2: i/o space, memory space enabled, rd/line, wr/inv ok
ppb2 at pci2 dev 0 function 0: Intel product 0x032c (rev. 0x09)
pci3 at ppb2 bus 3
pci3: i/o space, memory space enabled, rd/line, wr/inv ok
re0 at pci3 dev 2 function 0: RealTek 8169S Single-chip Gigabit Ethernet
re0: interrupting at irq 5
re0: Ethernet address 00:14:6c:cb:68:dc
re0: using 256 tx descriptors
ukphy0 at re0 phy 7: Generic IEEE 802.3u media interface
ukphy0: RTL8169S/8110S 1000BASE-T media interface (OUI 0x00e04c, model 0x0011), rev. 0
ukphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
ppb3 at pci0 dev 28 function 4: Intel 82801GB/GR PCI Express Port #5 (rev. 0x01)
pci4 at ppb3 bus 4
pci4: i/o space, memory space enabled, rd/line, wr/inv ok
bge0 at pci4 dev 0 function 0: Broadcom BCM5721 Gigabit Ethernet
bge0: interrupting at irq 3
bge0: PCI-Express DMA setting 0x76180000, expected 0x76180000
bge0: ASIC BCM5751 A1 (0x4101), Ethernet address 00:19:b9:f7:47:a2
bge0: setting short Tx thresholds
brgphy0 at bge0 phy 1: BCM5750 1000BASE-T media interface, rev. 0
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
ppb4 at pci0 dev 28 function 5: Intel 82801GB/GR PCI Express Port #6 (rev. 0x01)
pci5 at ppb4 bus 5
pci5: i/o space, memory space enabled, rd/line, wr/inv ok
bge1 at pci5 dev 0 function 0: Broadcom BCM5721 Gigabit Ethernet
bge1: interrupting at irq 11
bge1: PCI-Express DMA setting 0x76180000, expected 0x76180000
bge1: ASIC BCM5751 A1 (0x4101), Ethernet address 00:19:b9:f7:47:a3
bge1: setting short Tx thresholds
brgphy1 at bge1 phy 1: BCM5750 1000BASE-T media interface, rev. 0
brgphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
uhci0 at pci0 dev 29 function 0: Intel 82801GB/GR USB UHCI Controller (rev. 0x01)
uhci0: interrupting at irq 11
usb0 at uhci0: USB revision 1.0
uhub0 at usb0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1 at pci0 dev 29 function 1: Intel 82801GB/GR USB UHCI Controller (rev. 0x01)
uhci1: interrupting at irq 10
usb1 at uhci1: USB revision 1.0
uhub1 at usb1
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2 at pci0 dev 29 function 2: Intel 82801GB/GR USB UHCI Controller (rev. 0x01)
uhci2: interrupting at irq 6
usb2 at uhci2: USB revision 1.0
uhub2 at usb2
uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
ehci0 at pci0 dev 29 function 7: Intel 82801GB/GR USB EHCI Controller (rev. 0x01)
ehci0: interrupting at irq 11
ehci0: BIOS has given up ownership
ehci0: EHCI version 1.0
ehci0: wrong number of companions (7 != 3)
ehci0: companion controllers, 2 ports each: uhci0 uhci1 uhci2
usb3 at ehci0: USB revision 2.0
uhub3 at usb3
uhub3: Intel EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub3: single transaction translator
uhub3: 6 ports with 6 removable, self powered
ppb5 at pci0 dev 30 function 0: Intel 82801BA Hub-PCI Bridge (rev. 0xe1)
pci6 at ppb5 bus 6
pci6: i/o space, memory space enabled
vga1 at pci6 dev 5 function 0: ATI Technologies product 0x515e (rev. 0x02)
wsdisplay0 at vga1 kbdmux 1: console (80x25, vt100 emulation)
wsmux1: connecting to wsdisplay0
pcib0 at pci0 dev 31 function 0
pcib0: Intel 82801GB/GR LPC Interface Bridge (rev. 0x01)
piixide0 at pci0 dev 31 function 1
piixide0: Intel 82801GB/GR IDE Controller (ICH7) (rev. 0x01)
piixide0: bus-master DMA support present
piixide0: primary channel configured to compatibility mode
piixide0: primary channel interrupting at irq 14
atabus0 at piixide0 channel 0
piixide0: secondary channel configured to compatibility mode
piixide0: secondary channel ignored (disabled)
piixide1 at pci0 dev 31 function 2
piixide1: Intel 82801GB/GR Serial ATA/Raid Controller (ICH7) (rev. 0x01)
piixide1: bus-master DMA support present
piixide1: primary channel configured to native-PCI mode
piixide1: using irq 11 for native-PCI interrupt
atabus1 at piixide1 channel 0
piixide1: secondary channel configured to native-PCI mode
atabus2 at piixide1 channel 1
Intel 82801GB/GR SMBus Controller (SMBus serial bus, revision 0x01) at pci0 dev 31 function 3 not configured
isa0 at pcib0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard, using wsdisplay0
pcppi0 at isa0 port 0x61
midi0 at pcppi0: PC speaker
sysbeep0 at pcppi0
isapnp0 at isa0 port 0x279: ISA Plug 'n Play device support
npx0 at isa0 port 0xf0-0xff: using exception 16
isapnp0: no ISA Plug 'n Play devices found
Kernelized RAIDframe activated
atapibus0 at atabus0: 2 targets
cd0 at atapibus0 drive 0: <HL-DT-STCD-RW/DVD-ROM GCC-4244N, , B101> cdrom removable
cd0: 32-bit data port
cd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 2 (Ultra/33)
cd0(piixide0:0:0): using PIO mode 4, Ultra-DMA mode 2 (Ultra/33) (using DMA)
uhub4 at uhub3 port 3
uhub4: Cypress Semiconductor USB2 Hub, class 9/0, rev 2.00/0.0b, addr 2
uhub4: multiple transaction translators
uhub4: 4 ports with 4 removable, self powered
wd0 at atabus1 drive 0: <ST3500630NS>
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168 sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd0(piixide1:0:0): using PIO mode 4, Ultra-DMA mode 6 (Ultra/133) (using DMA)
wd1 at atabus2 drive 0: <ST3500630NS>
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168 sectors
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd1(piixide1:1:0): using PIO mode 4, Ultra-DMA mode 6 (Ultra/133) (using DMA)
raid0: RAID Level 1
raid0: Components: /dev/wd0a /dev/wd1a
raid0: Total Sectors: 976772992 (476939 MB)
boot device: raid0
root on raid0a dumps on raid0b
root file system type: ffs
uhidev0 at uhub2 port 1 configuration 1 interface 0
uhidev0: Avocent Avocent USBIAC, rev 1.10/1.00, addr 2, iclass 3/1
ukbd0 at uhidev0
wskbd1 at ukbd0 mux 1
wskbd1: connecting to wsdisplay0
uhidev1 at uhub2 port 1 configuration 1 interface 1
uhidev1: Avocent Avocent USBIAC, rev 1.10/1.00, addr 2, iclass 3/1
uhidev1: 3 report ids
ums0 at uhidev1 reportid 1: 5 buttons and Z dir.
wsmouse0 at ums0 mux 0
uhid0 at uhidev1 reportid 2: input=2, output=0, feature=0
uhid1 at uhidev1 reportid 3: input=1, output=0, feature=0
wsdisplay0: screen 1 added (80x25, vt100 emulation)
wsdisplay0: screen 2 added (80x25, vt100 emulation)
wsdisplay0: screen 3 added (80x25, vt100 emulation)
wsdisplay0: screen 4 added (80x25, vt100 emulation)
Re: Strange network hang on Poweredge 860

Hi, Lars--
On Sep 10, 2007, at 11:34 AM, Lars Friend wrote:
> Hello all,
> I've been experiencing a very strange mode of failure which has me
> scratching my head so I figured I'd ask here to see if anybody had seen
> something like this before.
>
> I have installed NetBSD 3.1 on a brand new Dell PowerEdge 860
> system (dual core P4 Xeon, 4GB ram, 2 SATA drives in software RAID using
> raidframe raid1).
>
> So, we replaced the old system with our fancy new one, and four hours
> into operation, things get weird. The system is still running,
> everything seems okay, nothing unexpected or unpleasant in syslog,
> but the NIC is kaput. It sees link, seems to be okay, but it won't
> accept or make connections, pings, or any other network traffic.
[ ... ]
> Has anybody seen this before, or does anybody have a good hunch about
> what I can do to duplicate the failure? Once I can duplicate it "in
> captivity" it will be easier to debug, and easier to correct, but I
> would love to be able to duplicate it without putting it up live and
> letting it crash because that is not only a lot of work, but it
> inconveniences users who need to use the system.

There were a number of problems with the Broadcom NICs in Dell machines reported on the FreeBSD lists, particularly in conjunction with heavy UDP traffic such as NFS using the default transport. It seems like the NIC would get confused about the state of the transmit and receive buffers (some kind of refcounting problem?), and stop passing traffic entirely, which sounds similar to the problem you've reported.

There were also some initialization issues which tended to occur if the NIC needed to be reset/woken up after entering an ACPI sleep state, doing WOL, or similar. One of their engineers, David Christensen <davidch@...>, has done work to fix them and to improve the diagnostic messages so that better information is reported when the adaptor gets confused.

You might find the threads here:

http://lists.freebsd.org/pipermail/freebsd-net/2007-June/thread.html

...such as "Problems with BCE network adapter (Dell PE2950)" to contain some helpful info and code patches. It seems like the OpenBSD folks have also implemented some fixes and workarounds for PHY bugs in the BCM 575x/578x chipsets, going by:

http://leaf.dragonflybsd.org/mailarchive/commits/2007-05/msg00036.html

Perhaps someone more familiar with the status of the BCM driver in NetBSD could offer more detailed information than I can, but at least you've got a starting point and the name of a Broadcom engineer who has worked on their BSD drivers.

Regards,
--
-Chuck

PS: I wouldn't swap in a RealTek NIC given a choice-- the newer NICs from them aren't bad, but the older ones seemed to be flaky as well; instead I'd try an Intel Fast EtherExpress Pro ("fxp" to me, I think NetBSD calls 'em "wm", though), or the DEC "tulip" 21x4x chips ("dc" or "de" probably?)....
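Since the FreeBSD reports mentioned above involved heavy UDP traffic, one way to try reproducing the hang on the test bench might be to push a stream of UDP datagrams through the suspect interface. The following is only a minimal sketch under stated assumptions: that Python 3 is available on some machine on the test network, and that UDP load is relevant to this particular failure at all, which nothing in the thread confirms. The function name `send_udp_burst`, the payload size, and the command-line arguments are illustrative choices, not anything taken from NetBSD or the bge driver.

```python
import socket
import sys
import time


def send_udp_burst(host, port, count, payload_size=1400, pause=0.0):
    """Send `count` UDP datagrams of `payload_size` bytes to (host, port).

    Returns the number of datagrams the local kernel accepted for sending.
    This says nothing about whether they arrived; UDP gives no such feedback.
    """
    payload = b"x" * payload_size
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for _ in range(count):
            try:
                sock.sendto(payload, (host, port))
                sent += 1
            except OSError:
                # Typically a full local buffer; back off briefly and carry on.
                time.sleep(0.01)
            if pause:
                time.sleep(pause)
    return sent


# Hypothetical usage: python3 udpburst.py <host> <port> <count>
if __name__ == "__main__" and len(sys.argv) == 4:
    target_host, target_port, n = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
    print(send_udp_burst(target_host, target_port, n), "datagrams sent")
```

Running it from a second machine toward the PowerEdge (and in the other direction) while watching the interface counters would show whether sustained UDP load alone is enough to wedge the NIC. If it is not, that at least narrows the search to the services that were not simulated.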
Re: Strange network hang on Poweredge 860

On Mon, Sep 10, 2007 at 02:34:02PM -0400, Lars Friend wrote:
> Hello all,
> I've been experiencing a very strange mode of failure which has me
> scratching my head so I figured I'd ask here to see if anybody had seen
> something like this before.
>
> I have installed NetBSD 3.1 on a brand new Dell PowerEdge 860
> system (dual core P4 Xeon, 4GB ram, 2 SATA drives in software RAID using
> raidframe raid1).
>
> This system is in line to (once stable) replace an aging and slow box
> to take over POP, SMTP, DHCP, and secure login services for a decent
> sized pool of users. I cloned the old system from backups (using restore),
> put the GENERIC.MP kernel in place, and changed its hostname and IP.
> I also turned off dhcpd (so as not to stomp the live server), and let
> it run for a few weeks (logging in and using it from time to time,
> testing out patches and doing general system stuff). It was rock solid
> and very stable.
>
> So, we replaced the old system with our fancy new one, and four hours
> into operation, things get weird. The system is still running,
> everything seems okay, nothing unexpected or unpleasant in syslog,
> but the NIC is kaput. It sees link, seems to be okay, but it won't
> accept or make connections, pings, or any other network traffic.
> [..]

Maybe NMBCLUSTERS is too low? Look at netstat -m / vmstat -m when this happens. You can also try rebuilding a kernel with options NMBCLUSTERS=8192 and see how it goes.

You may also want to try a netbsd-3 kernel; there has been one pullup to if_bge.c since netbsd-3-1-RELEASE.

--
Manuel Bouyer <bouyer@...>
     NetBSD: 26 years of experience will always make the difference
--
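Because the hang takes about four hours to appear and nobody may be at the console when it does, the advice above to look at netstat -m and vmstat -m "when this happens" could be covered by logging both commands to local disk at a fixed interval. What follows is a rough sketch, assuming Python 3.7 or later is installed on the box; the function names, the 60-second interval, and the log path given on the command line are all hypothetical choices. Only the two commands themselves come from the thread.

```python
import subprocess
import sys
import time
from datetime import datetime


def snapshot(commands):
    """Run each command and return one timestamped text block with its output."""
    stamp = datetime.now().isoformat(timespec="seconds")
    parts = []
    for cmd in commands:
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
            output = result.stdout + result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Record the failure instead of stopping the logger.
            output = f"failed to run: {exc}\n"
        parts.append(f"=== {stamp} {' '.join(cmd)} ===\n{output}")
    return "".join(parts)


def log_forever(path, commands, interval):
    """Append a snapshot to `path` every `interval` seconds until killed."""
    while True:
        with open(path, "a") as log:
            log.write(snapshot(commands))
        time.sleep(interval)


# Hypothetical usage: python3 mbuflog.py /var/log/mbuf-watch.log
if __name__ == "__main__" and len(sys.argv) == 2:
    log_forever(sys.argv[1], [["netstat", "-m"], ["vmstat", "-m"]], interval=60)
```

The log is written to a local file on purpose, so that it stays readable from the console after the network has stopped responding. Comparing the last few snapshots before the hang with earlier ones should show whether mbuf cluster usage was climbing toward its limit.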