SMP stability issues

Hi,

For the last couple of months I've been running NetBSD 3.0.1 and 3.1 (since yesterday) on an Abit VP6 SMP motherboard with two P3 866's. The system is mainly used as a mail, web, and Samba server, along with occasional other odd tasks.

When I run off the GENERIC kernel, the machine is rock solid stable. However, when I use either GENERIC.MP or my own kernel (which is basically GENERIC.MP with pcmcia and sound support removed), it invariably locks up after a time running. It is a hard lockup, nothing will revive it other than hitting the reset switch.

The uptime before the lockup has so far varied between about 1 hour and 6 days. There doesn't seem to be any pattern to it, other than the fact that it only happens when running an SMP kernel. I can't find anything in the logs to give any clues.

I'm pretty sure it's not a hardware fault, as I've tested everything I can think of. Added to that, prior to running NetBSD the box ran Linux (in SMP mode) without any problems (uptime was 193 days when I took it down to install NetBSD). The root filesystem is on RAIDFrame, if it makes any difference.

Does anyone have any ideas about what could be causing this, or any troubleshooting clues? Needless to say, it's a very irritating problem.

Thanks in advance,
Chris.
|
|
Re: SMP stability issues

On Fri, 10 Nov 2006, Chris Rendle-Short wrote:
> Does anyone have any ideas about what could be causing this, or any
> troubleshooting clues? Needless to say, it's a very irritating problem.

What kernel are you seeing these instabilities with?

 - Hubert
|
|
Re: SMP stability issues

On Fri, 10 Nov 2006 09:48:16 +0100 (CET), Hubert Feyrer <hubert@...> wrote:
> On Fri, 10 Nov 2006, Chris Rendle-Short wrote:
>> Does anyone have any ideas about what could be causing this, or any
>> troubleshooting clues? Needless to say, it's a very irritating problem.
>
> What kernel are you seeing these instabilities with?
>
>  - Hubert

I am getting the lockups with GENERIC.MP. GENERIC has no problems.

Chris.
|
|
Re: SMP stability issues

On Fri, 10 Nov 2006, Chris Rendle-Short wrote:
>> What kernel are you seeing these instabilities with?
>
> I am getting the lockups with GENERIC.MP. GENERIC has no problems.

From what NetBSD version - latest -current, 3.1, ...?

 - Hubert
|
|
Re: SMP stability issues

On Fri, 10 Nov 2006 10:13:39 +0100 (CET), Hubert Feyrer <hubert@...> wrote:
> On Fri, 10 Nov 2006, Chris Rendle-Short wrote:
>>> What kernel are you seeing these instabilities with?
>>
>> I am getting the lockups with GENERIC.MP. GENERIC has no problems.
>
> From what NetBSD version - latest -current, 3.1, ...?
>
>  - Hubert

I first installed from 3.0.1, and have recently upgraded to 3.1 by following netbsd-3. The stability problem has occurred in both 3.0.1 and 3.1.

Chris.
|
|
Re: SMP stability issues

>>>>> On Fri, 10 Nov 2006 21:10:56 +1100,
	Chris Rendle-Short <jim@...> said:

> I first installed from 3.0.1, and have recently upgraded to 3.1 by
> following netbsd-3. The stability problem has occurred in both 3.0.1
> and 3.1

Have you tried a hardware diagnostic tool like memtest86+?

We recently had a stability issue, and we didn't think of it as a
hardware problem, since the machine had worked fine until we
replaced its kernel. (The machine also ran memtest86+ fine
when it was bought.)
But it actually was a hardware issue in our case. When we finally tried
memtest86+, lots of RAM problems were detected.

FWIW, 3.x kernels run fine with the GENERIC.MPACPI configuration
on my Athlon 64 X2.
-- 
soda
|
|
Re: SMP stability issues

On Fri, 10 Nov 2006 19:22:40 +0900, SODA Noriyuki <soda@...> wrote:
>
> Have you tried a hardware diagnostic tool like memtest86+?
>
> We recently had a stability issue, and we didn't think of it as a
> hardware problem, since the machine had worked fine until we
> replaced its kernel. (The machine also ran memtest86+ fine
> when it was bought.)
> But it actually was a hardware issue in our case. When we finally tried
> memtest86+, lots of RAM problems were detected.
>
> FWIW, 3.x kernels run fine with the GENERIC.MPACPI configuration
> on my Athlon 64 X2.
> --
> soda

Yes, I've run memtest86 and memtest86+ and found no errors. I've also checked all the HDDs with MHDD (http://hddguru.com/content/en/software/2005.10.02-MHDD/), and found no problems.

I haven't tried GENERIC.MPACPI yet. As I understood it, it is mainly intended for dual core systems and systems with HT. As mine doesn't have HT and is physically two separate CPUs, I went with GENERIC.MP.

I have also noticed a SWINGER kernel config file, with an accompanying SWINGER.MP. I'm not too certain why SWINGER is included in the source tree, but it is described as "thorpej's Abit BP6+dual Celeron". Interesting, because the motherboard I am using, an Abit VP6, is the successor to the BP6. I can't find anything special about SWINGER though, except that it is customised to only work on the BP6.

Chris.
|
|
Re: SMP stability issues

Chris Rendle-Short <jim@...> writes:
> I haven't tried GENERIC.MPACPI yet. As I understood it, it is mainly
> intended for dual core systems and systems with HT.

Not really -- it is for machines that have ACPI. In general, I've been
finding of late that many ACPI supporting boxes just don't run right
if you don't use ACPI...

Perry
|
|
Re: SMP stability issues

On Fri, Nov 10, 2006 at 01:16:52PM -0500, Perry E. Metzger wrote:
>
> Chris Rendle-Short <jim@...> writes:
> > I haven't tried GENERIC.MPACPI yet. As I understood it, it is mainly
> > intended for dual core systems and systems with HT.
>
> Not really -- it is for machines that have ACPI. In general, I've been
> finding of late that many ACPI supporting boxes just don't run right
> if you don't use ACPI...

.. to the extent of often needing to build custom INSTALL kernels
with acpi on i386..

Cheers,

Patrick
|
|
Re: SMP stability issues

Patrick Welche <prlw1@...> writes:
> On Fri, Nov 10, 2006 at 01:16:52PM -0500, Perry E. Metzger wrote:
>> Chris Rendle-Short <jim@...> writes:
>> > I haven't tried GENERIC.MPACPI yet. As I understood it, it is mainly
>> > intended for dual core systems and systems with HT.
>>
>> Not really -- it is for machines that have ACPI. In general, I've been
>> finding of late that many ACPI supporting boxes just don't run right
>> if you don't use ACPI...
>
> .. to the extent of often needing to build custom INSTALL kernels
> with acpi on i386..

Indeed. I've had horrible instability trying to do installs without
ACPI a few times.

Anyway, I think the general message is "if your machine has ACPI,
try things with ACPI on if you're having problems -- manufacturers
aren't testing non-ACPI very well any more."

Perry
|
|
Re: SMP stability issues

On Fri, 10 Nov 2006 19:27:22 +1100
Chris Rendle-Short <jim@...> wrote:

> Hi,
>
> For the last couple of months I've been running NetBSD 3.0.1 and 3.1 (since yesterday) on an Abit VP6 SMP motherboard with two P3 866's. The system is mainly used as a mail, web, and Samba server, along with occasional other odd tasks.

Ah, the good old Abit VP6 motherboard. I have one myself. It was very
cheap and stable. At least until the capacitors started to explode.. I
had them replaced but it was never quite as stable as it once had been.
I could not get it to draw graphics without hanging rock solid after a
few seconds. No matter what graphics card I used (or OS). I also tried
switching power supply but it did not help. In textmode it was kind of
ok. Could multi-job compile for several days without showing any
problems.

These days one of the CPUs is doing its duty in another (non-SMP)
motherboard, and the other is resting peacefully in the ever growing
pile of old junk in my home lab..


Best regards,
	Lars Nordlund
|
|
Re: SMP stability issues

On Fri, 10 Nov 2006 22:14:47 +0100, Lars Nordlund <lars.nordlund@...> wrote:
>
> Ah, the good old Abit VP6 motherboard. I have one myself. It was very
> cheap and stable. At least until the capacitors started to explode.. I
> had them replaced but it was never quite as stable as it once had been.
> I could not get it to draw graphics without hanging rock solid after a
> few seconds. No matter what graphics card I used (or OS). I also tried
> switching power supply but it did not help. In textmode it was kind of
> ok. Could multi-job compile for several days without showing any
> problems.
>
> These days one of the CPUs is doing its duty in another (non-SMP)
> motherboard, and the other is resting peacefully in the ever growing
> pile of old junk in my home lab..
>
>
> Best regards,
> 	Lars Nordlund

Yes, I had capacitor problems too. It drove me crazy trying to work out what was going on until I read about the Capacitor Problem (http://www.dashdist.com/1u2u/company/capacitor.html). I replaced them all, and it has been perfectly stable ever since. Well, under Linux anyway (and NetBSD on one CPU).

On Fri, 10 Nov 2006 13:52:55 -0500, "Perry E. Metzger" <perry@...> wrote:
>
>>> Not really -- it is for machines that have ACPI. In general, I've been
>>> finding of late that many ACPI supporting boxes just don't run right
>>> if you don't use ACPI...
>>
>> .. to the extent of often needing to build custom INSTALL kernels
>> with acpi on i386..
>
> Indeed. I've had horrible instability trying to do installs without
> ACPI a few times.
>
> Anyway, I think the general message is "if your machine has ACPI,
> try things with ACPI on if you're having problems -- manufacturers
> aren't testing non-ACPI very well any more."
>
> Perry

Ah, now this is interesting. I did not realise that there were problems running a non-ACPI kernel on an ACPI system.

Thanks for the info guys, I will build GENERIC.MPACPI and see how it goes.

Chris.
|
|
Re: SMP stability issues

On Fri, Nov 10, 2006 at 07:27:22PM +1100, Chris Rendle-Short wrote:
> Hi,
>
> For the last couple of months I've been running NetBSD 3.0.1 and 3.1 (since yesterday) on an Abit VP6 SMP motherboard with two P3 866's. The system is mainly used as a mail, web, and Samba server, along with occasional other odd tasks.
>
> When I run off the GENERIC kernel, the machine is rock solid stable. However, when I use either GENERIC.MP or my own kernel (which is basically GENERIC.MP with pcmcia and sound support removed), it invariably locks up after a time running. It is a hard lockup, nothing will revive it other than hitting the reset switch.
>
> The uptime before the lockup has so far varied between about 1 hour and 6 days. There doesn't seem to be any pattern to it, other than the fact that it only happens when running an SMP kernel. I can't find anything in the logs to give any clues.
>
> I'm pretty sure it's not a hardware fault, as I've tested everything I can think of. Added to that, prior to running NetBSD the box ran Linux (in SMP mode) without any problems (uptime was 193 days when I took it down to install NetBSD). The root filesystem is on RAIDFrame, if it makes any difference.
>
> Does anyone have any ideas about what could be causing this, or any troubleshooting clues? Needless to say, it's a very irritating problem.

What chipset does this motherboard have? Can you post the dmesg?

Also, you could try to build a kernel with DIAGNOSTIC, DEBUG and LOCKDEBUG
options. A hard hang like that could be a deadlock in the kernel;
one of these options may help to find what's going on.

-- 
Manuel Bouyer <bouyer@...>
     NetBSD: 26 years of experience will always make the difference
--
|
|
Re: SMP stability issues

On Sun, 12 Nov 2006 00:11:39 +0100, Manuel Bouyer <bouyer@...> wrote:
> What chipset does this motherboard have? Can you post the dmesg?
>
> Also, you could try to build a kernel with DIAGNOSTIC, DEBUG and LOCKDEBUG
> options. A hard hang like that could be a deadlock in the kernel;
> one of these options may help to find what's going on.
>
> --
> Manuel Bouyer <bouyer@...>
>      NetBSD: 26 years of experience will always make the difference
> --

Well, I just tried running GENERIC.MPACPI like some of the others suggested, however it is still locking up. Here is the dmesg from GENERIC.MPACPI (although it looks like I might need to check my ACPI configuration in the BIOS).

I will also try a kernel with DIAGNOSTIC, DEBUG and LOCKDEBUG enabled as you suggested. Is it likely to matter whether or not ACPI is enabled in the test kernel?

Thanks,
Chris.

NetBSD 3.1_STABLE (GENERIC.MPACPI) #0: Sat Nov 11 10:18:33 EST 2006
	jim@...:/usr/src/sys/arch/i386/compile/GENERIC.MPACPI
total memory = 511 MB
avail memory = 492 MB
BIOS32 rev. 0 found at 0xfb340
mainbus0 (root)
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel Pentium III (686-class), 865.29 MHz, id 0x686
cpu0: features 387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features 387fbff<PGE,MCA,CMOV,PAT,PSE36,PN,MMX>
cpu0: features 387fbff<FXSR,SSE>
cpu0: I-cache 16 KB 32B/line 4-way, D-cache 16 KB 32B/line 4-way
cpu0: L2 cache 256 KB 32B/line 8-way
cpu0: ITLB 32 4 KB entries 4-way, 2 4 MB entries fully associative
cpu0: DTLB 64 4 KB entries 4-way, 8 4 MB entries 4-way
cpu0: serial number 0000-0686-0003-8754-64B0-C402
cpu0: calibrating local timer
cpu0: apic clock running at 133 MHz
cpu0: 8 page colors
cpu1 at mainbus0: apid 1 (application processor)
cpu1: starting
cpu1: Intel Pentium III (686-class), 865.25 MHz, id 0x686
cpu1: features 387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu1: features 387fbff<PGE,MCA,CMOV,PAT,PSE36,PN,MMX>
cpu1: features 387fbff<FXSR,SSE>
cpu1: I-cache 16 KB 32B/line 4-way, D-cache 16 KB 32B/line 4-way
cpu1: L2 cache 256 KB 32B/line 8-way
cpu1: ITLB 32 4 KB entries 4-way, 2 4 MB entries fully associative
cpu1: DTLB 64 4 KB entries 4-way, 8 4 MB entries 4-way
cpu1: serial number 0000-0686-0003-FE6E-375F-F1CD
ioapic0 at mainbus0 apid 2 (I/O APIC)
ioapic0: pa 0xfec00000, version 11, 24 pins
acpi0 at mainbus0
acpi0: using Intel ACPI CA subsystem version 20040211
acpi0: X/RSDT: OemId <VIA694,AWRDACPI,42302e31>, AslId <AWRD,00000000>
acpi0: SCI interrupting at int 9
acpi0: fixed-feature power button present
mpacpi: could not get bus number, assuming bus 0
ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
acpibut0 at acpi0 (PNP0C0C): ACPI Power Button
PNP0C01 [System Board] at acpi0 not configured
PNP0A03 [PCI Bus] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C02 [Plug and Play motherboard register resources] at acpi0 not configured
PNP0000 [AT Interrupt Controller] at acpi0 not configured
PNP0200 [AT DMA Controller] at acpi0 not configured
PNP0100 [AT Timer] at acpi0 not configured
PNP0B00 [AT Real-Time Clock] at acpi0 not configured
PNP0800 [AT-style speaker sound] at acpi0 not configured
npx0 at acpi0 (PNP0C04)
npx0: io 0xf0-0xff irq 13
npx0: using exception 16
com0 at acpi0 (PNP0501-1)
com0: io 0x3f8-0x3ff irq 4
com0: ns16550a, working fifo
com1 at acpi0 (PNP0501-2)
com1: io 0x2f8-0x2ff irq 3
com1: ns16550a, working fifo
lpt0 at acpi0 (PNP0401)
lpt0: io 0x378-0x37f,0x778-0x77b irq 7 drq 3
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0
pchb0: VIA Technologies VT82C691 (Apollo Pro) Host-PCI (rev. 0xc4)
agp0 at pchb0: aperture at 0xd0000000, size 0xf000000
ppb0 at pci0 dev 1 function 0: VIA Technologies VT82C598 (Apollo MVP3) CPU-AGP Bridge (rev. 0x00)
pci1 at ppb0 bus 1
pci1: i/o space, memory space enabled
vga0 at pci1 dev 0 function 0: Silicon Integrated System 6326 AGP VGA (rev. 0x0b)
wsdisplay0 at vga0 kbdmux 1: console (80x25, vt100 emulation)
wsmux1: connecting to wsdisplay0
pcib0 at pci0 dev 7 function 0
pcib0: VIA Technologies VT82C686A PCI-ISA Bridge (rev. 0x40)
viaide0 at pci0 dev 7 function 1
viaide0: VIA Technologies VT82C686A (Apollo KX133) ATA100 controller
viaide0: bus-master DMA support present
viaide0: primary channel configured to compatibility mode
viaide0: primary channel interrupting at ioapic0 pin 14 (irq 14)
atabus0 at viaide0 channel 0
viaide0: secondary channel configured to compatibility mode
viaide0: secondary channel interrupting at ioapic0 pin 15 (irq 15)
atabus1 at viaide0 channel 1
uhci0 at pci0 dev 7 function 2: VIA Technologies VT83C572 USB Controller (rev. 0x16)
uhci0: interrupting at ioapic0 pin 12 (irq 12)
usb0 at uhci0: USB revision 1.0
uhub0 at usb0
uhub0: VIA Technologies UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1 at pci0 dev 7 function 3: VIA Technologies VT83C572 USB Controller (rev. 0x16)
uhci1: interrupting at ioapic0 pin 12 (irq 12)
usb1 at uhci1: USB revision 1.0
uhub1 at usb1
uhub1: VIA Technologies UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
VIA Technologies VT82C686A SMBus Controller (miscellaneous bridge, revision 0x40) at pci0 dev 7 function 4 not configured
adv1 at pci0 dev 9 function 0: AdvanSys ABP-9xxU SCSI adapter
adv1: interrupting at ioapic0 pin 16 (irq 11)
scsibus0 at adv1: 8 targets, 8 luns per target
ex0 at pci0 dev 13 function 0: 3Com 3cSOHO100-TX 10/100 Ethernet (rev. 0x30)
ex0: interrupting at ioapic0 pin 18 (irq 10)
ex0: MAC address 00:04:76:36:cf:be
ex0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, default 10baseT
hptide0 at pci0 dev 14 function 0
hptide0: Triones/Highpoint HPT370 IDE Controller
hptide0: bus-master DMA support present
hptide0: primary channel wired to native-PCI mode
hptide0: using ioapic0 pin 18 (irq 10) for native-PCI interrupt
atabus2 at hptide0 channel 0
hptide0: secondary channel wired to native-PCI mode
atabus3 at hptide0 channel 1
isa0 at pcib0
pcppi0 at isa0 port 0x61
midi0 at pcppi0: PC speaker
sysbeep0 at pcppi0
isapnp0 at isa0 port 0x279: ISA Plug 'n Play device support
isapnp0: no ISA Plug 'n Play devices found
ioapic0: enabling
Kernelized RAIDframe activated
scsibus0: waiting 2 seconds for devices to settle...
wd0 at atabus0 drive 0: <Maxtor 6E030L0>
wd0: drive supports 16-sector PIO transfers, LBA addressing
wd0: 29325 MB, 59582 cyl, 16 head, 63 sec, 512 bytes/sect x 60058656 sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd0(viaide0:0:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA)
wd1 at atabus1 drive 0: <Maxtor 6E030L0>
wd1: drive supports 16-sector PIO transfers, LBA addressing
wd1: 29325 MB, 59582 cyl, 16 head, 63 sec, 512 bytes/sect x 60058656 sectors
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd1(viaide0:1:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA)
wd2 at atabus2 drive 0: <WDC WD800BB-00FJA0>
wd2: drive supports 16-sector PIO transfers, LBA addressing
wd2: 76319 MB, 155061 cyl, 16 head, 63 sec, 512 bytes/sect x 156301488 sectors
wd2: 32-bit data port
wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd3 at atabus2 drive 1: <WDC WD800BB-00JHA0>
wd3: drive supports 16-sector PIO transfers, LBA addressing
wd3: 76319 MB, 155061 cyl, 16 head, 63 sec, 512 bytes/sect x 156301488 sectors
wd3: 32-bit data port
wd3: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd2(hptide0:0:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA)
wd3(hptide0:0:1): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA)
wd4 at atabus3 drive 0: <WDC WD1600JB-00GVC0>
wd4: drive supports 16-sector PIO transfers, LBA48 addressing
wd4: 149 GB, 310101 cyl, 16 head, 63 sec, 512 bytes/sect x 312581808 sectors
wd4: 32-bit data port
wd4: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd4(hptide0:1:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA)
st0 at scsibus0 target 6 lun 0: <ARCHIVE, Python 02635-XXX, 596A> tape removable
st0: drive empty
st0: sync (100.00ns offset 15), 8-bit (10.000MB/s) transfers
raid0: RAID Level 1
raid0: Components: /dev/wd0a /dev/wd1a
raid0: Total Sectors: 60058496 (29325 MB)
boot device: raid0
root on raid0a dumps on raid0b
root file system type: ffs
cpu1: CPU 1 running
wsdisplay0: screen 1 added (80x25, vt100 emulation)
wsdisplay0: screen 2 added (80x25, vt100 emulation)
wsdisplay0: screen 3 added (80x25, vt100 emulation)
wsdisplay0: screen 4 added (80x25, vt100 emulation)
[EOF]
|
|
Re: SMP stability issues

On Sun, Nov 12, 2006 at 01:23:34PM +1100, Chris Rendle-Short wrote:
> Well, I just tried running GENERIC.MPACPI like some of the others suggested,
> however it is still locking up. Here is the dmesg from GENERIC.MPACPI
> (although it looks like I might need to check my ACPI configuration in the
> BIOS).

It looks like it's using ACPI.

> I will also try a kernel with DIAGNOSTIC, DEBUG and LOCKDEBUG enabled
> as you suggested. Is it likely to matter whether or not ACPI is enabled in
> the test kernel?

Yes, these checks are independent of ACPI vs MPBIOS.

> pchb0 at pci0 dev 0 function 0
> pchb0: VIA Technologies VT82C691 (Apollo Pro) Host-PCI (rev. 0xc4)

OK, this is the same motherboard as I have here (I have several of these).
I also have issues with them; I guess the debug options will show you that
the CPU is missing IPI interrupts on occasion. If so, the attached patch
should help (my boxes are rock solid with this patch). Note that it's only
active if you have options DIAGNOSTIC in your kernel config.

Actually I suspect this is a bug in the chipset; I have Intel-based dual-PIII
motherboards which don't have this issue, nor do P4 SMP systems.

-- 
Manuel Bouyer <bouyer@...>
     NetBSD: 26 years of experience will always make the difference
--

Index: i386/pmap.c
===================================================================
RCS file: /cvsroot/src/sys/arch/i386/i386/pmap.c,v
retrieving revision 1.181.2.2
diff -u -r1.181.2.2 pmap.c
--- i386/pmap.c	26 Sep 2005 20:24:52 -0000	1.181.2.2
+++ i386/pmap.c	12 Nov 2006 10:42:15 -0000
@@ -3652,6 +3652,7 @@
 	int s;
 #ifdef DIAGNOSTIC
 	int count = 0;
+	int ipi_retry = 0;
 #endif
 #endif
 
@@ -3672,6 +3673,9 @@
 	/*
 	 * Send the TLB IPI to other CPUs pending shootdowns.
 	 */
+#ifdef DIAGNOSTIC
+ipi_again:
+#endif
 	for (CPU_INFO_FOREACH(cii, ci)) {
 		if (ci == self)
 			continue;
@@ -3683,9 +3687,20 @@
 
 	while (self->ci_tlb_ipi_mask != 0) {
 #ifdef DIAGNOSTIC
-		if (count++ > 10000000)
+		if (count++ > 10000000) {
+			for (CPU_INFO_FOREACH(cii, ci)) {
+				if (ci == self)
+					continue;
+				printf("CPU %ld interrupt level 0x%x pending "
+				    "0x%x depth %d ci_ipis %d\n", ci->ci_cpuid,
+				    ci->ci_ilevel, ci->ci_ipending,
+				    ci->ci_idepth, ci->ci_ipis);
+			}
+			if (ipi_retry++ < 5)
+				goto ipi_again;
 			panic("TLB IPI rendezvous failed (mask %x)",
 			    self->ci_tlb_ipi_mask);
+		}
 #endif
 		x86_pause();
 	}
Index: isa/npx.c
===================================================================
RCS file: /cvsroot/src/sys/arch/i386/isa/npx.c,v
retrieving revision 1.107.4.1
diff -u -r1.107.4.1 npx.c
--- isa/npx.c	12 May 2006 15:41:46 -0000	1.107.4.1
+++ isa/npx.c	12 Nov 2006 10:42:16 -0000
@@ -752,6 +752,8 @@
 	} else {
 #ifdef DIAGNOSTIC
 		int spincount;
+		int ipi_retry = 0;
+ipi_again:
 #endif
 
 		IPRINTF(("%s: fp ipi to %s %s lwp %p\n",
@@ -770,6 +772,16 @@
 #ifdef DIAGNOSTIC
 			spincount++;
 			if (spincount > 10000000) {
+				printf("CPU %ld interrupt level 0x%x pending "
+				    "0x%x depth %d ci_ipis %d\n", ci->ci_cpuid,
+				    ci->ci_ilevel, ci->ci_ipending,
+				    ci->ci_idepth, ci->ci_ipis);
+				printf("CPU %ld interrupt level 0x%x pending "
+				    "0x%x depth %d ci_ipis %d\n", oci->ci_cpuid,
+				    oci->ci_ilevel, oci->ci_ipending,
+				    oci->ci_idepth, oci->ci_ipis);
+				if (ipi_retry++ < 5)
+					goto ipi_again;
 				panic("fp_save ipi didn't");
 			}
 #endif
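
For anyone reading the patch above without a kernel tree at hand, the idea it adds (spin for a bounded time waiting for the other CPU, resend the IPI up to 5 times, and only then give up) can be sketched in user space. This is only an illustration under loud assumptions, not kernel code: struct fake_cpu, send_ipi(), rendezvous() and SPIN_LIMIT are hypothetical names made up here, the "remote CPU" is simulated by a counter of lost notifications, and returning false stands in for the kernel's panic().

```c
#include <stdbool.h>
#include <stdio.h>

#define SPIN_LIMIT  1000  /* the patch spins 10000000 times; smaller here */
#define MAX_RETRIES 5     /* same retry budget as the patch */

/* Hypothetical stand-in for the remote CPU: it "loses" the first
 * `drops` notifications and acknowledges the next one. */
struct fake_cpu {
	int drops;         /* notifications still to be lost */
	int sent;          /* notifications sent so far */
	bool ack_pending;  /* true while we wait for an acknowledgement */
};

/* Hypothetical replacement for x86_send_ipi(): count the send and
 * either lose it or acknowledge it immediately. */
static void send_ipi(struct fake_cpu *cpu)
{
	cpu->sent++;
	if (cpu->drops > 0)
		cpu->drops--;              /* lost: ack_pending stays set */
	else
		cpu->ack_pending = false;  /* delivered and acknowledged */
}

/* Returns true once acknowledged, false where the kernel would panic. */
static bool rendezvous(struct fake_cpu *cpu)
{
	int retry = 0;

	cpu->ack_pending = true;
	for (;;) {
		int spins = 0;

		send_ipi(cpu);
		/* Bounded spin wait for the acknowledgement. */
		while (cpu->ack_pending) {
			if (spins++ > SPIN_LIMIT)
				break;
		}
		if (!cpu->ack_pending)
			return true;
		fprintf(stderr, "no ack after send %d\n", cpu->sent);
		if (retry++ >= MAX_RETRIES)
			return false;  /* retry budget exhausted */
	}
}
```

In this sketch a fake_cpu that loses its first 3 notifications is acknowledged on the 4th send, while one that loses 6 in a row exhausts the initial send plus 5 retries and fails. One difference from the patch: the sketch restarts the spin counter on each retry, whereas the patch leaves count as it is when it jumps back to ipi_again.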
|
|
Re: SMP stability issues

Hi,

I'm not saying this is your problem, but a few years ago I had a VP6
that worked in single CPU mode but not dual with NetBSD 1.6.
Eventually, after having it lock up periodically, I gave up and ran
it as a single CPU machine. That lasted a couple of months before
the CPU0 socket failed entirely and the machine stopped booting at all.

No smoke. No noise. The board/cpu was just plain bad, and there was
no indication before it let go for good that this was the case.

Byron
|
|
Re: SMP stability issues

On Sun, Nov 12, 2006 at 08:55:21AM -0800, Byron Servies wrote:
> Hi,
>
> I'm not saying this is your problem, but a few years ago I had a VP6
> that worked in single CPU mode but not dual with NetBSD 1.6.
> Eventually, after having it lock up periodically, I gave up and ran
> it as a single CPU machine. That lasted a couple of months before
> the CPU0 socket failed entirely and the machine stopped booting at all.
>
> No smoke. No noise. The board/cpu was just plain bad, and there was
> no indication before it let go for good that this was the case.

I don't think it's a broken board in my case; all the Apollo Pro based
dual-PIII motherboards I tried show this behavior.

-- 
Manuel Bouyer <bouyer@...>
     NetBSD: 26 years of experience will always make the difference
--
|
|
Re: SMP stability issues

On Sun, 12 Nov 2006 08:55:21 -0800, Byron Servies <bservies@...> wrote:
> Hi,
>
> I'm not saying this is your problem, but a few years ago I had a VP6
> that worked in single CPU mode but not dual with NetBSD 1.6.
> Eventually, after having it lock up periodically, I gave up and ran
> it as a single CPU machine. That lasted a couple of months before
> the CPU0 socket failed entirely and the machine stopped booting at all.
>
> No smoke. No noise. The board/cpu was just plain bad, and there was
> no indication before it let go for good that this was the case.
>
> Byron

I'm hoping it's not something like that. I don't think it is, because it would be a bit of an unfortunate coincidence that the motherboard started to fail at the same time as I switched it from Linux to NetBSD.

Currently running GENERIC.MPDEBUG, hasn't locked up yet.

Chris.
|
|
Re: SMP stability issues

Well, I've spent the last week and a bit trying different kernels etc, and nothing seems to be working. I've tried those Via kernel patches, both with ACPI enabled and with it disabled in the kernel, but all to no avail.

Long story short, I've hit a brick wall. I think I'm going to have to put it down to an unfortunately-timed hardware failure. The machine was moved around a bit when it was making the transition from Linux to NetBSD, so there could be something there.

Anyway, thanks to those that helped and offered suggestions. I'll run the machine on one CPU until I can work out something further.

Thanks,
Chris.
|
|
Re: SMP stability issues

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Chris Rendle-Short wrote:
> Anyway, thanks to those that helped and offered suggestions. I'll run the machine on one CPU until I can work out something further.

My dual amd box is crashing during a full os build about 1 in 10 times now, with a kernel made a week ago. A kernel made 3 months ago, running on another machine with the same hardware, does not do this.

Worse, the same smp-kernel running on a single CPU box also crashes / locks up, while a non-smp does not on the same hardware.

- --Michael
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (MingW32)

iD8DBQFFZNwtuzMQWQwZDN0RAoeMAJ9yxYtuHIwfiyEMUp7rso7PeOtGegCfaEvL
3jrF9oCfWUwaH7yeXRTCPnc=
=UWax
-----END PGP SIGNATURE-----