SMP stability issues


SMP stability issues

by Chris Rendle-Short

Hi,

For the last couple of months I've been running NetBSD 3.0.1 and 3.1 (since yesterday) on an Abit VP6 SMP motherboard with two P3 866's. The system is mainly used as a mail, web, and Samba server, along with occasional other odd tasks.

When I run off the GENERIC kernel, the machine is rock solid stable. However, when I use either GENERIC.MP or my own kernel (which is basically GENERIC.MP with pcmcia and sound support removed), it invariably locks up after a time running. It is a hard lockup, nothing will revive it other than hitting the reset switch.

The uptime before the lockup has so far varied between about 1 hour and 6 days. There doesn't seem to be any pattern to it, other than the fact that it only happens when running an SMP kernel. I can't find anything in the logs to give any clues.

I'm pretty sure it's not a hardware fault, as I've tested everything I can think of. Added to that, prior to running NetBSD the box ran Linux (in SMP mode) without any problems (uptime was 193 days when I took it down to install NetBSD). The root filesystem is on RAIDFrame, if it makes any difference.

Does anyone have any ideas about what could be causing this, or any troubleshooting clues? Needless to say, it's a very irritating problem.

Thanks in advance,
Chris.


Re: SMP stability issues

by Hubert Feyrer-4

On Fri, 10 Nov 2006, Chris Rendle-Short wrote:
> Does anyone have any ideas about what could be causing this, or any
> troubleshooting clues? Needless to say, it's a very irritating problem.

What kernel are you seeing these instabilities with?


- Hubert

Re: SMP stability issues

by Chris Rendle-Short



On Fri, 10 Nov 2006 09:48:16 +0100 (CET), Hubert Feyrer <hubert@...> wrote:
> On Fri, 10 Nov 2006, Chris Rendle-Short wrote:
>> Does anyone have any ideas about what could be causing this, or any
>> troubleshooting clues? Needless to say, it's a very irritating problem.
>
> What kernel are you seeing these instabilities with?
>
>
> - Hubert

I am getting the lockups with GENERIC.MP. GENERIC has no problems.

Chris.


Re: SMP stability issues

by Hubert Feyrer-4

On Fri, 10 Nov 2006, Chris Rendle-Short wrote:
>> What kernel are you seeing these instabilities with?
>
> I am getting the lockups with GENERIC.MP. GENERIC has no problems.

From what NetBSD Version - latest -current, 3.1, ...?


  - Hubert

Re: SMP stability issues

by Chris Rendle-Short



On Fri, 10 Nov 2006 10:13:39 +0100 (CET), Hubert Feyrer <hubert@...> wrote:
> On Fri, 10 Nov 2006, Chris Rendle-Short wrote:
>>> What kernel are you seeing these instabilities with?
>>
>> I am getting the lockups with GENERIC.MP. GENERIC has no problems.
>
> From what NetBSD Version - latest -current, 3.1, ...?
>
>
>   - Hubert

I first installed from 3.0.1, and have recently upgraded to 3.1 by following netbsd-3. The stability problem has occurred in both 3.0.1 and 3.1.

Chris.


Re: SMP stability issues

by SODA Noriyuki

>>>>> On Fri, 10 Nov 2006 21:10:56 +1100,
      Chris Rendle-Short <jim@...> said:

> I first installed from 3.0.1, and have recently upgraded to 3.1 by
> following netbsd-3. The stability problem has occurred in both 3.0.1
> and 3.1

Have you tried some hardware diag tool like memtest86+?

We recently had a stability issue, and we didn't think of it as a
hardware problem, since the machine had worked fine until we
replaced its kernel. (The machine also did run memtest86+ fine,
when it was bought).
But actually it was a hardware issue in our case.  When we tried
memtest86+ finally, lots of RAM problems were detected.

FWIW, 3.x kernels run fine with the GENERIC.MPACPI configuration
on my Athlon 64 X2.
--
soda

Re: SMP stability issues

by Chris Rendle-Short



On Fri, 10 Nov 2006 19:22:40 +0900, SODA Noriyuki <soda@...> wrote:

>
> Have you tried some hardware diag tool like memtest86+?
>
> We recently had a stability issue, and we didn't think of it as a
> hardware problem, since the machine had worked fine until we
> replaced its kernel. (The machine also did run memtest86+ fine,
> when it was bought).
> But actually it was a hardware issue in our case.  When we tried
> memtest86+ finally, lots of RAM problems were detected.
>
> FWIW, 3.x kernels run fine with the GENERIC.MPACPI configuration
> on my Athlon 64 X2.
> --
> soda

Yes, I've run memtest86 and memtest86+ and found no errors. I've also checked all the HDDs with MHDD (http://hddguru.com/content/en/software/2005.10.02-MHDD/), and found no problems.

I haven't tried GENERIC.MPACPI yet. As I understood it, it is mainly intended for dual-core systems and systems with HT. As mine doesn't have HT and has two physically separate CPUs, I went with GENERIC.MP.

I have also noticed a SWINGER kernel config file, with an accompanying SWINGER.MP. I'm not too certain why SWINGER is included in the source tree, but it is described as "thorpej's Abit BP6+dual Celeron". That is interesting because the motherboard I am using, an Abit VP6, is the successor to the BP6. I can't find anything special about SWINGER though, except that it is customised to only work on the BP6.

Chris.


Re: SMP stability issues

by Perry E. Metzger


Chris Rendle-Short <jim@...> writes:
> I haven't tried GENERIC.MPACPI yet. As I understood it, it is mainly
> intended for dual-core systems and systems with HT.

Not really -- it is for machines that have ACPI. In general, I've been
finding of late that many ACPI supporting boxes just don't run right
if you don't use ACPI...

Perry

Re: SMP stability issues

by Patrick Welche

On Fri, Nov 10, 2006 at 01:16:52PM -0500, Perry E. Metzger wrote:
>
> Chris Rendle-Short <jim@...> writes:
> > I haven't tried GENERIC.MPACPI yet. As I understood it, it is mainly
> > intended for dual-core systems and systems with HT.
>
> Not really -- it is for machines that have ACPI. In general, I've been
> finding of late that many ACPI supporting boxes just don't run right
> if you don't use ACPI...

.. to the extent of often needing to build custom INSTALL kernels
with acpi on i386..

Cheers,

Patrick

Re: SMP stability issues

by Perry E. Metzger


Patrick Welche <prlw1@...> writes:

> On Fri, Nov 10, 2006 at 01:16:52PM -0500, Perry E. Metzger wrote:
>> Chris Rendle-Short <jim@...> writes:
>> > I haven't tried GENERIC.MPACPI yet. As I understood it, it is mainly
>> > intended for dual-core systems and systems with HT.
>>
>> Not really -- it is for machines that have ACPI. In general, I've been
>> finding of late that many ACPI supporting boxes just don't run right
>> if you don't use ACPI...
>
> .. to the extent of often needing to build custom INSTALL kernels
> with acpi on i386..

Indeed. I've had horrible instability trying to do installs without
ACPI a few times.

Anyway, I think the general message is "if your machine has ACPI,
try things with ACPI on if you're having problems -- manufacturers
aren't testing non-ACPI very well any more."

Perry

Re: SMP stability issues

by Lars Nordlund

On Fri, 10 Nov 2006 19:27:22 +1100
Chris Rendle-Short <jim@...> wrote:
> Hi,
>
> For the last couple of months I've been running NetBSD 3.0.1 and 3.1 (since yesterday) on an Abit VP6 SMP motherboard with two P3 866's. The system is mainly used as a mail, web, and Samba server, along with occasional other odd tasks.

Ah, the good old Abit VP6 motherboard. I have one myself. It was very
cheap and stable. At least until the capacitors started to explode.. I
had them replaced but it was never quite as stable as it once had been.
I could not get it to draw graphics without hanging rock solid after a
few seconds. No matter what graphics card I used (or OS). I also tried
switching power supply but it did not help. In textmode it was kind of
ok. Could multi-job compile for several days without showing any
problems.

These days one of the CPUs is doing its duty in another (non-SMP)
motherboard, and the other is resting peacefully in the ever growing
pile of old junk in my home lab..


Best regards,
        Lars Nordlund

Re: SMP stability issues

by Chris Rendle-Short



On Fri, 10 Nov 2006 22:14:47 +0100, Lars Nordlund <lars.nordlund@...> wrote:

>
> Ah, the good old Abit VP6 motherboard. I have one myself. It was very
> cheap and stable. At least until the capacitors started to explode.. I
> had them replaced but it was never quite as stable as it once had been.
> I could not get it to draw graphics without hanging rock solid after a
> few seconds. No matter what graphics card I used (or OS). I also tried
> switching power supply but it did not help. In textmode it was kind of
> ok. Could multi-job compile for several days without showing any
> problems.
>
> These days one of the CPUs is doing its duty in another (non-SMP)
> motherboard, and the other is resting peacefully in the ever growing
> pile of old junk in my home lab..
>
>
> Best regards,
> Lars Nordlund

Yes, I had capacitor problems too. It drove me crazy trying to work out what was going on until I read about the Capacitor Problem (http://www.dashdist.com/1u2u/company/capacitor.html). I replaced them all, and it has been perfectly stable ever since. Well, under Linux anyway (and NetBSD on one CPU).


On Fri, 10 Nov 2006 13:52:55 -0500, "Perry E. Metzger" <perry@...> wrote:

>
>>> Not really -- it is for machines that have ACPI. In general, I've been
>>> finding of late that many ACPI supporting boxes just don't run right
>>> if you don't use ACPI...
>>
>> .. to the extent of often needing to build custom INSTALL kernels
>> with acpi on i386..
>
> Indeed. I've had horrible instability trying to do installs without
> ACPI a few times.
>
> Anyway, I think the general message is "if your machine has ACPI,
> try things with ACPI on if you're having problems -- manufacturers
> aren't testing non-ACPI very well any more."
>
> Perry

Ah, now this is interesting. I did not realise that there were problems running a non-ACPI kernel on an ACPI system.

Thanks for the info guys, I will build GENERIC.MPACPI and see how it goes.

Chris.


Re: SMP stability issues

by Manuel Bouyer

On Fri, Nov 10, 2006 at 07:27:22PM +1100, Chris Rendle-Short wrote:

> Hi,
>
> For the last couple of months I've been running NetBSD 3.0.1 and 3.1 (since yesterday) on an Abit VP6 SMP motherboard with two P3 866's. The system is mainly used as a mail, web, and Samba server, along with occasional other odd tasks.
>
> When I run off the GENERIC kernel, the machine is rock solid stable. However, when I use either GENERIC.MP or my own kernel (which is basically GENERIC.MP with pcmcia and sound support removed), it invariably locks up after a time running. It is a hard lockup, nothing will revive it other than hitting the reset switch.
>
> The uptime before the lockup has so far varied between about 1 hour and 6 days. There doesn't seem to be any pattern to it, other than the fact that it only happens when running an SMP kernel. I can't find anything in the logs to give any clues.
>
> I'm pretty sure it's not a hardware fault, as I've tested everything I can think of. Added to that, prior to running NetBSD the box ran Linux (in SMP mode) without any problems (uptime was 193 days when I took it down to install NetBSD). The root filesystem is on RAIDFrame, if it makes any difference.
>
> Does anyone have any ideas about what could be causing this, or any troubleshooting clues? Needless to say, it's a very irritating problem.

What chipset does this motherboard have? Can you post the dmesg?

Also, you could try to build a kernel with DIAGNOSTIC, DEBUG and LOCKDEBUG
options. A hard hang like that could be a deadlock in the kernel;
one of these options may help to find what's going on.

--
Manuel Bouyer <bouyer@...>
     NetBSD: 26 ans d'experience feront toujours la difference
--
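
For readers wondering what kind of mistake lock-debugging checks are meant to catch, here is a minimal, single-threaded C sketch. It is not NetBSD's LOCKDEBUG code, and every name in it (`debug_lock`, `debug_lock_try`, `debug_lock_release`, the `lock_result` values) is hypothetical. The assumption is only that a debug lock remembers which CPU holds it, so that a CPU re-acquiring its own lock (which would spin forever and look like a hard hang) or releasing a lock it does not hold can be reported instead of passing silently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_OWNER (-1)

/* Outcomes a debug lock can report; names are illustrative only. */
enum lock_result {
    LOCK_OK,            /* operation succeeded */
    LOCK_SELF_DEADLOCK, /* caller already holds the lock: would spin forever */
    LOCK_BUSY,          /* another CPU holds the lock */
    LOCK_BAD_UNLOCK     /* release by a CPU that does not hold the lock */
};

/* A lock that remembers its holder. Not atomic: illustration only. */
struct debug_lock {
    bool held;
    int  owner;         /* CPU id of the holder, or NO_OWNER */
};

void debug_lock_init(struct debug_lock *l)
{
    assert(l != NULL);
    l->held = false;
    l->owner = NO_OWNER;
}

/* Try to take the lock on behalf of CPU `cpu` and classify the outcome. */
enum lock_result debug_lock_try(struct debug_lock *l, int cpu)
{
    assert(l != NULL && cpu >= 0);
    if (l->held)
        return (l->owner == cpu) ? LOCK_SELF_DEADLOCK : LOCK_BUSY;
    l->held = true;
    l->owner = cpu;
    return LOCK_OK;
}

/* Release the lock; only the recorded holder is allowed to do so. */
enum lock_result debug_lock_release(struct debug_lock *l, int cpu)
{
    assert(l != NULL);
    if (!l->held || l->owner != cpu)
        return LOCK_BAD_UNLOCK;
    l->held = false;
    l->owner = NO_OWNER;
    return LOCK_OK;
}
```

In this sketch, a second `debug_lock_try(&l, 0)` by the CPU that already holds `l` returns `LOCK_SELF_DEADLOCK` rather than spinning. A real kernel would use atomic operations and would typically print a diagnostic or panic at that point.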

Re: SMP stability issues

by Chris Rendle-Short

On Sun, 12 Nov 2006 00:11:39 +0100, Manuel Bouyer <bouyer@...> wrote:

> What chipset does this motherboard have? Can you post the dmesg?
>
> Also, you could try to build a kernel with DIAGNOSTIC, DEBUG and LOCKDEBUG
> options. A hard hang like that could be a deadlock in the kernel;
> one of these options may help to find what's going on.
>
> --
> Manuel Bouyer <bouyer@...>
>      NetBSD: 26 ans d'experience feront toujours la difference
> --

Well, I just tried running GENERIC.MPACPI like some of the others suggested; however, it is still locking up. Here is the dmesg from GENERIC.MPACPI (although it looks like I might need to check my ACPI configuration in the BIOS). I will also try a kernel with DIAGNOSTIC, DEBUG and LOCKDEBUG enabled as you suggested. Is it likely to matter whether or not ACPI is enabled in the test kernel?

Thanks,
Chris.


NetBSD 3.1_STABLE (GENERIC.MPACPI) #0: Sat Nov 11 10:18:33 EST 2006
        jim@...:/usr/src/sys/arch/i386/compile/GENERIC.MPACPI
total memory = 511 MB
avail memory = 492 MB
BIOS32 rev. 0 found at 0xfb340
mainbus0 (root)
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel Pentium III (686-class), 865.29 MHz, id 0x686
cpu0: features 387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features 387fbff<PGE,MCA,CMOV,PAT,PSE36,PN,MMX>
cpu0: features 387fbff<FXSR,SSE>
cpu0: I-cache 16 KB 32B/line 4-way, D-cache 16 KB 32B/line 4-way
cpu0: L2 cache 256 KB 32B/line 8-way
cpu0: ITLB 32 4 KB entries 4-way, 2 4 MB entries fully associative
cpu0: DTLB 64 4 KB entries 4-way, 8 4 MB entries 4-way
cpu0: serial number 0000-0686-0003-8754-64B0-C402
cpu0: calibrating local timer
cpu0: apic clock running at 133 MHz
cpu0: 8 page colors
cpu1 at mainbus0: apid 1 (application processor)
cpu1: starting
cpu1: Intel Pentium III (686-class), 865.25 MHz, id 0x686
cpu1: features 387fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu1: features 387fbff<PGE,MCA,CMOV,PAT,PSE36,PN,MMX>
cpu1: features 387fbff<FXSR,SSE>
cpu1: I-cache 16 KB 32B/line 4-way, D-cache 16 KB 32B/line 4-way
cpu1: L2 cache 256 KB 32B/line 8-way
cpu1: ITLB 32 4 KB entries 4-way, 2 4 MB entries fully associative
cpu1: DTLB 64 4 KB entries 4-way, 8 4 MB entries 4-way
cpu1: serial number 0000-0686-0003-FE6E-375F-F1CD
ioapic0 at mainbus0 apid 2 (I/O APIC)
ioapic0: pa 0xfec00000, version 11, 24 pins
acpi0 at mainbus0
acpi0: using Intel ACPI CA subsystem version 20040211
acpi0: X/RSDT: OemId <VIA694,AWRDACPI,42302e31>, AslId <AWRD,00000000>
acpi0: SCI interrupting at int 9
acpi0: fixed-feature power button present
mpacpi: could not get bus number, assuming bus 0
ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
ACPI Object Type 'Processor' (0x0c) at acpi0 not configured
acpibut0 at acpi0 (PNP0C0C): ACPI Power Button
PNP0C01 [System Board] at acpi0 not configured
PNP0A03 [PCI Bus] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C0F [PCI interrupt link device] at acpi0 not configured
PNP0C02 [Plug and Play motherboard register resources] at acpi0 not configured
PNP0000 [AT Interrupt Controller] at acpi0 not configured
PNP0200 [AT DMA Controller] at acpi0 not configured
PNP0100 [AT Timer] at acpi0 not configured
PNP0B00 [AT Real-Time Clock] at acpi0 not configured
PNP0800 [AT-style speaker sound] at acpi0 not configured
npx0 at acpi0 (PNP0C04)
npx0: io 0xf0-0xff irq 13
npx0: using exception 16
com0 at acpi0 (PNP0501-1)
com0: io 0x3f8-0x3ff irq 4
com0: ns16550a, working fifo
com1 at acpi0 (PNP0501-2)
com1: io 0x2f8-0x2ff irq 3
com1: ns16550a, working fifo
lpt0 at acpi0 (PNP0401)
lpt0: io 0x378-0x37f,0x778-0x77b irq 7 drq 3
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0
pchb0: VIA Technologies VT82C691 (Apollo Pro) Host-PCI (rev. 0xc4)
agp0 at pchb0: aperture at 0xd0000000, size 0xf000000
ppb0 at pci0 dev 1 function 0: VIA Technologies VT82C598 (Apollo MVP3) CPU-AGP Bridge (rev. 0x00)
pci1 at ppb0 bus 1
pci1: i/o space, memory space enabled
vga0 at pci1 dev 0 function 0: Silicon Integrated System 6326 AGP VGA (rev. 0x0b)
wsdisplay0 at vga0 kbdmux 1: console (80x25, vt100 emulation)
wsmux1: connecting to wsdisplay0
pcib0 at pci0 dev 7 function 0
pcib0: VIA Technologies VT82C686A PCI-ISA Bridge (rev. 0x40)
viaide0 at pci0 dev 7 function 1
viaide0: VIA Technologies VT82C686A (Apollo KX133) ATA100 controller
viaide0: bus-master DMA support present
viaide0: primary channel configured to compatibility mode
viaide0: primary channel interrupting at ioapic0 pin 14 (irq 14)
atabus0 at viaide0 channel 0
viaide0: secondary channel configured to compatibility mode
viaide0: secondary channel interrupting at ioapic0 pin 15 (irq 15)
atabus1 at viaide0 channel 1
uhci0 at pci0 dev 7 function 2: VIA Technologies VT83C572 USB Controller (rev. 0x16)
uhci0: interrupting at ioapic0 pin 12 (irq 12)
usb0 at uhci0: USB revision 1.0
uhub0 at usb0
uhub0: VIA Technologies UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1 at pci0 dev 7 function 3: VIA Technologies VT83C572 USB Controller (rev. 0x16)
uhci1: interrupting at ioapic0 pin 12 (irq 12)
usb1 at uhci1: USB revision 1.0
uhub1 at usb1
uhub1: VIA Technologies UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
VIA Technologies VT82C686A SMBus Controller (miscellaneous bridge, revision 0x40) at pci0 dev 7 function 4 not configured
adv1 at pci0 dev 9 function 0: AdvanSys ABP-9xxU SCSI adapter
adv1: interrupting at ioapic0 pin 16 (irq 11)
scsibus0 at adv1: 8 targets, 8 luns per target
ex0 at pci0 dev 13 function 0: 3Com 3cSOHO100-TX 10/100 Ethernet (rev. 0x30)
ex0: interrupting at ioapic0 pin 18 (irq 10)
ex0: MAC address 00:04:76:36:cf:be
ex0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, default 10baseT
hptide0 at pci0 dev 14 function 0
hptide0: Triones/Highpoint HPT370 IDE Controller
hptide0: bus-master DMA support present
hptide0: primary channel wired to native-PCI mode
hptide0: using ioapic0 pin 18 (irq 10) for native-PCI interrupt
atabus2 at hptide0 channel 0
hptide0: secondary channel wired to native-PCI mode
atabus3 at hptide0 channel 1
isa0 at pcib0
pcppi0 at isa0 port 0x61
midi0 at pcppi0: PC speaker
sysbeep0 at pcppi0
isapnp0 at isa0 port 0x279: ISA Plug 'n Play device support
isapnp0: no ISA Plug 'n Play devices found
ioapic0: enabling
Kernelized RAIDframe activated
scsibus0: waiting 2 seconds for devices to settle...
wd0 at atabus0 drive 0: <Maxtor 6E030L0>
wd0: drive supports 16-sector PIO transfers, LBA addressing
wd0: 29325 MB, 59582 cyl, 16 head, 63 sec, 512 bytes/sect x 60058656 sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd0(viaide0:0:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA)
wd1 at atabus1 drive 0: <Maxtor 6E030L0>
wd1: drive supports 16-sector PIO transfers, LBA addressing
wd1: 29325 MB, 59582 cyl, 16 head, 63 sec, 512 bytes/sect x 60058656 sectors
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd1(viaide0:1:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA)
wd2 at atabus2 drive 0: <WDC WD800BB-00FJA0>
wd2: drive supports 16-sector PIO transfers, LBA addressing
wd2: 76319 MB, 155061 cyl, 16 head, 63 sec, 512 bytes/sect x 156301488 sectors
wd2: 32-bit data port
wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd3 at atabus2 drive 1: <WDC WD800BB-00JHA0>
wd3: drive supports 16-sector PIO transfers, LBA addressing
wd3: 76319 MB, 155061 cyl, 16 head, 63 sec, 512 bytes/sect x 156301488 sectors
wd3: 32-bit data port
wd3: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd2(hptide0:0:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA)
wd3(hptide0:0:1): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA)
wd4 at atabus3 drive 0: <WDC WD1600JB-00GVC0>
wd4: drive supports 16-sector PIO transfers, LBA48 addressing
wd4: 149 GB, 310101 cyl, 16 head, 63 sec, 512 bytes/sect x 312581808 sectors
wd4: 32-bit data port
wd4: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd4(hptide0:1:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA)
st0 at scsibus0 target 6 lun 0: <ARCHIVE, Python 02635-XXX, 596A> tape removable
st0: drive empty
st0: sync (100.00ns offset 15), 8-bit (10.000MB/s) transfers
raid0: RAID Level 1
raid0: Components: /dev/wd0a /dev/wd1a
raid0: Total Sectors: 60058496 (29325 MB)
boot device: raid0
root on raid0a dumps on raid0b
root file system type: ffs
cpu1: CPU 1 running
wsdisplay0: screen 1 added (80x25, vt100 emulation)
wsdisplay0: screen 2 added (80x25, vt100 emulation)
wsdisplay0: screen 3 added (80x25, vt100 emulation)
wsdisplay0: screen 4 added (80x25, vt100 emulation)

[EOF]


Re: SMP stability issues

by Manuel Bouyer

On Sun, Nov 12, 2006 at 01:23:34PM +1100, Chris Rendle-Short wrote:
> Well, I just tried running GENERIC.MPACPI like some of the others suggested,
> however it is still locking up. Here is the dmesg from GENERIC.MPACPI
> (although it looks like I might need to check my ACPI configuration in the
> BIOS).

It looks like it's using ACPI

> I will also try a kernel with DIAGNOSTIC, DEBUG and LOCKDEBUG enabled
> as you suggested. Is it likely to matter whether or not ACPI is enabled in
> the test kernel?

Yes, these checks are independent of ACPI vs MPBIOS

> pchb0 at pci0 dev 0 function 0
> pchb0: VIA Technologies VT82C691 (Apollo Pro) Host-PCI (rev. 0xc4)

OK, this is the same motherboard as I have here (I have several of these). I
also have issues with them; I guess the debug options
will show you that the CPU is missing IPI interrupts on occasion.
If so, the attached patch should help (my boxes are rock solid with this
patch). Note that it's only active if you have
options DIAGNOSTIC
in your kernel config.
Actually I suspect this is a bug in the chipset; I have Intel-based dual-PIII
motherboards which don't have this issue, nor do P4 SMP systems.

--
Manuel Bouyer <bouyer@...>
     NetBSD: 26 ans d'experience feront toujours la difference
--

Index: i386/pmap.c
===================================================================
RCS file: /cvsroot/src/sys/arch/i386/i386/pmap.c,v
retrieving revision 1.181.2.2
diff -u -r1.181.2.2 pmap.c
--- i386/pmap.c 26 Sep 2005 20:24:52 -0000 1.181.2.2
+++ i386/pmap.c 12 Nov 2006 10:42:15 -0000
@@ -3652,6 +3652,7 @@
  int s;
 #ifdef DIAGNOSTIC
  int count = 0;
+ int ipi_retry = 0;
 #endif
 #endif
 
@@ -3672,6 +3673,9 @@
  /*
  * Send the TLB IPI to other CPUs pending shootdowns.
  */
+#ifdef DIAGNOSTIC
+ipi_again:
+#endif
  for (CPU_INFO_FOREACH(cii, ci)) {
  if (ci == self)
  continue;
@@ -3683,9 +3687,20 @@
 
  while (self->ci_tlb_ipi_mask != 0) {
 #ifdef DIAGNOSTIC
- if (count++ > 10000000)
+ if (count++ > 10000000) {
+ for (CPU_INFO_FOREACH(cii, ci)) {
+ if (ci == self)
+ continue;
+ printf("CPU %ld interrupt level 0x%x pending "
+    "0x%x depth %d ci_ipis %d\n", ci->ci_cpuid,
+    ci->ci_ilevel, ci->ci_ipending,
+    ci->ci_idepth, ci->ci_ipis);
+ }
+ if (ipi_retry++ < 5)
+ goto ipi_again;
  panic("TLB IPI rendezvous failed (mask %x)",
     self->ci_tlb_ipi_mask);
+ }
 #endif
  x86_pause();
  }
Index: isa/npx.c
===================================================================
RCS file: /cvsroot/src/sys/arch/i386/isa/npx.c,v
retrieving revision 1.107.4.1
diff -u -r1.107.4.1 npx.c
--- isa/npx.c 12 May 2006 15:41:46 -0000 1.107.4.1
+++ isa/npx.c 12 Nov 2006 10:42:16 -0000
@@ -752,6 +752,8 @@
  } else {
 #ifdef DIAGNOSTIC
  int spincount;
+ int ipi_retry = 0;
+ipi_again:
 #endif
 
  IPRINTF(("%s: fp ipi to %s %s lwp %p\n",
@@ -770,6 +772,16 @@
 #ifdef DIAGNOSTIC
  spincount++;
  if (spincount > 10000000) {
+ printf("CPU %ld interrupt level 0x%x pending "
+    "0x%x depth %d ci_ipis %d\n", ci->ci_cpuid,
+    ci->ci_ilevel, ci->ci_ipending,
+    ci->ci_idepth, ci->ci_ipis);
+ printf("CPU %ld interrupt level 0x%x pending "
+    "0x%x depth %d ci_ipis %d\n", oci->ci_cpuid,
+    oci->ci_ilevel, oci->ci_ipending,
+    oci->ci_idepth, oci->ci_ipis);
+ if (ipi_retry++ < 5)
+ goto ipi_again;
  panic("fp_save ipi didn't");
  }
 #endif
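
The retry idea in the patch above can be illustrated outside the kernel with a small, deterministic C sketch. This is not the kernel code: `fake_cpu`, `send_ipi` and `ipi_rendezvous` are hypothetical names, the "lost IPI" is simulated with a counter, and the spin limit is shrunk from the patch's 10000000. It also resets the spin counter on each attempt, which is a simplification. The only thing carried over from the patch is the shape of the logic: wait a bounded time for an acknowledgement, resend up to 5 times, and only then give up (where the kernel would panic). In the real patch this logic is compiled in only under `#ifdef DIAGNOSTIC`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SPIN_LIMIT 1000  /* the patch uses 10000000 */
#define MAX_RETRY  5     /* same resend bound as the patch */

/* Hypothetical stand-in for a remote CPU behind a chipset that loses IPIs. */
struct fake_cpu {
    int  drops_left;     /* how many upcoming IPIs will be lost */
    bool ack;            /* set once an IPI is actually delivered */
    int  sends;          /* total IPIs sent, for inspection */
};

/* Send one simulated IPI; it is either lost or acknowledged. */
void send_ipi(struct fake_cpu *cpu)
{
    cpu->sends++;
    if (cpu->drops_left > 0)
        cpu->drops_left--;   /* lost: no acknowledgement will arrive */
    else
        cpu->ack = true;     /* delivered: the remote CPU acknowledges */
}

/*
 * Wait for the remote CPU to acknowledge, resending the IPI when the
 * bounded wait expires. Returns true on acknowledgement and false at
 * the point where the kernel would panic.
 */
bool ipi_rendezvous(struct fake_cpu *cpu)
{
    int retry = 0;

    assert(cpu != NULL);
    cpu->ack = false;
    for (;;) {
        send_ipi(cpu);
        /* Bounded spin; the kernel calls x86_pause() in this loop. */
        for (int spin = 0; spin < SPIN_LIMIT && !cpu->ack; spin++)
            ;
        if (cpu->ack)
            return true;
        if (retry++ >= MAX_RETRY)
            return false;    /* initial send plus MAX_RETRY resends all failed */
    }
}
```

With `drops_left` set to 5, the sixth send gets through and `ipi_rendezvous` returns true. With more drops than that, it returns false after six sends, which corresponds to the patched kernel finally reaching its `panic()`.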

Re: SMP stability issues

by Byron Servies

Hi,

I'm not saying this is your problem, but a few years ago I had a VP6  
that worked in single CPU mode but not dual with NetBSD 1.6.  
Eventually, after having it lock up periodically, I gave up and ran  
it as a single CPU machine.  That lasted a couple of months before  
the CPU0 socket failed entirely and the machine stopped booting at all.

No smoke.  No noise.  The board/cpu was just plain bad, and there was  
no indication before it let go for good that this was the case.

Byron


Re: SMP stability issues

by Manuel Bouyer

On Sun, Nov 12, 2006 at 08:55:21AM -0800, Byron Servies wrote:

> Hi,
>
> I'm not saying this is your problem, but a few years ago I had a VP6  
> that worked in single CPU mode but not dual with NetBSD 1.6.  
> Eventually, after having it lock up periodically, I gave up and ran  
> it as a single CPU machine.  That lasted a couple of months before  
> the CPU0 socket failed entirely and the machine stopped booting at all.
>
> No smoke.  No noise.  The board/cpu was just plain bad, and there was  
> no indication before it let go for good that this was the case.

I don't think it's a broken board in my case; all the Apollo Pro-based
dual-PIII motherboards I tried show this behavior.

--
Manuel Bouyer <bouyer@...>
     NetBSD: 26 ans d'experience feront toujours la difference
--

Re: SMP stability issues

by Chris Rendle-Short

On Sun, 12 Nov 2006 08:55:21 -0800, Byron Servies <bservies@...> wrote:

> Hi,
>
> I'm not saying this is your problem, but a few years ago I had a VP6
> that worked in single CPU mode but not dual with NetBSD 1.6.
> Eventually, after having it lock up periodically, I gave up and ran
> it as a single CPU machine.  That lasted a couple of months before
> the CPU0 socket failed entirely and the machine stopped booting at all.
>
> No smoke.  No noise.  The board/cpu was just plain bad, and there was
> no indication before it let go for good that this was the case.
>
> Byron

I'm hoping it's not something like that. I don't think it is, because it would be a bit of an unfortunate coincidence if the motherboard started to fail at the same time as I switched it from Linux to NetBSD.

Currently running GENERIC.MPDEBUG, hasn't locked up yet.

Chris.


Re: SMP stability issues

by Chris Rendle-Short

Well, I've spent the last week and a bit trying different kernels etc., and nothing seems to be working. I've tried those VIA kernel patches, both with ACPI enabled and disabled in the kernel, but all to no avail.

Long story short, I've hit a brick wall. I think I'm going to have to put it down to an unfortunately-timed hardware failure. The machine was moved around a bit when it was making the transition from Linux to NetBSD, so there could be something there.

Anyway, thanks to those that helped and offered suggestions. I'll run the machine on one CPU until I can work out something further.

Thanks,
Chris.


Re: SMP stability issues

by Michael Graff-2

Chris Rendle-Short wrote:
> Anyway, thanks to those that helped and offered suggestions. I'll run the machine on one CPU until I can work out something further.

My dual amd box is crashing during a full os build about 1 in 10 times
now, with a kernel made a week ago.

A kernel made 3 months ago, running on another machine with the same
hardware, does not do this.

Worse, the same smp-kernel running on a single CPU box also crashes /
locks up, while a non-smp does not on the same hardware.

--Michael