Re: port-alpha/42174

The following reply was made to PR port-alpha/42174; it has been noted by GNATS.

From: Dustin Marquess <dmarquess@...>
To: gnats-bugs@...
Cc:
Subject: Re: port-alpha/42174
Date: Sun, 25 Oct 2009 12:45:54 -0500

 I've triggered this bug about 4 times now, and every single time it
 hits, when I reboot the disklabel on the drive is completely gone.
 This seems like a huge coincidence to me.
 

The following reply was made to PR port-alpha/42174; it has been noted by GNATS.

From: "Michael L. Hitch" <mhitch@...>
To: gnats-bugs@...
Cc: gnats-admin@..., dmarquess@...
Subject: Re: port-alpha/42174
Date: Mon, 26 Oct 2009 11:50:24 -0600 (MDT)

 On Sat, 17 Oct 2009, Dustin Marquess wrote:
 
 > Got the machine to deadlock again.  Used the halt button to get into
 > ddb and this is the stack trace that I got:
 >
 > db{0}> trace
 > cpu_Debugger() at netbsd:cpu_Debugger+0x4
 > panic() at netbsd:panic+0x244
 > console_restart() at netbsd:console_restart+0x78
 > XentRestart() at netbsd:XentRestart+0x90
 > --- console restart (from ipl 4) ---
 > nullop() at netbsd:nullop
 > _kernel_lock() at netbsd:_kernel_lock+0x1c0
 
     This would indicate that CPU0 is attempting to acquire the kernel lock,
 but presumably CPU1 currently has it and is unable to release it for some
 reason.
 
    I don't know of any easy way to determine what CPU1 is currently doing.
 It is most likely unable to process the IPI interrupt sent by entering DDB
 and pausing.  Even if it was able to, something is broken in looking at a
 backtrace in the other cpus, although I think the register contents can be
 displayed when the cpu is able to respond to the IPI pause request.
 
    A LOCKDEBUG kernel would provide more information on what is currently
 locked, but I don't think that's working on the alpha at the moment.
 
 
 
 --
 Michael L. Hitch mhitch@...
 Computer Consultant
 Information Technology Center
 Montana State University Bozeman, MT USA
 

The following reply was made to PR port-alpha/42174; it has been noted by GNATS.

From: David Holland <dholland-bugs@...>
To: gnats-bugs@...
Cc:
Subject: Re: port-alpha/42174
Date: Fri, 30 Oct 2009 06:40:35 +0000

 On Sun, Oct 25, 2009 at 05:55:02PM +0000, Dustin Marquess wrote:
  >  I've triggered this bug about 4 times now, and every single time it
  >  hits, when I reboot the disklabel on the drive is completely gone.
  >  This seems like a huge coincidence to me.
 
 The *disklabel*? Properly speaking that shouldn't be possible - the
 disklabel should be read-only during all normal operation...
 
 --
 David A. Holland
 dholland@...
 

The following reply was made to PR port-alpha/42174; it has been noted by GNATS.

From: "Michael L. Hitch" <mhitch@...>
To: gnats-bugs@...
Cc: dmarquess@...
Subject: Re: port-alpha/42174
Date: Sun, 1 Nov 2009 20:36:38 -0700 (MST)

 On Mon, 26 Oct 2009, Michael L. Hitch wrote:
 
 >    A LOCKDEBUG kernel would provide more information on what is currently
 > locked, but I don't think that's working on the alpha at the moment.
 
    I've committed a fix for the LOCKDEBUG problem on the CS20/DS20L, and it
 has been pulled up to the netbsd-5 branch.  One of the things that
 LOCKDEBUG will do is to limit how long the cpu spins waiting for a lock,
 so it should eventually panic instead of hanging forever deadlocked.  It also
 keeps more information about the locks, including which cpu holds the lock
 and the address where the lock was acquired.
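
The two LOCKDEBUG behaviours described above (bounded spinning, and recording which CPU took a lock and where) can be sketched in user-space C. This is only an illustration of the idea, not the NetBSD implementation: `struct dbg_lock`, `dbg_lock_enter()`, `dbg_lock_exit()` and the `SPIN_LIMIT` value are all hypothetical names chosen for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SPIN_LIMIT 1000000UL    /* hypothetical bound on spin iterations */
#define NO_CPU     (-1)

struct dbg_lock {
    atomic_flag locked;         /* the lock word itself */
    int         holder_cpu;     /* CPU currently holding the lock, or NO_CPU */
    const char *where;          /* label of the most recent acquisition site */
};

void dbg_lock_init(struct dbg_lock *l)
{
    atomic_flag_clear(&l->locked);
    l->holder_cpu = NO_CPU;
    l->where = NULL;
}

/*
 * Try to take the lock, giving up after SPIN_LIMIT attempts.
 * Returns false on "spinout"; a kernel would panic at that point and
 * print holder_cpu and where instead of hanging forever.
 */
bool dbg_lock_enter(struct dbg_lock *l, int cpu, const char *where)
{
    unsigned long spins = 0;

    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire)) {
        if (++spins >= SPIN_LIMIT)
            return false;
    }
    l->holder_cpu = cpu;
    l->where = where;
    return true;
}

void dbg_lock_exit(struct dbg_lock *l)
{
    l->holder_cpu = NO_CPU;     /* 'where' is kept as the last-locked site */
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

If a second `dbg_lock_enter()` is attempted while the lock is still held, it returns false after `SPIN_LIMIT` iterations, and `holder_cpu` and `where` still identify the current owner.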
 
 
 --
 Michael L. Hitch mhitch@...
 Computer Consultant
 Information Technology Center
 Montana State University Bozeman, MT USA
 

The following reply was made to PR port-alpha/42174; it has been noted by GNATS.

From: Dustin Marquess <dmarquess@...>
To: "Michael L. Hitch" <mhitch@...>
Cc: gnats-bugs@...
Subject: Re: port-alpha/42174
Date: Mon, 2 Nov 2009 11:42:08 -0600

 I just hit:
 
 panic: kernel diagnostic assertion "pcg->pcg_avail == 0" failed: file
 "../../../../kern/subr_pool.c", line 2580
 Stopped in pid 0.12 (system) at netbsd:cpu_Debugger+0x4:        ret     zero,(ra)
 db{1}> bt
 cpu_Debugger() at netbsd:cpu_Debugger+0x4
 panic() at netbsd:panic+0x244
 __kernassert() at netbsd:__kernassert+0x4c
 pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x2f0
 pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x1b0
 pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x174
 alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
 interrupt() at netbsd:interrupt+0x84
 XentInt() at netbsd:XentInt+0x1c
 --- interrupt (from ipl 0) ---
 idle_loop() at netbsd:idle_loop+0x19c
 cpu_spinup_trampoline() at netbsd:cpu_spinup_trampoline+0x5c
 --- root of call graph ---
 
 Not sure if that's related to DIAGNOSTIC or LOCKDEBUG.  If it's
 related to DIAGNOSTIC I'll recompile with that disabled for now.
 

The following reply was made to PR port-alpha/42174; it has been noted by GNATS.

From: "Michael L. Hitch" <mhitch@...>
To: gnats-bugs@...
Cc: dmarquess@...
Subject: Re: port-alpha/42174
Date: Mon, 2 Nov 2009 11:22:33 -0700 (MST)

 On Mon, 2 Nov 2009, Dustin Marquess wrote:
 
 > I just hit:
 >
 > panic: kernel diagnostic assertion "pcg->pcg_avail == 0" failed: file
 > "../../../../kern/subr_pool.c", line 2580
 
    Hmm, I wonder if this might be related to some of the pool cache
 corruption I've seen at times.
 
 > Not sure if that's related to DIAGNOSTIC or LOCKDEBUG.  If it's
 > related to DIAGNOSTIC I'll recompile with that disabled for now.
 
    The KASSERT() is enabled by DIAGNOSTIC, which I now see isn't enabled
 in my kernel and would explain why I haven't seen this happen.  I
 would presume that the condition being tested shouldn't have happened,
 and ignoring it by turning off DIAGNOSTIC would likely just result in
 problems later.
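
Why turning off DIAGNOSTIC only hides the problem can be shown with a small sketch of a compile-time assertion macro. This is a simplified stand-in, not the kernel's actual `KASSERT()` definition; `kassert_fail()` and `put_group()` are hypothetical names, and `put_group()` merely mimics the kind of check that fired in `pool_cache_put_slow()`.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical failure hook; prints a message in the style of the panic. */
void kassert_fail(const char *expr, const char *file, int line)
{
    fprintf(stderr,
            "panic: kernel diagnostic assertion \"%s\" failed: "
            "file \"%s\", line %d\n", expr, file, line);
    abort();
}

#ifdef DIAGNOSTIC
/* Checked build: evaluate the condition and stop if it is false. */
#define KASSERT(e) ((e) ? (void)0 : kassert_fail(#e, __FILE__, __LINE__))
#else
/* Unchecked build: the check compiles away entirely. */
#define KASSERT(e) ((void)0)
#endif

/* Hypothetical stand-in for code that expects an empty group. */
int put_group(int pcg_avail)
{
    KASSERT(pcg_avail == 0);
    /* Without DIAGNOSTIC a bad value is simply used from here on. */
    return pcg_avail;
}
```

Built with `-DDIAGNOSTIC`, `put_group(3)` aborts at the assertion; built without it, the same call returns 3 and the inconsistent state carries on unnoticed.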
 
    There are a couple of things you can try that essentially disable the
 use of pool cache entries to shoot down TLBs.  One is a variable
 you can patch on boot or change in the source to disable all the pool
 caching.  The variable is pool_cache_disable in sys/kern/subr_pool.c
 about line 182.  This can also be set temporarily on boot by using
 the -d flag to enter the debugger on startup and setting it with the
 command "write pool_cache_disable 1" and continuing.
 
    The other thing is to change the maximum number of TLB entries to
 shoot down from 6 to 0 in sys/arch/alpha/alpha/pmap.c, about line 433.
 This will force the TLB shootdown routine to always invalidate all TLB
 entries and never attempt to use pool cache entries.
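
The effect of lowering that limit can be sketched as follows. This is a simplified model rather than the code in pmap.c: `SHOOTDOWN_MAXJOBS`, `struct shootdown_queue` and `queue_shootdown()` are hypothetical names, and the real routine tracks more state than this.

```c
#include <stdbool.h>

#define SHOOTDOWN_MAXJOBS 6     /* hypothetical name for the limit of 6 */

struct shootdown_queue {
    int  count;                 /* per-address jobs currently queued */
    bool flush_all;             /* target CPU must invalidate its whole TLB */
};

/*
 * Queue one per-address invalidation for a target CPU.
 * Returns true if an individual job was queued (which is the case that
 * would need a job structure from the pool cache), or false if the
 * request was folded into a full TLB flush instead.
 */
bool queue_shootdown(struct shootdown_queue *q, int maxjobs)
{
    if (q->flush_all)
        return false;           /* a full flush is already pending */

    if (q->count >= maxjobs) {
        q->flush_all = true;    /* too many jobs: flush everything */
        q->count = 0;           /* individual jobs are now redundant */
        return false;
    }

    q->count++;
    return true;
}
```

With a limit of 6, the first six requests are queued individually and the seventh switches to a full flush. With a limit of 0, the very first request already takes the full-flush path, so no job is ever allocated.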
 
 --
 Michael L. Hitch mhitch@...
 Computer Consultant
 Information Technology Center
 Montana State University Bozeman, MT USA
 

The following reply was made to PR port-alpha/42174; it has been noted by GNATS.

From: Dustin Marquess <dmarquess@...>
To: "Michael L. Hitch" <mhitch@...>
Cc: gnats-bugs@...
Subject: Re: port-alpha/42174
Date: Tue, 3 Nov 2009 20:35:52 -0600

 Okay, got it back up and running and within minutes hit:
 
 Mutex error: mutex_vector_enter: spinout
 
 lock address : 0xfffffc0000c85fe8 type     :               spin
 initialized  : 0xfffffc00006eb9b8
 shared holds :                  0 exclusive:                  0
 shares wanted:                  0 exclusive:                  3
 current cpu  :                  0 last held:                  1
 current lwp  : 0xfffffc007f69d800 last held: 000000000000000000
 last locked  : 0xfffffc00006eccc8 unlocked : 0xfffffc00006ece90
 owner field  : 0x0000000000000400 wait/spin:                0/1
 
 panic: LOCKDEBUG
 Stopped in pid 0.36 (system) at netbsd:cpu_Debugger+0x4:        ret     zero,(ra)
 db{0}> bt
 cpu_Debugger() at netbsd:cpu_Debugger+0x4
 panic() at netbsd:panic+0x244
 lockdebug_abort1() at netbsd:lockdebug_abort1+0x150
 mutex_abort() at netbsd:mutex_abort+0x34
 mutex_vector_enter() at netbsd:mutex_vector_enter+0x324
 pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x198
 pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x1b0
 pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x174
 alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
 interrupt() at netbsd:interrupt+0x84
 XentInt() at netbsd:XentInt+0x1c
 --- interrupt (from ipl 4) ---
 lockdebug_wantlock() at netbsd:lockdebug_wantlock+0x218
 mutex_tryenter() at netbsd:mutex_tryenter+0x274
 pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x78
 pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x26c
 pmap_tlb_shootdown() at netbsd:pmap_tlb_shootdown+0x248
 pmap_kremove() at netbsd:pmap_kremove+0x124
 uvm_pagermapout() at netbsd:uvm_pagermapout+0x40
 uvm_aio_aiodone() at netbsd:uvm_aio_aiodone+0xf4
 
 I'm still in ddb if I need to do anything else.
 

The following reply was made to PR port-alpha/42174; it has been noted by GNATS.

From: "Michael L. Hitch" <mhitch@...>
To: gnats-bugs@...
Cc: dmarquess@...
Subject: Re: port-alpha/42174
Date: Thu, 5 Nov 2009 11:59:39 -0700 (MST)

 On Wed, 4 Nov 2009, Dustin Marquess wrote:
 
 > panic: LOCKDEBUG
 > Stopped in pid 0.36 (system) at netbsd:cpu_Debugger+0x4:        ret     zero,(ra)
 > db{0}> bt
 > cpu_Debugger() at netbsd:cpu_Debugger+0x4
 > panic() at netbsd:panic+0x244
 > lockdebug_abort1() at netbsd:lockdebug_abort1+0x150
 > mutex_abort() at netbsd:mutex_abort+0x34
 > mutex_vector_enter() at netbsd:mutex_vector_enter+0x324
 > pool_cache_put_slow() at netbsd:pool_cache_put_slow+0x198
 > pool_cache_put_paddr() at netbsd:pool_cache_put_paddr+0x1b0
 > pmap_do_tlb_shootdown() at netbsd:pmap_do_tlb_shootdown+0x174
 > alpha_ipi_process() at netbsd:alpha_ipi_process+0xb8
 > interrupt() at netbsd:interrupt+0x84
 > XentInt() at netbsd:XentInt+0x1c
 > --- interrupt (from ipl 4) ---
 > lockdebug_wantlock() at netbsd:lockdebug_wantlock+0x218
 > mutex_tryenter() at netbsd:mutex_tryenter+0x274
 > pool_cache_get_slow() at netbsd:pool_cache_get_slow+0x78
 > pool_cache_get_paddr() at netbsd:pool_cache_get_paddr+0x26c
 > pmap_tlb_shootdown() at netbsd:pmap_tlb_shootdown+0x248
 > pmap_kremove() at netbsd:pmap_kremove+0x124
 > uvm_pagermapout() at netbsd:uvm_pagermapout+0x40
 > uvm_aio_aiodone() at netbsd:uvm_aio_aiodone+0xf4
 >
 > I'm still in ddb if I need to do anything else.
 
    It confirms one of the deadlocks I suspected.  I should soon have a
 patch you can try out.
 
 --
 Michael L. Hitch mhitch@...
 Computer Consultant
 Information Technology Center
 Montana State University Bozeman, MT USA
 

The following reply was made to PR port-alpha/42174; it has been noted by GNATS.

From: "Michael L. Hitch" <mhitch@...>
To: gnats-bugs@...
Cc: dmarquess@...
Subject: Re: port-alpha/42174
Date: Sun, 8 Nov 2009 21:36:36 -0700 (MST)

 On Thu, 5 Nov 2009, Michael L. Hitch wrote:
 
 > > I'm still in ddb if I need to do anything else.
 >
 >    It confirms one of the deadlocks I suspected.  I should soon have a
 > patch you can try out.
 
    I've committed a fix to current that should take care of this.
 
 Index: sys/arch/alpha/alpha/pmap.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/alpha/alpha/pmap.c,v
 retrieving revision 1.249
 diff -u -p -r1.249 pmap.c
 --- sys/arch/alpha/alpha/pmap.c 7 Nov 2009 07:27:40 -0000       1.249
 +++ sys/arch/alpha/alpha/pmap.c 9 Nov 2009 04:21:12 -0000
 @@ -3581,6 +3581,7 @@ pmap_tlb_shootdown(pmap_t pmap, vaddr_t
                  cpumask |= 1UL << ci->ci_cpuid;
 
                  pq = &pmap_tlb_shootdown_q[ci->ci_cpuid];
 +               mutex_spin_enter(&pq->pq_lock);
 
                  /*
                   * Allocate a job.
 @@ -3596,7 +3597,6 @@ pmap_tlb_shootdown(pmap_t pmap, vaddr_t
                   * If a global flush is already pending, we
                   * don't really have to do anything else.
                   */
 -               mutex_spin_enter(&pq->pq_lock);
                  pq->pq_pte |= pte;
                  if (pq->pq_tbia) {
                          mutex_spin_exit(&pq->pq_lock);
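
The thread does not spell out the exact interleaving, but the backtrace shows the shootdown IPI handler entering the pool cache while the interrupted code was itself inside a pool cache allocation, and the patch moves the spin lock ahead of that allocation. The sketch below is a deliberately simplified single-CPU model of that hazard. It assumes that taking the spin mutex raises the interrupt priority level far enough to hold off the IPI; the two-level IPL model and every name in it (`struct cpu`, `ipi_handler()`, `alloc_job()`, `shootdown()`) are hypothetical.

```c
#include <stdbool.h>

enum { IPL_NONE = 0, IPL_HIGH = 1 };    /* hypothetical two-level model */

struct cpu {
    int  ipl;           /* current interrupt priority level */
    bool cache_locked;  /* pool cache lock held by this CPU */
    bool ipi_pending;   /* a shootdown IPI is waiting to be delivered */
    bool deadlocked;    /* handler needed a lock that is already held */
};

/* IPI handler: returns a job to the cache, which needs the cache lock. */
static void ipi_handler(struct cpu *c)
{
    if (c->cache_locked) {      /* real code would spin here forever */
        c->deadlocked = true;
        return;
    }
    c->cache_locked = true;     /* put the job back */
    c->cache_locked = false;
}

/* Deliver a pending IPI only if the current IPL allows it. */
static void maybe_deliver(struct cpu *c)
{
    if (c->ipi_pending && c->ipl < IPL_HIGH) {
        c->ipi_pending = false;
        ipi_handler(c);
    }
}

/* Allocate a job; an IPI arrives while the cache lock is held. */
static void alloc_job(struct cpu *c)
{
    c->cache_locked = true;
    c->ipi_pending = true;      /* another CPU sends a shootdown now */
    maybe_deliver(c);
    c->cache_locked = false;
}

/* Run one shootdown request; returns true if it deadlocked. */
bool shootdown(bool lock_before_alloc)
{
    struct cpu c = { IPL_NONE, false, false, false };

    if (lock_before_alloc)
        c.ipl = IPL_HIGH;       /* models mutex_spin_enter() first */
    alloc_job(&c);
    if (!lock_before_alloc)
        c.ipl = IPL_HIGH;       /* old order: lock taken after allocating */

    c.ipl = IPL_NONE;           /* models mutex_spin_exit() */
    maybe_deliver(&c);          /* a deferred IPI now runs safely */
    return c.deadlocked;
}
```

In this model `shootdown(false)` (lock taken after the allocation) reports a deadlock, while `shootdown(true)` (lock taken first, as in the patch) defers the IPI until the allocation is finished and completes normally.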
 
 
 --
 Michael L. Hitch mhitch@...
 Computer Consultant
 Information Technology Center
 Montana State University Bozeman, MT USA
 

The following reply was made to PR port-alpha/42174; it has been noted by GNATS.

From: Dustin Marquess <dmarquess@...>
To: "Michael L. Hitch" <mhitch@...>
Cc: gnats-bugs@...
Subject: Re: port-alpha/42174
Date: Tue, 10 Nov 2009 09:54:18 -0600

 I applied Michael's patch along w/ Jason Thorpe's L1 patch to a fresh
 netbsd-5 checkout.  Recompiled w/ DEBUG/DIAGNOSTIC/LOCKDEBUG.  So far
 the new kernel has been banged on for 24 hours w/o even a slight
 hiccup.  Before the patches, it was lucky if it made it 5 minutes w/
 DIAGNOSTIC/LOCKDEBUG.
 
 So while it hasn't had in-depth testing, it is definitely a *LOT* more
 stable than what it was, so hopefully this PR can be resolved now.
 
 On a side note, the disk issue I was having seems to be unrelated.  It
 appears that something (not sure if disk itself or controller) would
 get in some kind of weird state.  That would cause problems and make it
 appear that the disk had no label.  I swapped the drive from the
 Adaptec 29160 that I had installed to make Tru64 happy back to the
 internal 53c1010, and so far the problem hasn't reoccurred.  So it
 seems to be completely unrelated, sorry!