pbulk hang in 5.99.21

pbulk hang in 5.99.21

by wiz-2

Hi!

I've just upgraded my 5.99.21 from Oct 22 to Nov 8 (kernel and
userland, packages not rebuilt), and now a pbulk (set up in a tmpfs)
hangs during the scan phase. Output including some ctrl-t:

Scanning...
................................load: 1.00  cmd: make 29443 [layerfs] 0.13u 0.02s 0% 2196k
make: Working in: /usr/pkgsrc/chat/finch
make: Working in: /usr/pkgsrc/chat/libpurple
make: Working in: /usr/pkgsrc/chat/finch
make: Working in: /usr/pkgsrc/net/avahi
make: Working in: /usr/pkgsrc/x11/gtk2
load: 1.00  cmd: make 29443 [layerfs] 0.13u 0.02s 0% 2196k
make: Working in: /usr/pkgsrc/chat/libpurple
make: Working in: /usr/pkgsrc/chat/finch
make: Working in: /usr/pkgsrc/chat/finch
make: Working in: /usr/pkgsrc/net/avahi
make: Working in: /usr/pkgsrc/x11/gtk2
load: 1.08  cmd: make 29443 [layerfs] 0.14u 0.02s 0% 2196k
make: Working in: /usr/pkgsrc/chat/libpurple
make: Working in: /usr/pkgsrc/x11/gtk2
make: Working in: /usr/pkgsrc/chat/finch
make: Working in: /usr/pkgsrc/net/avahi
make: Working in: /usr/pkgsrc/chat/finch

Second try:
^C
# /usr/pkg_bulk/bin/bulkbuild
Warning: All log files of the previous pbulk run will be
removed in 5 seconds. If you want to abort, press Ctrl-C.
Scanning...
........load: 1.06  cmd: make 27628 [tstile] 0.00u 0.00s 0% 1360k
load: 1.06  cmd: make 27628 [tstile] 0.00u 0.00s 0% 1360k
load: 1.03  cmd: make 27628 [tstile] 0.00u 0.00s 0% 1360k

Any ideas?
 Thomas

Re: pbulk hang in 5.99.21

by wiz-2

On Sun, Nov 08, 2009 at 11:38:15AM +0100, Thomas Klausner wrote:
> I've just upgraded my 5.99.21 from Oct 22 to Nov 8 (kernel and
> userland, packages not rebuilt), and now a pbulk (set up in a tmpfs)
> hangs during the scan phase. Output including some ctrl-t:

Still happens with today's 5.99.22.
# /usr/pkg_bulk/bin/bulkbuild
Scanning...
........ .......................................... 50/386
.................................................. 100/386
................ .................................. 150/386
.. load: 1.00  cmd: sh 28152 [wait] 0.00u 0.00s 0% 808k
make: Working in: /usr/pkgsrc/misc/tellico
make: Working in: /usr/pkgsrc/misc/tellico

It has been stuck there for minutes, and the machine is idle.

In case it's a file system locking problem, here's the relevant mount
information:
tmpfs on /home/wiz/sandbox type tmpfs (local)
/bin on /home/wiz/sandbox/bin type null (read-only, local)
/sbin on /home/wiz/sandbox/sbin type null (read-only, local)
/lib on /home/wiz/sandbox/lib type null (read-only, local)
/libexec on /home/wiz/sandbox/libexec type null (read-only, local)
/usr/X11R7 on /home/wiz/sandbox/usr/X11R7 type null (read-only, local)
/usr/bin on /home/wiz/sandbox/usr/bin type null (read-only, local)
/usr/games on /home/wiz/sandbox/usr/games type null (read-only, local)
/usr/include on /home/wiz/sandbox/usr/include type null (read-only, local)
/usr/lib on /home/wiz/sandbox/usr/lib type null (read-only, local)
/usr/libdata on /home/wiz/sandbox/usr/libdata type null (read-only, local)
/usr/libexec on /home/wiz/sandbox/usr/libexec type null (read-only, local)
/usr/share on /home/wiz/sandbox/usr/share type null (read-only, local)
/usr/sbin on /home/wiz/sandbox/usr/sbin type null (read-only, local)
/var/mail on /home/wiz/sandbox/var/mail type null (read-only, local)
/archive/cvs/src on /home/wiz/sandbox/usr/src type null (read-only, local)
/archive/cvs/pkgsrc on /home/wiz/sandbox/usr/pkgsrc type null (local)
/archive/cvs/xsrc on /home/wiz/sandbox/usr/xsrc type null (read-only, local)
/disk/1/archive/packages/5.99.22 on /home/wiz/sandbox/packages type null (local)
/disk/1/archive/distfiles on /home/wiz/sandbox/distfiles type null (local)

 Thomas

Re: pbulk hang in 5.99.21

by wiz-2

Another data point:
I've just tried removing the sandbox, and umount hangs:
# ./tmpfs-sandbox umount
load: 1.00  cmd: sh 26021 [wait] 0.00u 0.00s 0% 1292k

mount now says:
tmpfs on /home/wiz/sandbox type tmpfs (local)
/bin on /home/wiz/sandbox/bin type null (read-only, local)
/sbin on /home/wiz/sandbox/sbin type null (read-only, local)
/lib on /home/wiz/sandbox/lib type null (read-only, local)
/libexec on /home/wiz/sandbox/libexec type null (read-only, local)
/usr/X11R7 on /home/wiz/sandbox/usr/X11R7 type null (read-only, local)
/usr/bin on /home/wiz/sandbox/usr/bin type null (read-only, local)
/usr/games on /home/wiz/sandbox/usr/games type null (read-only, local)
/usr/include on /home/wiz/sandbox/usr/include type null (read-only, local)
/usr/lib on /home/wiz/sandbox/usr/lib type null (read-only, local)
/usr/libdata on /home/wiz/sandbox/usr/libdata type null (read-only, local)
/usr/libexec on /home/wiz/sandbox/usr/libexec type null (read-only, local)
/usr/share on /home/wiz/sandbox/usr/share type null (read-only, local)
/usr/sbin on /home/wiz/sandbox/usr/sbin type null (read-only, local)
/var/mail on /home/wiz/sandbox/var/mail type null (read-only, local)
/archive/cvs/src on /home/wiz/sandbox/usr/src type null (read-only, local)
/archive/cvs/pkgsrc on /home/wiz/sandbox/usr/pkgsrc type null (local)
/archive/cvs/xsrc on /home/wiz/sandbox/usr/xsrc type null (read-only, local)
/disk/1/archive/packages/5.99.22 on /home/wiz/sandbox/packages type null (local)

 Thomas

Re: pbulk hang in 5.99.21

by enami tsugutomo-2

> On Sun, Nov 08, 2009 at 11:38:15AM +0100, Thomas Klausner wrote:
> > I've just upgraded my 5.99.21 from Oct 22 to Nov 8 (kernel and
> > userland, packages not rebuilt), and now a pbulk (set up in a tmpfs)
> > hangs during the scan phase. Output including some ctrl-t:
>
> Still happens with today's 5.99.22.

Here is a workaround I'm trying now.

enami.

Index: sys/kern/vfs_subr.c
===================================================================
RCS file: /cvsroot/src/sys/kern/vfs_subr.c,v
retrieving revision 1.386
diff -u -r1.386 vfs_subr.c
--- sys/kern/vfs_subr.c 5 Nov 2009 08:18:02 -0000 1.386
+++ sys/kern/vfs_subr.c 11 Nov 2009 06:02:33 -0000
@@ -1386,7 +1386,7 @@
 vrelel(vnode_t *vp, int flags)
 {
 	bool recycle, defer;
-	int error;
+	int error, islayer_vnode;
 
 	KASSERT(mutex_owned(&vp->v_interlock));
 	KASSERT((vp->v_iflag & VI_MARKER) == 0);
@@ -1425,6 +1425,7 @@
 		 * XXX This ugly block can be largely eliminated if
 		 * locking is pushed down into the file systems.
 		 */
+		islayer_vnode = (vp->v_iflag & VI_LAYER) != 0;
 		if (curlwp == uvm.pagedaemon_lwp) {
 			/* The pagedaemon can't wait around; defer. */
 			defer = true;
@@ -1432,13 +1433,18 @@
 			/* We have to try harder. */
 			vp->v_iflag &= ~VI_INACTREDO;
 			error = vn_lock(vp, LK_EXCLUSIVE | LK_INTERLOCK |
-			    LK_RETRY);
+			    (islayer_vnode ? LK_NOWAIT : LK_RETRY));
 			if (error != 0) {
-				/* XXX */
-				vpanic(vp, "vrele: unable to lock %p");
-			}
-			defer = false;
-		} else if ((vp->v_iflag & VI_LAYER) != 0) {
+				if (islayer_vnode) {
+					defer = true;
+					mutex_enter(&vp->v_interlock);
+				} else {
+					/* XXX */
+					vpanic(vp, "vrele: unable to lock %p");
+				}
+			} else
+				defer = false;
+		} else if (islayer_vnode) {
 			/*
 			 * Acquiring the stack's lock in vclean() even
 			 * for an honest vput/vrele is dangerous because

Re: pbulk hang in 5.99.21

by enami tsugutomo-2

> > Still happens with today's 5.99.22.
>
> Here is a workaround I'm trying now.

... if the symptom you saw is the same as what I saw (layer_node_find() is
trying to vget() a vnode while vrele_thread is trying to vn_lock() the
same vnode).
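
As a rough userland illustration of the workaround's idea (try the lock without sleeping, and defer the release if the lock is busy), here is a minimal sketch using POSIX mutexes. It is not kernel code: `toy_vnode`, `toy_vnode_init` and `toy_release` are hypothetical names, the mutex only stands in for the vnode lock shared between the layers, and the `deferred` flag stands in for putting the vnode back on the vrele list.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a vnode; this is not the kernel's struct vnode. */
struct toy_vnode {
	pthread_mutex_t lock;	/* stands in for the vnode lock shared by the layers */
	bool is_layer;		/* stands in for the VI_LAYER flag */
	bool deferred;		/* true when the release was put off for a later retry */
	bool inactivated;	/* true once the "inactive" work has been done */
};

static void
toy_vnode_init(struct toy_vnode *vp, bool is_layer)
{
	pthread_mutex_init(&vp->lock, NULL);
	vp->is_layer = is_layer;
	vp->deferred = false;
	vp->inactivated = false;
}

/*
 * Release path of the worker thread.
 * Returns true if the inactive work was done, false if it was deferred.
 */
static bool
toy_release(struct toy_vnode *vp)
{
	if (vp->is_layer) {
		/* Layered case: never sleep on the lock, back off instead. */
		if (pthread_mutex_trylock(&vp->lock) != 0) {
			vp->deferred = true;
			return false;
		}
	} else {
		/* Non-layered case: sleeping on the lock is acceptable. */
		int error = pthread_mutex_lock(&vp->lock);
		assert(error == 0);
		(void)error;
	}

	/* Lock held: do the work that needs it, then drop the lock. */
	vp->inactivated = true;
	vp->deferred = false;
	pthread_mutex_unlock(&vp->lock);
	return true;
}
```

If another path already holds `vp->lock`, `toy_release()` returns `false` at once for a layered vnode instead of sleeping, and a later call can finish the job once the lock is free. That is the behaviour the patch aims for with LK_NOWAIT plus `defer = true`.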

enami.

Re: pbulk hang in 5.99.21

by wiz-2

On Wed, Nov 11, 2009 at 03:04:49PM +0900, enami tsugutomo wrote:
> Here is a workaround I'm trying now.

With this workaround I haven't seen the problem again. Before, pbulk
usually stopped while scanning the first 200-300 packages. With the patch
it has now finished the scanning stage and started building. Thanks!
 Thomas

Re: pbulk hang in 5.99.21

by wiz-2

On Wed, Nov 11, 2009 at 03:04:49PM +0900, enami tsugutomo wrote:
> Here is a workaround I'm trying now.

Do you think bouyer's fix addresses this issue?

Module Name: src
Committed By: bouyer
Date: Sat Nov 28 10:10:18 UTC 2009

Modified Files:
        src/sys/kern: vfs_subr.c

Log Message:
Previous did cause a deadlock with layered FS: the vrele thread
can sleep on the vnode lock, while vget is sleeping on the
VI_INACTNOW flag (or the vget caller is looping on vget returning failure
because of the VI_INACTNOW flag). With layered FSes, the upper and lower
vnodes share the same lock, so the vget() caller above can be already
holding the vnode lock.

Fix by dropping VI_INACTNOW before sleeping on the vnode lock in
vrelel(), and check the ref count again once we have the lock. If the
vnode has more than one reference, don't VOP_INACTIVE it.
Fix PR kern/42318 and PR kern/42377
patch tested by Hisashi T Fujinaka, Joachim König, Stephen Borrill and
Matthias Scheler.


To generate a diff of this commit:
cvs rdiff -u -r1.391 -r1.392 src/sys/kern/vfs_subr.c

Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
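
The ordering described in the log message (no VI_INACTNOW-style flag held while sleeping on the vnode lock, then a second look at the reference count once the lock is held) can be sketched in userland C roughly as follows. This is an illustration under assumptions, not the committed change: `ref_vnode`, `ref_vnode_init`, `ref_vrelel` and the field names are hypothetical, and two POSIX mutexes stand in for the interlock and the vnode lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a vnode; this is not the kernel's struct vnode. */
struct ref_vnode {
	pthread_mutex_t interlock;	/* protects usecount and inactnow */
	pthread_mutex_t lock;		/* stands in for the (possibly shared) vnode lock */
	int usecount;			/* number of references */
	bool inactnow;			/* stands in for VI_INACTNOW */
	int inactive_calls;		/* how often the "VOP_INACTIVE" step ran */
};

static void
ref_vnode_init(struct ref_vnode *vp, int usecount)
{
	pthread_mutex_init(&vp->interlock, NULL);
	pthread_mutex_init(&vp->lock, NULL);
	vp->usecount = usecount;
	vp->inactnow = false;
	vp->inactive_calls = 0;
}

/*
 * Drop one reference. Returns true if the inactive step ran,
 * false if other references remained and it was skipped.
 */
static bool
ref_vrelel(struct ref_vnode *vp)
{
	pthread_mutex_lock(&vp->interlock);
	assert(vp->usecount >= 1);
	if (vp->usecount > 1) {
		/* Not the last reference: just drop ours. */
		vp->usecount--;
		pthread_mutex_unlock(&vp->interlock);
		return false;
	}
	/* inactnow is deliberately NOT set while we may sleep on the lock. */
	pthread_mutex_unlock(&vp->interlock);

	pthread_mutex_lock(&vp->lock);		/* may sleep */

	pthread_mutex_lock(&vp->interlock);
	if (vp->usecount > 1) {
		/* Someone took a reference while we slept: skip the inactive step. */
		vp->usecount--;
		pthread_mutex_unlock(&vp->interlock);
		pthread_mutex_unlock(&vp->lock);
		return false;
	}
	vp->inactnow = true;			/* set only once the lock is held */
	pthread_mutex_unlock(&vp->interlock);

	vp->inactive_calls++;			/* the "VOP_INACTIVE" step */

	pthread_mutex_lock(&vp->interlock);
	vp->inactnow = false;
	vp->usecount--;
	pthread_mutex_unlock(&vp->interlock);
	pthread_mutex_unlock(&vp->lock);
	return true;
}
```

The point of the ordering is that a thread which already holds `vp->lock` and wants a new reference never finds `inactnow` set by a releaser that is itself waiting for `vp->lock`.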


 Thomas


Re: pbulk hang in 5.99.21

by enami tsugutomo-2

Thomas Klausner <wiz@...> writes:

> On Wed, Nov 11, 2009 at 03:04:49PM +0900, enami tsugutomo wrote:
> > Here is a workaround I'm trying now.
>
> Do you think bouyer's fix addresses this issue?
>
> Module Name: src
> Committed By: bouyer
> Date: Sat Nov 28 10:10:18 UTC 2009
>
> Modified Files:
> src/sys/kern: vfs_subr.c
>
> Log Message:
> Previous did cause a deadlock with layered FS: the vrele thread
> can sleep on the vnode lock, while vget is sleeping on the
> VI_INACTNOW flag (or the vget caller is looping on vget returning failure
> because of the VI_INACTNOW flag). With layered FSes, the upper and lower
> vnodes share the same lock, so the vget() caller above can be already
> holding the vnode lock.
>
> Fix by dropping VI_INACTNOW before sleeping on the vnode lock in
> vrelel(), and check the ref count again once we have the lock. If the
> vnode has more than one reference, don't VOP_INACTIVE it.
> Fix PR kern/42318 and PR kern/42377
> patch tested by Hisashi T Fujinaka, Joachim König, Stephen Borrill and
> Matthias Scheler.

Yes, almost same effect.  Didn't work for you?

BTW, I guess the vrele thread itself has some flaw.  We probably need
more worker threads or a way to throttle vnode allocation (each
process is now free to do more of its own work rather than waiting for
I/O completion as before).

enami.


Re: pbulk hang in 5.99.21

by wiz-2

On Wed, Dec 02, 2009 at 03:46:46PM +0900, enami tsugutomo wrote:
> Yes, almost same effect.  Didn't work for you?

Good. Seems to work fine for me as well so far (bulk build processed
more than 1000 packages). Thanks!
 Thomas