LFS stability: kern/36608

by Sverre Froyen

Hi,

Given the recent discussions on LFS stability, I'd like to point out PR
kern/36608, which I can still trigger more or less at will on today's
-current.

From my (very incomplete) understanding of the LFS code, it looks like
lfs_vnops.c locks a vnode (simple lock) and then calls ltsleep on an lfs
struct.  Then, before ltsleep returns, something else (exactly what seems to
vary) comes along and locks the same vnode.  This looks suspiciously like a
locking bug to me.
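
The sequence described above can be modelled with a toy script.  This is only
a sketch of the general pattern (a non-recursive lock held across a sleep),
not the NetBSD code: the class DebugLock and both functions are hypothetical
names made up for illustration, and the "sleep" is modelled simply as the
point where another party gets to run.

```python
class DebugLock:
    """Toy non-recursive lock that remembers its holder.

    Hypothetical stand-in for a debug-checked simple lock; it is not
    the kernel implementation and only models the ordering problem.
    """

    def __init__(self, name):
        self.name = name
        self.holder = None

    def acquire(self, who):
        # Complain if anyone tries to take a lock that is already held.
        if self.holder is not None:
            raise RuntimeError(
                "%s: %s tried to lock while held by %s"
                % (self.name, who, self.holder)
            )
        self.holder = who

    def release(self, who):
        if self.holder != who:
            raise RuntimeError("%s: %s does not hold the lock" % (self.name, who))
        self.holder = None


def sleep_holding_lock(lock, sleeper, intruder):
    """Sleeper locks, then 'sleeps'; the intruder locks before the wakeup."""
    lock.acquire(sleeper)
    try:
        # The sleeper is asleep here, so something else runs and wants the lock.
        lock.acquire(intruder)
        lock.release(intruder)
        return None
    except RuntimeError as err:
        return str(err)
    finally:
        lock.release(sleeper)


def sleep_after_unlock(lock, sleeper, intruder):
    """Same sequence, but the sleeper drops the lock before sleeping."""
    lock.acquire(sleeper)
    lock.release(sleeper)
    try:
        lock.acquire(intruder)
        lock.release(intruder)
        return None
    except RuntimeError as err:
        return str(err)


if __name__ == "__main__":
    vnode_lock = DebugLock("vnode")
    print(sleep_holding_lock(vnode_lock, "sleeper", "other"))
    print(sleep_after_unlock(vnode_lock, "sleeper", "other"))
```

In this model the first call reports a conflict and the second does not, which
matches the observation that the check only fires when the lock is still held
at the time of the sleep.  It says nothing about which behaviour the real code
intends.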

I can "fix" the bug by (1) removing the call to lock the vnode or by (2)
turning off LOCKDEBUG.  I suspect, however, that either solution simply masks
the problem.  Incidentally, implementing (1) reverts the code to what it
looked like before mid April.

Sverre

Re: LFS stability: kern/36608

by Blair Sadewitz-2

I suspect it masks the problem.  Try this out:

set vfs.lfs.pagetrip to something sane, such as ssize/4096/4.  Each
time you decrease it (by powers of two), try untarring pkgsrc (this is
on a multiprocessor machine, since we're talking about locking).  The
last decrement in pagetrip should be ssize/PAGE_SIZE.  Notice how, if
you get deadlocks, etc., they seem to happen more often as you go
lower.  Now, try setting pagetrip to something one or two powers of
two lower than ssize/PAGE_SIZE, such as 64.  Now notice how quickly it
locks!

I suspected this has to do with the lfs writer daemon ltsleeping,
given the time it sleeps is fixed.  Haven't gotten a chance to look at
it more, though.

Any thoughts?

Regards,

--Blair

Re: LFS stability: kern/36608

by Sverre Froyen

On Thursday 13 September 2007, you wrote:

> I suspect it masks the problem.  Try this out:
>
> set vfs.lfs.pagetrip to something sane, such as ssize/4096/4.  Each
> time you decrease it (by powers of two), try untarring pkgsrc (this is
> on a multiprocessor machine, since we're talking about locking).  The
> last decrement in pagetrip should be ssize/PAGE_SIZE.  Notice how, if
> you get deadlocks, etc., they seem to happen more often as you go
> lower.  Now, try setting pagetrip to something one or two powers of
> two lower than ssize/PAGE_SIZE, such as 64.  Now notice how quickly it
> locks!

OK, dumpfs_lfs reports that ssize = 1048576 and sysctl says hw.pagesize =
4096. Thus, ssize/4096/4 = 64 and ssize/PAGE_SIZE = 256, so I set
vfs.lfs.pagetrip=64 (it was 0).
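
The arithmetic above can be double-checked with a short script.  This is a
sketch only: the helper names are hypothetical, ssize is assumed to be the
segment size in bytes, and the two input values are the ones reported by
dumpfs_lfs and sysctl above.

```python
def pages_per_segment(ssize, page_size):
    """Number of whole pages in one segment (ssize/PAGE_SIZE)."""
    return ssize // page_size


def pagetrip_candidates(ssize, page_size, lowest=64):
    """Powers-of-two steps from ssize/PAGE_SIZE down to `lowest`.

    `lowest` defaults to 64, the value suggested in the quoted message;
    the stepping scheme itself is an assumption for illustration.
    """
    value = pages_per_segment(ssize, page_size)
    steps = []
    while value >= lowest:
        steps.append(value)
        value //= 2
    return steps


if __name__ == "__main__":
    ssize = 1048576    # ssize as reported by dumpfs_lfs
    page_size = 4096   # hw.pagesize as reported by sysctl
    print("ssize/PAGE_SIZE =", pages_per_segment(ssize, page_size))
    print("ssize/4096/4    =", pages_per_segment(ssize, 4096) // 4)
    for value in pagetrip_candidates(ssize, page_size):
        # Print the commands rather than running them.
        print("sysctl -w vfs.lfs.pagetrip=%d" % value)
```

The script only prints the sysctl commands, so each value can be set by hand
between test runs.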

Running my bogofilter test case (see kern/36608) no longer triggers the
LOCKDEBUG assertion.  I will resume regular use of bogofilter in my email
client in order to test more thoroughly.

BTW, my system is a ThinkPad T42, i.e., a uniprocessor system.

> I suspected this has to do with the lfs writer daemon ltsleeping,
> given the time it sleeps is fixed.  Haven't gotten a chance to look at
> it more, though.

Do you think it is the lfs_writer daemon that makes the call to lfs_segunlock
around line 2290 in lfs_vnops.c (see the PR)?  If so, your suspicion sounds
reasonable.  How can I verify this?

> Any thoughts?

I still think the LFS code (see the PR) looks questionable but I've not gotten
any feedback on my comments.

Thanks for your help!

Sverre