FTS not ready for a remount during traversal

View: New views
13 Messages — Rating Filter:   Alert me  

FTS not ready for a remount during traversal

by Kamil Dudka :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hello,

the FTS module does not seem to be ready for a remount during traversal.
As a minimal example we can take the following situation:

https://bugzilla.redhat.com/show_bug.cgi?id=501848#c20

Descending a directory triggers a shrinkable mount which implies a change
of the device number. However FTS data structures keep the old value of
device number, not reflecting this change. And later, when returning back
to the same directory through "..", fts_safe_changedir() fails for no good
reason (caused by std_dev mismatch).

The attached patch seems to solve the problem by calling fstatat() again
after opening a directory (and updating the device number). The patch is
not supposed to be applied as-is. It's more likely an idea how the problem
could be solved. Maybe something like that might be available via an option
to keep the original behavior unchanged (and eventually gain better
performance in some cases).

Another approach is a workaround proposed by Ondra:

https://bugzilla.redhat.com/attachment.cgi?id=354464

Any idea how to solve the problem? Thanks in advance!

Kamil

[fts.patch]

diff --git a/lib/fts.c b/lib/fts.c
index 40a837e..1394693 100644
--- a/lib/fts.c
+++ b/lib/fts.c
@@ -1269,6 +1269,21 @@ fts_build (register FTS *sp, int type)
        if (cur->fts_info == FTS_NSOK)
  cur->fts_info = fts_stat(sp, cur, false);
 
+       else /* TODO: check for an option here */ {
+ FTSENT tmp = *cur;
+ fts_stat (sp, &tmp, false);
+ if (cur->fts_statp->st_ino == tmp.fts_statp->st_ino
+     && cur->fts_statp->st_dev != tmp.fts_statp->st_dev) {
+   LEAVE_DIR (sp, cur, "4");
+   cur->fts_statp->st_dev = tmp.fts_statp->st_dev;
+   if (! enter_dir (sp, cur))
+   {
+     __set_errno (ENOMEM);
+     return NULL;
+   }
+ }
+       }
+
  /*
  * Nlinks is the number of possible entries of type directory in the
  * directory if we're cheating on stat calls, 0 if we're not doing


Re: FTS not ready for a remount during traversal

by Jim Meyering :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Kamil Dudka wrote:
> the FTS module does not seem to be ready for a remount during traversal.
> As a minimal example we can take the following situation:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=501848#c20

Hi Kamil,

Thanks for the heads up.

However, ...

> Descending a directory triggers a shrinkable mount which implies a change
> of the device number. However FTS data structures keep the old value of
> device number, not reflecting this change. And later, when returning back
> to the same directory through "..", fts_safe_changedir() fails for no good
> reason (caused by std_dev mismatch).

It fails for a very good reason.
A changed device number may indicate an attempt to subvert the traversal,
even though the inode number is the same.

> The attached patch seems to solve the problem by calling fstatat() again
> after opening a directory (and updating the device number). The patch is
> not supposed to be applied as-is. It's more likely an idea how the problem
> could be solved. Maybe something like that might be available via an option
> to keep the original behavior unchanged (and eventually gain better
> performance in some cases).
>
> Another approach is a workaround proposed by Ondra:
>
> https://bugzilla.redhat.com/attachment.cgi?id=354464
>
> Any idea how to solve the problem? Thanks in advance!

I don't see how we can justify any such change.
Being able to detect whether the traversal returns to a previously
visited directory is required for security and reliability.  Weakening
that device/inode equality check by removing the device comparison part
would leave every fts-using tool open to a particularly subtle -- but
nonetheless serious -- type of attack.

Even if you think you needn't worry about an attack (it's admittedly
far-fetched for an attacker to be able to mount/unmount directories
in a hierarchy being traversed) fts has to provide certain guarantees
to its callers, and cannot do so if the device number of a directory
being traversed may change between the pre- and post-visit contact.
Besides, relaxing this check would allow a user's improbable "mv"
command to change the course of their own in-progress rm -rf DIR,
making it remove all of $HOME, rather than just the intended $HOME/DIR.

In addition, I suspect that any file system for which a file's (or in
our case, a directory's) device number may spontaneously change is not
POSIX conforming.

To avoid this, perhaps the reporter can simply exclude the problematic
.snapshot directories:

    du --exclude=.snapshot

Jim



Re: FTS not ready for a remount during traversal

by Kamil Dudka :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi Jim,

thanks for quick reply!

On Wednesday 21 of October 2009 21:47:59 Jim Meyering wrote:
> It fails for a very good reason.
> A changed device number may indicate an attempt to subvert the traversal,
> even though the inode number is the same.

I agree.

> > https://bugzilla.redhat.com/attachment.cgi?id=354464
> >
> > Any idea how to solve the problem? Thanks in advance!
>
> I don't see how we can justify any such change.
> Being able to detect whether the traversal returns to a previously
> visited directory is required for security and reliability.  Weakening
> that device/inode equality check by removing the device comparison part
> would leave every fts-using tool open to a particularly subtle -- but
> nonetheless serious -- type of attack.

That's exactly why I've written another patch. The patch proposed by me does
not bypass the ino/dev check on the return. It only updates the stat data
right after opening a directory.

> Even if you think you needn't worry about an attack (it's admittedly
> far-fetched for an attacker to be able to mount/unmount directories
> in a hierarchy being traversed) fts has to provide certain guarantees
> to its callers, and cannot do so if the device number of a directory
> being traversed may change between the pre- and post-visit contact.
> Besides, relaxing this check would allow a user's improbable "mv"
> command to change the course of their own in-progress rm -rf DIR,
> making it remove all of $HOME, rather than just the intended $HOME/DIR.

As probably "completely safe" solution we can trigger the mount before
statting a directory. But this will seriously affect the performance and
moreover introduce a change in behavior in e.g. find -fstype.

> In addition, I suspect that any file system for which a file's (or in
> our case, a directory's) device number may spontaneously change is not
> POSIX conforming.

Well, we can tell to kernel guys that the Linux NFS implementation is not
POSIX conforming and ask them to fix it. But in reality there is not much
they can do with this.

The only known "possible" solution is to trigger the mount within kernel on
each stat call to obtain the correct stat data. This will however punish all
applications, not only the applications which actually need it.

It is fairly well explained by Jeff Layton here:

https://bugzilla.redhat.com/show_bug.cgi?id=501848#c30

> To avoid this, perhaps the reporter can simply exclude the problematic
> .snapshot directories:
>
>     du --exclude=.snapshot

Yes, but only in this specific case and only if we are not interested in
the space occupied by .snapshot directory.

Generally the problem is not specific to .snapshot directories at all. Two
nested NFS mount points are enough to raise the issue.

Kamil



Re: FTS not ready for a remount during traversal

by Jim Meyering :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Kamil Dudka wrote:
...

>> > Any idea how to solve the problem? Thanks in advance!
>>
>> I don't see how we can justify any such change.
>> Being able to detect whether the traversal returns to a previously
>> visited directory is required for security and reliability.  Weakening
>> that device/inode equality check by removing the device comparison part
>> would leave every fts-using tool open to a particularly subtle -- but
>> nonetheless serious -- type of attack.
>
> That's exactly why I've written another patch. The patch proposed by me does
> not bypass the ino/dev check on the return. It only updates the stat data
> right after opening a directory.

Oh!  I didn't look carefully enough at that one.
Have you measured the performance penalty it incurs?
I hope it is possible to do the same thing, but with less of a penalty.

I'm afraid we'll have to do something like that one way or
another.  At best, the impact will be so small that we won't
even have to provide a new "flag" to enable it.



Re: FTS not ready for a remount during traversal

by Kamil Dudka :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Thu October 22 2009 16:42:48 Jim Meyering wrote:
> Have you measured the performance penalty it incurs?
> I hope it is possible to do the same thing, but with less of a penalty.
>
> I'm afraid we'll have to do something like that one way or
> another.  At best, the impact will be so small that we won't
> even have to provide a new "flag" to enable it.

Here are some basic metrics (/var is ext3, /mnt/archive is NFSv3):

Before patch
============
# time du -sh /var
2.8G    /var

real    0m0.222s
user    0m0.038s
sys     0m0.186s

# time find /var > /dev/null

real    0m0.187s
user    0m0.068s
sys     0m0.118s

# time find /mnt/archive/fotky/ > /dev/null

real    0m0.089s
user    0m0.010s
sys     0m0.031s


After patch
===========
# time du -sh /var
2.8G    /var

real    0m0.220s
user    0m0.051s
sys     0m0.169s

# time find /var > /dev/null

real    0m0.197s
user    0m0.055s
sys     0m0.142s

# time find /mnt/archive/fotky/ > /dev/null

real    0m0.090s
user    0m0.007s
sys     0m0.030s


There is no measurable performance penalty, assuming you have more files than
directories in the tree (which is IMO the common use case). I am attaching
also a filtered strace output showing the cost of the second stat call:

# strace -rve trace=openat,newfstatat -o trace.raw du -sh /var
# grep -A1 -B1 openat trace.raw | lzma -c > trace.lzma

Kamil


trace.lzma (183K) Download Attachment

Re: FTS not ready for a remount during traversal

by Jim Meyering :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Kamil Dudka wrote:
> On Thu October 22 2009 16:42:48 Jim Meyering wrote:
>> Have you measured the performance penalty it incurs?
>> I hope it is possible to do the same thing, but with less of a penalty.
>>
>> I'm afraid we'll have to do something like that one way or
>> another.  At best, the impact will be so small that we won't
>> even have to provide a new "flag" to enable it.
...
> There is no measurable performance penalty, assuming you have more files than
> directories in the tree (which is IMO the common use case). I am attaching
> also a filtered strace output showing the cost of the second stat call:
>
> # strace -rve trace=openat,newfstatat -o trace.raw du -sh /var
> # grep -A1 -B1 openat trace.raw | lzma -c > trace.lzma

Hi Kamil,

Sorry about the delay.
I'm trying to reconcile this file system's behavior with
fundamental assumptions about unchanging device/inode,
and as a result am having second thoughts.

What event causes the directory's stat.st_dev to change?
Opening the directory?  Calling readdir?  Accessing one
of its entries?

Can you describe how to reproduce the problem on rawhide or Fedora 12?

Jim



Re: FTS not ready for a remount during traversal

by Kamil Dudka :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi Jim,

On Sat October 31 2009 12:22:01 Jim Meyering wrote:
> I'm trying to reconcile this file system's behavior with
> fundamental assumptions about unchanging device/inode,
> and as a result am having second thoughts.
>
> What event causes the directory's stat.st_dev to change?
> Opening the directory?  Calling readdir?  Accessing one
> of its entries?
>
> Can you describe how to reproduce the problem on rawhide or Fedora 12?

all the info should be already placed at the RH bug (where you are also
subscribed):

https://bugzilla.redhat.com/501848

Let me know if there is anything missing. A new version of the patch is
attached. Any feedback welcome!

Kamil

[0001-fts-do-not-fail-on-a-submount-during-traversal.patch]

From 69fd52bd5c7c4f71629a9a756548bc02ef8905db Mon Sep 17 00:00:00 2001
From: Kamil Dudka <kdudka@...>
Date: Tue, 3 Nov 2009 14:14:53 +0100
Subject: [PATCH] fts: do not fail on a submount during traversal

* lib/fts.c (fts_build): Read the stat info again after opening
a directory.
Original report at http://bugzilla.redhat.com/501848.
---
 ChangeLog |    7 +++++++
 lib/fts.c |   11 +++++++++++
 2 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 239ec70..406749c 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,10 @@
+2009-11-03  Kamil Dudka  <kdudka@...>
+
+ fts: do not fail on a submount during traversal
+ * lib/fts.c (fts_build): Read the stat info again after opening
+ a directory.
+ Original report at http://bugzilla.redhat.com/501848.
+
 2009-11-03  Jim Meyering  <meyering@...>
 
  test-getaddrinfo: avoid compilation failure on FreeBSD 7.2
diff --git a/lib/fts.c b/lib/fts.c
index 40a837e..d0715cf 100644
--- a/lib/fts.c
+++ b/lib/fts.c
@@ -1268,6 +1268,17 @@ fts_build (register FTS *sp, int type)
   opening it.  */
        if (cur->fts_info == FTS_NSOK)
  cur->fts_info = fts_stat(sp, cur, false);
+       else {
+ /* Now read the stat info again after opening a directory to reveal
+  * eventual changes caused by a submount triggered by the traverse.  */
+ LEAVE_DIR (sp, cur, "4");
+ fts_stat (sp, cur, false);
+         if (! enter_dir (sp, cur))
+           {
+             __set_errno (ENOMEM);
+             return NULL;
+           }
+       }
 
  /*
  * Nlinks is the number of possible entries of type directory in the
--
1.6.2.5



Re: FTS not ready for a remount during traversal

by Jim Meyering :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Kamil Dudka wrote:

> On Sat October 31 2009 12:22:01 Jim Meyering wrote:
>> I'm trying to reconcile this file system's behavior with
>> fundamental assumptions about unchanging device/inode,
>> and as a result am having second thoughts.
>>
>> What event causes the directory's stat.st_dev to change?
>> Opening the directory?  Calling readdir?  Accessing one
>> of its entries?
>>
>> Can you describe how to reproduce the problem on rawhide or Fedora 12?
>
> all the info should be already placed at the RH bug (where you are also
> subscribed):
>
> https://bugzilla.redhat.com/501848
>
> Let me know if there is anything missing. A new version of the patch is
> attached. Any feedback welcome!

Hi Kamil,

Thanks for the new patch.
I've built with it and compared before/after performance
using an all-directories hierarchy on a tmpfs file system.

The result is a >10% performance decrease in this contrived worst case:
(Note: previously-built, unpatched version is in prev/,
just-patched is in ./)

  $ z-mkdir 200000; for i in 1 2 3; do for p in prev .; do echo "$p ___";
    env time $p/chgrp -R group2 $TMPDIR/z; done; done; z-rmdir
  prev ___
  1.52user 17.92system 0:19.46elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+14369minor)pagefaults 0swaps
  . ___
  1.71user 20.61system 0:22.34elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+14368minor)pagefaults 0swaps
  prev ___
  1.54user 17.92system 0:19.47elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+14369minor)pagefaults 0swaps
  . ___
  1.63user 20.84system 0:22.48elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+14369minor)pagefaults 0swaps
  prev ___
  1.64user 17.86system 0:19.52elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+14370minor)pagefaults 0swaps
  . ___
  1.70user 20.69system 0:22.40elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+14368minor)pagefaults 0swaps

I saw similar numbers when timing rm -rf and chmod -R.
Thus, I'm reluctant to impose such a penalty without
first exhausting all other possibilities.

Those z-* things are bash functions to create and remove deep
hierarchies:

z-mkdir () {
        local n
        case $# in
                (1) n=$1  ;;
                (*) echo "Usage: $FUNCNAME N" >&2
                        return 1 ;;
        esac
        command perl -w -e 'my $root = $ENV{TMPDIR} || "/tmp";
           chdir $root or die "cannot chdir to $root: $!\n";
           $i=1; do { mkdir "z",0700 or die "mkdir failed: depth $i: $!\n";
                      chdir "z" or die "chdir failed: depth $i: $!\n" }
                 while (++$i <= '$n')'
}
z-rmdir () {
        case $# in
                (0)  ;;
                (*) echo "Usage: $FUNCNAME" >&2
                        return 1 ;;
        esac
        local root=${TMPDIR=/tmp}
        command perl -w -e 'my $root = $ENV{TMPDIR} || "/tmp";
           chdir $root or die "cannot chdir to $root: $!\n";
           while (1) {chdir "z" or last};
           while (1) {chdir ".." && rmdir "z" or last}'
}



Re: FTS not ready for a remount during traversal

by Kamil Dudka :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi Jim,

On Wed November 4 2009 13:02:33 Jim Meyering wrote:
> I've built with it and compared before/after performance
> using an all-directories hierarchy on a tmpfs file system.
>
> The result is a >10% performance decrease in this contrived worst case:

thanks for stating the upper boundary estimation!

> (Note: previously-built, unpatched version is in prev/,
> just-patched is in ./)
>
>   $ z-mkdir 200000; for i in 1 2 3; do for p in prev .; do echo "$p ___";
>     env time $p/chgrp -R group2 $TMPDIR/z; done; done; z-rmdir
>   prev ___
>   1.52user 17.92system 0:19.46elapsed 99%CPU (0avgtext+0avgdata
>  0maxresident)k 0inputs+0outputs (0major+14369minor)pagefaults 0swaps
>   . ___
>   1.71user 20.61system 0:22.34elapsed 99%CPU (0avgtext+0avgdata
>  0maxresident)k 0inputs+0outputs (0major+14368minor)pagefaults 0swaps
>   prev ___
>   1.54user 17.92system 0:19.47elapsed 99%CPU (0avgtext+0avgdata
>  0maxresident)k 0inputs+0outputs (0major+14369minor)pagefaults 0swaps
>   . ___
>   1.63user 20.84system 0:22.48elapsed 99%CPU (0avgtext+0avgdata
>  0maxresident)k 0inputs+0outputs (0major+14369minor)pagefaults 0swaps
>   prev ___
>   1.64user 17.86system 0:19.52elapsed 99%CPU (0avgtext+0avgdata
>  0maxresident)k 0inputs+0outputs (0major+14370minor)pagefaults 0swaps
>   . ___
>   1.70user 20.69system 0:22.40elapsed 99%CPU (0avgtext+0avgdata
>  0maxresident)k 0inputs+0outputs (0major+14368minor)pagefaults 0swaps

I can confirm your results, just run your tests on an ext3 file system.

> I saw similar numbers when timing rm -rf and chmod -R.
> Thus, I'm reluctant to impose such a penalty without
> first exhausting all other possibilities.

Yes, it makes sense. However what other possibilities do we have actually?

Preferring higher performance in some rear cases over correct behavior
in more common cases is of course not a possibility for me.

Kamil



Re: FTS not ready for a remount during traversal

by Jim Meyering :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Kamil Dudka wrote:
...
>> I saw similar numbers when timing rm -rf and chmod -R.
>> Thus, I'm reluctant to impose such a penalty without
>> first exhausting all other possibilities.
>
> Yes, it makes sense. However what other possibilities do we have actually?

Change the kernel, or find a way to make fts do what we
want without the added cost.

> Preferring higher performance in some rear cases over correct behavior
> in more common cases is of course not a possibility for me.

If you are implying that I might make such a trade-off,
you have not been paying attention ;-)

BTW, in my experience, this is *not* a common case at all.

In fact, the patch we're considering impacts each and every
directory access in *all* hierarchy traversals, when only a very
small number of them actually need the work-around.



Re: FTS not ready for a remount during traversal

by Eric Blake :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Jim Meyering <jim <at> meyering.net> writes:

> > Yes, it makes sense. However what other possibilities do we have actually?
>
> Change the kernel, or find a way to make fts do what we
> want without the added cost.

We have already asked the kernel folks if they would be willing to make readdir
() report accurate d_ino values for mount points, and they complained that it
would be too expensive to guarantee correct d_ino information compared to the
number of clients that don't care whether it is correct.  But they did mention
the possibility of perhaps adding some sort of flag, or a new d_type, which
identifies mount points, and which the application can then use to make it
obvious that a stat() is necessary to get reliable inode information for _just
that directory_.

This seems like yet another case where convincing the kernel folks to GIVE us
this information would be helpful.  Just knowing whether a directory is a mount
point will give us the power to know whether we need to stat() it after
automounting, without penalizing the common case of a directory that is not a
mount point.  Since mount point semantics are squarely on the shoulders of the
kernel, please consider pushing harder for a fix to come from the kernel,
rather than a workaround in coreutils.

--
Eric Blake






Re: FTS not ready for a remount during traversal

by Kamil Dudka :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Wednesday 04 of November 2009 16:12:42 Eric Blake wrote:
> We have already asked the kernel folks if they would be willing to make
> readdir () report accurate d_ino values for mount points, and they
> complained that it would be too expensive to guarantee correct d_ino
> information compared to the number of clients that don't care whether it is
> correct.  But they did mention the possibility of perhaps adding some sort
> of flag, or a new d_type, which identifies mount points, and which the
> application can then use to make it obvious that a stat() is necessary to
> get reliable inode information for _just that directory_.

I completely agree that such a flag would be generally helpful. However we
need to get coreutils working with current kernels anyway.

> This seems like yet another case where convincing the kernel folks to GIVE
> us this information would be helpful.  Just knowing whether a directory is
> a mount point will give us the power to know whether we need to stat() it
> after automounting, without penalizing the common case of a directory that
> is not a mount point.  Since mount point semantics are squarely on the
> shoulders of the kernel, please consider pushing harder for a fix to come
> from the kernel, rather than a workaround in coreutils.

I would contend the patch is a workaround. From my point of view it's
a *bugfix* for coreutils (and other gnulib-based projects). If you think the
kernel's behavior is incorrect, please refer to a specification saying this.

My point is that there is in fact no performance impact caused by the patch
and what we do now is a premature optimization. The FTS module is simply
based on an assumption which is not guaranteed to hold in general.

Kamil



Re: FTS not ready for a remount during traversal

by Jim Meyering :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Kamil Dudka wrote:

> On Wednesday 04 of November 2009 16:12:42 Eric Blake wrote:
>> We have already asked the kernel folks if they would be willing to make
>> readdir () report accurate d_ino values for mount points, and they
>> complained that it would be too expensive to guarantee correct d_ino
>> information compared to the number of clients that don't care whether it is
>> correct.  But they did mention the possibility of perhaps adding some sort
>> of flag, or a new d_type, which identifies mount points, and which the
>> application can then use to make it obvious that a stat() is necessary to
>> get reliable inode information for _just that directory_.
>
> I completely agree that such a flag would be generally helpful. However we
> need to get coreutils working with current kernels anyway.
>
>> This seems like yet another case where convincing the kernel folks to GIVE
>> us this information would be helpful.  Just knowing whether a directory is
>> a mount point will give us the power to know whether we need to stat() it
>> after automounting, without penalizing the common case of a directory that
>> is not a mount point.  Since mount point semantics are squarely on the
>> shoulders of the kernel, please consider pushing harder for a fix to come
>> from the kernel, rather than a workaround in coreutils.
>
> I would contend the patch is a workaround. From my point of view it's
> a *bugfix* for coreutils (and other gnulib-based projects). If you think the
> kernel's behavior is incorrect, please refer to a specification saying this.

The part about distinct directories having identical device/inode
numbers is definitely a bug, but that was incidental.

POSIX does not specify behavior in the presence of mount points.

> My point is that there is in fact no performance impact caused by the patch

Here's an example with fewer than 5000 directories
that demonstrates a 23% impact:

    http://lkml.org/lkml/2009/11/5/366

> and what we do now is a premature optimization. The FTS module is simply
> based on an assumption which is not guaranteed to hold in general.