--fuzzy search over to-be-deleted files to catch moved files and directories

by H. Langos-8

Hi,

I use rsync via rsnapshot to back up a large collection of files.
Using "--fuzzy" already saves me a lot of time by discovering most
renames that happen within a directory. Therefore I already use the
"--delete-after" option.

Now the question is whether fuzzy search could be extended to search for
moved files across directory borders, for example when I move pictures
around to reorganize them in a different directory hierarchy, or when I
move whole directories around as I reorganize part of my music files.

I know there is --compare-dest but that requires knowledge of the old
location.

As I understand it, the "--delete-after" option already implies that the
receiving side has to keep all to-be-deleted files till the end of the
transfer.
So, instead of searching all files at the destination for a match, would
it be possible to also search that list of files that otherwise would
be deleted, and use them as potential fuzzy-match files?

If the performance penalty for performing this "file-resurrection" on
lots of files would be too big, there could be an option to only keep
files over a certain size in that "second chance" list.

Does that sound like a useful feature?

cheers
-henrik

--
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html

Re: --fuzzy search over to-be-deleted files to catch moved files and directories

by Matt McCutchen-7

On Tue, 2009-11-10 at 17:55 +0100, H. Langos wrote:
> Now the question is, if fuzzy search could be extended to search for
> moved files across directory borders. If for example I move pictures
> around to reorganize them in a different directory hierarchy, or if I
> move whole directories around as I reorganize part of my music files.

Consider the --detect-renamed option provided by the maintained patch
"detect-renamed.diff".  It will find moved files that match exactly
according to the "quick check" in effect (size + mtime or checksum).  It
doesn't calculate name similarity like --fuzzy because that would be
prohibitively expensive in the current implementation.

--
Matt

Re: --fuzzy search over to-be-deleted files to catch moved files and directories

by H. Langos-8

Hi Matt,

Thank you very much for the quick response!

On Tue, Nov 10, 2009 at 12:13:09PM -0500, Matt McCutchen wrote:
> On Tue, 2009-11-10 at 17:55 +0100, H. Langos wrote:
> > Now the question is, if fuzzy search could be extended to search for
> > moved files across directory borders. If for example I move pictures
> > around to reorganize them in a different directory hierarchy, or if I
> > move whole directories around as I reorganize part of my music files.
>
> Consider the --detect-renamed option provided by the maintained patch
> "detect-renamed.diff".  

That sounds just like the thing.
I applied the patch from git://git.samba.org/rsync-patches.git
that is tagged v3.0.6 (d64936b9..) and it builds nicely with the debian
lenny source package of 3.0.6 .. now I'll have to see how it works.

> It will find moved files that match exactly
> according to the "quick check" in effect (size + mtime or checksum).

That is basename+size+mtime  or basename+checksum, right?

How does "--detect-renamed" interact with "--fuzzy" and "--delete-after"?

> It doesn't calculate name similarity like --fuzzy because that would
> be prohibitively expensive in the current implementation.

Why would it be so expensive? Only files of the same size should be
candidates to start with, right? For small files (where most same-size
collisions will occur) the gain of fuzzy detecting renames is probably
not worth it. Normal move detection however will be helpful when moving
e.g. a kernel source tree around. For big files there should be very few
candidates to do a fuzzy comparison against. So cost should be rather low
for bigger files.

The thing that I am worried about is the case of DVD backups.
Here you get a lot of big VTS_01_1.VOB files that all have the same
size: 1073739776 bytes. So basename+filesize already match and the only
thing that would avoid loss of data would be the mtime. Frankly I
wouldn't want to trust the integrity of my DVD backups to a timestamp.
 
Is there a way to enforce checksum tests for moved files while keeping
size+mtime tests for files that didn't get moved/renamed ?

cheers
-henrik

Re: --fuzzy search over to-be-deleted files to catch moved files and directories

by H. Langos-8

On Wed, Nov 11, 2009 at 12:17:50PM +0100, H. Langos wrote:

> Hi Matt,
>
> Thank you very much for the quick response!
>
> On Tue, Nov 10, 2009 at 12:13:09PM -0500, Matt McCutchen wrote:
> > On Tue, 2009-11-10 at 17:55 +0100, H. Langos wrote:
> > > Now the question is, if fuzzy search could be extended to search for
> > > moved files across directory borders. If for example I move pictures
> > > around to reorganize them in a different directory hierarchy, or if I
> > > move whole directories around as I reorganize part of my music files.
> >
> > Consider the --detect-renamed option provided by the maintained patch
> > "detect-renamed.diff".  
>
> That sounds just like the thing.
> I applied the patch from git://git.samba.org/rsync-patches.git
> that is tagged v3.0.6 (d64936b9..) and it builds nicely with the debian
> lenny source package of 3.0.6 .. now I'll have to see how it works.


It does work, though the rename detection only shows up in the output if
you enable -vvv.

Unfortunately it still does not solve my problem. "--detect-renamed"
hardlinks when it moves deleted files to its temp directory. But then
it seems to copy the file to its new location.

This breaks the space saving property of rsnapshot (hardlinking files that
didn't change since the last snapshot).

So in order to really work in my setup, one would have to check whether the
delta transfer changed any data at all. If it didn't, a hardlink would be
made instead of a copy.
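
A minimal sketch of that check, written in Python rather than rsync's C and
not taken from any patch. The function name and its three arguments are
made up for this illustration, and it assumes the reconstructed file and the
old basis file are both available on the same filesystem:

```python
import filecmp
import os
import shutil


def place_renamed_file(basis_path, new_content_path, dest_path):
    """Create dest_path from new_content_path, hardlinking to the basis
    file when the transfer turned out not to change any data.

    All three parameter names are hypothetical, chosen for this sketch.
    """
    # shallow=False forces a byte-by-byte comparison instead of trusting stat().
    if filecmp.cmp(basis_path, new_content_path, shallow=False):
        os.link(basis_path, dest_path)  # identical data: share the inode
        return "hardlink"
    shutil.copy2(new_content_path, dest_path)  # data differs: separate copy
    return "copy"
```

With rsnapshot-style snapshots this would keep an unchanged but moved file
as one inode across snapshots, at the cost of one extra full read of both
files.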


> Is there a way to enforce checksum tests for moved files while keeping
> size+mtime tests for files that didn't get moved/renamed ?

According to the patch's documentation the delta-transfer algorithm will
be used for any detected rename. So my data should be safe :-)

-henrik

Re: --fuzzy search over to-be-deleted files to catch moved files and directories

by Matt McCutchen-7

Attempting to address each of your questions, here and then in your
other message...

On Wed, 2009-11-11 at 12:17 +0100, H. Langos wrote:
> > It will find moved files that match exactly
> > according to the "quick check" in effect (size + mtime or checksum).
>
> That is basename+size+mtime  or basename+checksum, right?

No, a basename match is not a requirement (hence the ability to detect
renames), but it is a tie-breaker.
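
To make the matching rule concrete, here is a rough Python sketch of a
size+mtime lookup with the basename as tie-breaker. It is an illustration
only, not the code from detect-renamed.diff; the function name and the
dict-based inputs are assumptions made for the example:

```python
import os
from collections import defaultdict


def find_renamed(src_files, dest_files):
    """Map source paths to destination paths that pass the quick check.

    Both arguments are dicts of path -> (size, mtime); these structures
    are hypothetical and only serve the illustration.
    """
    by_key = defaultdict(list)
    for path, key in dest_files.items():
        by_key[key].append(path)

    matches = {}
    for src, key in src_files.items():
        candidates = by_key.get(key, [])
        if not candidates:
            continue
        # A basename match is not required, but it breaks ties.
        same_name = [c for c in candidates
                     if os.path.basename(c) == os.path.basename(src)]
        matches[src] = (same_name or candidates)[0]
    return matches
```

With --checksum in effect the key would be the checksum instead of the
(size, mtime) pair; the lookup itself stays the same.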

> How does "--detect-renamed" interact with "--fuzzy" and "--delete-after"?

--detect-renamed and --fuzzy are two different means of finding basis
files that overlap in some cases but do not really interact.
--detect-renamed considers the whole destination using the quick check,
while --fuzzy considers only the same destination subdir using
size+mtime or otherwise name similarity.

--delete-before and --delete-during may reduce the effectiveness of
--fuzzy, as stated in the man page description of --fuzzy, but they do
not affect --detect-renamed since --detect-renamed actually works during
the delete pass.

> > It doesn't calculate name similarity like --fuzzy because that would
> > be prohibitively expensive in the current implementation.

> Only files of the same size should be
> candidates to start with, right.

No, the name similarity calculation I'm talking about is the fallback to
select a similar basis file when no available destination file passes
the quick check, so it does not require a size match.

> Why would it be so expensive?

Wayne said so here:

https://bugzilla.samba.org/show_bug.cgi?id=3392#c11

I haven't done any tests myself.

--
Matt

Re: --fuzzy search over to-be-deleted files to catch moved files and directories

by Matt McCutchen-7

On Wed, 2009-11-11 at 14:32 +0100, H. Langos wrote:

> Unfortunately it still does not solve my problem. "--detect-renamed"
> hardlinks when it moves deleted files to its temp directory. But then
> it seems to copy the file to its new location.
>
> This breaks the space saving property of rsnapshot (hardlinking files that
> didn't change since the last snapshot).
>
> So in order to really work in my setup one would have to check if the delta
> transfer did change any data at all. If it didn't a hardlink would be made
> instead of a copy.

Ah.  The RFE for the check you describe is:

https://bugzilla.samba.org/show_bug.cgi?id=5583

A currently available alternative is the --detect-moved option added by
"detect-renamed-lax.diff", which will accept a file matching in basename
+ quick check directly, without running the delta-transfer algorithm.
This is risky, but not quite as risky as --detect-renamed-lax.

--
Matt

Re: --fuzzy search over to-be-deleted files to catch moved files and directories

by H. Langos-8

Hi Matt,

Thank you very much for answering those questions and helping me to
understand rsync better!

On Thu, Nov 12, 2009 at 11:20:19PM -0500, Matt McCutchen wrote:

> Attempting to address each of your questions, here and then in your
> other message...
>
> On Wed, 2009-11-11 at 12:17 +0100, H. Langos wrote:
> > > It will find moved files that match exactly
> > > according to the "quick check" in effect (size + mtime or checksum).
> >
> > That is basename+size+mtime  or basename+checksum, right?
>
> No, a basename match is not a requirement (hence the ability to detect
> renames), but it is a tie-breaker.

Ahh, ok, so here size+mtime or checksum select the base file.

And if that selection fails then "--fuzzy" search is applied but looks
only in the /dst directory for a suitable candidate.

(Or is the temporal order reversed?)

> > How does "--detect-renamed" interact with "--fuzzy" and "--delete-after"?
>
> --detect-renamed and --fuzzy are two different means of finding basis
> files that overlap in some cases but do not really interact.
> --detect-renamed considers the whole destination using the quick check,
> while --fuzzy considers only the same destination subdir using
> size+mtime or otherwise name similarity.
>
> --delete-before and --delete-during may reduce the effectiveness of
> --fuzzy, as stated in the man page description of --fuzzy, but they do
> not affect --detect-renamed since --detect-renamed actually works during
> the delete pass.
...
> > > It doesn't calculate name similarity like --fuzzy because that would
> > > be prohibitively expensive in the current implementation.
> > Only files of the same size should be
> > candidates to start with, right?
>
> No, the name similarity calculation I'm talking about is the fallback to
> select a similar basis file when no available destination file passes
> the quick check, so it does not require a size match.

Hmm, ok so fuzzy also finds files that are slightly different and have their
name slightly changed.

This sounds like it would be a good idea to (have the option to) include
the delete candidates directory .~tmp~ (or whatever else "--detect-renamed"
uses) in the --fuzzy search.

The real world applications are obvious. Apart from software packages
as described in https://bugzilla.samba.org/show_bug.cgi?id=3392#c7
(thanks for the link!), which is a special case, using rsync friendly
gzip/zlib compression, there is the large area of media files.

Example:
For my photo collections it would speed things up in the case where I
move pictures to a different directory, rename them from DSC_01234.JPG to
20091113-174354_dsc01234.jpg (extracted timestamp from exif data) and
add author, license and some keywords to the exif tags.

This is not theory. In fact I do just those things with a script when
importing pictures from any of my cameras into the photo archive. I
rename them as shown above and then I move them to a directory structure
made of <year>/<month>/<day>/ . I don't change the exif tags yet, but I
want to add that in the future.
That, however, would make the size+mtime/checksum test fail. Using
"--fuzzy" would help, but only if I'd do an rsync between the moving
operation and the tag changing operation.

No matter which operation I'd do first, doing both together would
mean a completely new transfer to my backup location. :-/


Same thing goes for mp3 collections when you finally find the time
to tag your new music and move it to the right directory in your
collection.
 
> > Why would it be so expensive?
>
> Wayne said so here:
>
> https://bugzilla.samba.org/show_bug.cgi?id=3392#c11

Well, I think I'll have to wait then ... or refrain from doing move
and change operations at the same time. :-)

Thank you very much for your help!

cheers
-henrik
 
Re: --fuzzy search over to-be-deleted files to catch moved files and directories

by Matt McCutchen-7

On Fri, 2009-11-13 at 18:58 +0100, H. Langos wrote:
> Ahh, ok, so here size+mtime or checksum select the base file.
>
> And if that selection fails then "--fuzzy" search is applied but looks
> only in the /dst directory for a suitable candidate.
>
> (Or is the temporal order reversed?)

Yes, that's about right.

> > > > [--detect-renamed] doesn't calculate name similarity like --fuzzy because that would
> > > > be prohibitively expensive in the current implementation.
> > > Only files of the same size should be
> > > candidates to start with, right?
> >
> > No, the name similarity calculation I'm talking about is the fallback to
> > select a similar basis file when no available destination file passes
> > the quick check, so it does not require a size match.
>
> Hmm, ok so fuzzy also finds files that are slightly different and have their
> name slightly changed.

There's no "slightly" on "different" there.  Assuming --fuzzy doesn't
find a quick-check match (and it probably won't because --detect-renamed
has already searched the whole destination with the same criteria), the
choice of basis files is based exclusively on name similarity.
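
The two stages can be sketched in Python as follows. This is only an
illustration of the order of the checks: difflib's similarity ratio is a
stand-in chosen for the example and is not rsync's own name-distance
calculation, and the function name and input format are assumptions:

```python
import difflib


def pick_fuzzy_basis(src_name, src_key, dir_entries):
    """Pick a basis file from one destination directory.

    dir_entries is a hypothetical dict of name -> (size, mtime);
    src_key is the (size, mtime) pair of the source file.
    """
    # Stage 1: a file in the same directory that passes the quick check.
    for name, key in dir_entries.items():
        if key == src_key:
            return name
    # Stage 2: fall back to name similarity alone; size plays no role here.
    best, best_score = None, 0.0
    for name in dir_entries:
        score = difflib.SequenceMatcher(None, src_name, name).ratio()
        if score > best_score:
            best, best_score = name, score
    return best
```

Stage 2 is where the cost comes from: every candidate name has to be
scored against every new file's name.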

> This sounds like it would be a good idea to (have the option to) include
> the delete candidates directory .~tmp~ (or whatever else "--detect-renamed"
> uses) included in the --fuzzy search.

I'm not clear on what you're proposing here.  Could you provide an
example?

> In fact I do just those things with a script when
> importing pictures from any of my cameras into the photo archive. I
> rename them as shown above and then I move them to a directory structure
> made of <year>/<month>/<day>/ . I don't change the exif tags yet, which
> I wanted to add in the future.
> But that would make the  size+mtime/checksum test fail. Using "--fuzzy"
> would help, but only if I'd do an rsync between the moving operation
> and the tag changing operation.
>
> No matter which operation I'd do first, but doing both together would
> mean completely new transfer to my backup location. :-/

Right.  Note that if you did an rsync between the moving and the tag
changing, you wouldn't need --fuzzy on the second rsync because the
files would already be in the right places.

Efficiently handling simultaneous renames and data changes is very hard
for a stateless tool like rsync.  If I understand correctly that you're
moving files without changing their basenames, it would work in this
case to extend --detect-renamed to look for an exact basename match if
there is no quick-check match.  That would overlap even more with the
current --fuzzy functionality.  There may be a better way to factor
things.

--
Matt

Re: --fuzzy search over to-be-deleted files to catch moved files and directories

by H. Langos-8

On Sat, Nov 21, 2009 at 09:08:05AM -0500, Matt McCutchen wrote:

> On Fri, 2009-11-13 at 18:58 +0100, H. Langos wrote:
> > > > > [--detect-renamed] doesn't calculate name similarity like --fuzzy because that would
> > > > > be prohibitively expensive in the current implementation.
> > > > Only files of the same size should be
> > > > candidates to start with, right?
> > >
> > > No, the name similarity calculation I'm talking about is the fallback to
> > > select a similar basis file when no available destination file passes
> > > the quick check, so it does not require a size match.
> >
> > Hmm, ok so fuzzy also finds files that are slightly different and have their
> > name slightly changed.
>
> There's no "slightly" on "different" there.  Assuming --fuzzy doesn't
> find a quick-check match (and it probably won't because --detect-renamed
> has already searched the whole destination with the same criteria), the
> choice of basis files is based exclusively on name similarity.

Ok, I see. Does "--fuzzy" check if the file size is in the same order of
magnitude (or at most one order up/down)?
Expensive fuzzy string matching on filenames can probably safely be skipped
(with ssize and dsize being the source and destination file sizes)
if abs(round(log10(ssize))-round(log10(dsize))) > 1
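
Spelled out as a small Python helper, the size pre-filter could look like
this. The function name is made up, and the handling of empty files is an
assumption that the formula itself does not cover (log10 of 0 is
undefined):

```python
import math


def sizes_comparable(ssize, dsize):
    """True when the two sizes are within about one order of magnitude,
    i.e. when the name comparison is still worth doing."""
    if ssize <= 0 or dsize <= 0:
        # Assumption: an empty file is only comparable to another empty file.
        return ssize == dsize
    return abs(round(math.log10(ssize)) - round(math.log10(dsize))) <= 1
```

A candidate would only go on to the string comparison when
sizes_comparable() returns True.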

> > This sounds like it would be a good idea to (have the option to) include
> > the delete candidates directory .~tmp~ (or whatever else "--detect-renamed"
> > uses) included in the --fuzzy search.
>
> I'm not clear on what you're proposing here.  Could you provide an
> example?

Ok, I start with this situation, where src and dst are in sync:
src/new/foo.jpg
src/new/bar.jpg
src/2009/

dst/new/foo.jpg
dst/new/bar.jpg
dst/2009/

Then I run my picture import script. The files are renamed, moved to
different directories and some bytes have been added. I end up with
something like this:

src/new/
src/2009/2009-11-23-foo.jpg
src/2009/2009-11-23-bar.jpg

dst/new/foo.jpg
dst/new/bar.jpg
dst/2009/

If I run rsync in this situation, then dst/new/foo.jpg and dst/new/bar.jpg
will end up on the destination's to-be-deleted list and --fuzzy would find
nothing in dst/2009/ that it could use as base for the "new" files
src/2009/2009-11-23-foo.jpg and src/2009/2009-11-23-bar.jpg.

What I propose is that a new option, let's call it "--fuzzy-detect-renamed",
should not only look in the same directory but also in the to-be-deleted
list that "--detect-renamed" uses as temporary asylum for deletion/renaming
candidates.

Since foo.jpg and bar.jpg are on that to-be-deleted list, my expectation of
that new behavior is that foo.jpg would be taken as base for 2009-11-23-foo.jpg
and bar.jpg would be taken as a base file for 2009-11-23-bar.jpg
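
As a sketch of the proposed behavior (Python, purely illustrative):
"--fuzzy-detect-renamed" does not exist, the function below is
hypothetical, and difflib's ratio again stands in for whatever name
similarity measure rsync would really use:

```python
import difflib
import os


def pick_basis(src_path, dest_dir_files, to_be_deleted):
    """Return the candidate whose basename is most similar to src_path's.

    Candidates come from the file's own destination directory plus the
    to-be-deleted list; both lists are assumptions of this sketch.
    """
    src_name = os.path.basename(src_path)
    candidates = list(dest_dir_files) + list(to_be_deleted)

    def score(path):
        return difflib.SequenceMatcher(
            None, src_name, os.path.basename(path)).ratio()

    return max(candidates, key=score, default=None)


# The example from above: dst/2009/ is empty, the old files await deletion.
deleted = ["dst/new/foo.jpg", "dst/new/bar.jpg"]
for src in ["src/2009/2009-11-23-foo.jpg", "src/2009/2009-11-23-bar.jpg"]:
    print(src, "<-", pick_basis(src, [], deleted))
```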

> > In fact I do just those things with a script when
> > importing pictures from any of my cameras into the photo archive. I
> > rename them as shown above and then I move them to a directory structure
> > made of <year>/<month>/<day>/ . I don't change the exif tags yet, which
> > I wanted to add in the future.
> > But that would make the  size+mtime/checksum test fail. Using "--fuzzy"
> > would help, but only if I'd do an rsync between the moving operation
> > and the tag changing operation.
> >
> > No matter which operation I'd do first, but doing both together would
> > mean completely new transfer to my backup location. :-/
>
> Right.  Note that if you did an rsync between the moving and the tag
> changing, you wouldn't need --fuzzy on the second rsync because the
> files would already be in the right places.

Right.

> Efficiently handling simultaneous renames and data changes is very hard
> for a stateless tool like rsync.  If I understand correctly that you're
> moving files without changing their basenames, it would work in this
> case to extend --detect-renamed to look for an exact basename match if
> there is no quick-check match.

I do change the basename too (e.g. I rename "img_1023.jpg" to
"2009-10-18_img_1023.jpg"), but in a way that fuzzy matching should
be able to catch.

> That would overlap even more with the current --fuzzy functionality.
> There may be a better way to factor things.

Right. There are a lot of options that change the way rsync looks for
quick-check or basefile candidates and due to the organic growth of features
their behavior is not always as the users expect.

Maybe it is time to think about a more consistent way to control the
search for a basefile and quick-check candidate.

My first idea would be to add a more explicit form of control, e.g. lists
of key value pairs that say _what_ aspect of a file you want to match
and _how well_ it needs to match in order to pass the quick-check or
to be used as a base for the delta transfer.
Existing options can easily be translated into that explicit form so that
internally there would only be one control logic.

Here are some examples of the current options translated into that new
schema (I hope I got them right, but keep in mind that this is just a
sketch):

default behaviour of rsync is something like this:

 --quick path=same,filename=same,size=same,mtime=same
 --delta path=same,filename=same


when given the "--checksum" option it is:

 --quick path=same,filename=same,checksum=same
 --delta path=same,filename=same


with the current "--fuzzy" option it is

 --quick path=same,filename=same,size=same,mtime=same
 --delta path=same,filename=same
 --delta path=same,filename=fuzzy


with the current "--detect-renamed" option it is

 --quick path=same,filename=same,size=same,mtime=same
 --delta path=same,filename=same
 --delta path=deleted,size=same,mtime=same


this is more easily extendible as new aspects can be added without
changing current behaviour and new "qualities" of matching can be
added to express stuff like

explicit source files (regardless of the src filename):

 --delta path=some/arbitrary/path/,filename=foo.img


a "pool directory" of source files:

 --quick path=my/pool/path/,filename=same,mtime=same,size=same
 --delta path=my/pool/path/,filename=fuzzy,size=fuzzy


only use files as base if they are smaller:

 --delta path=same,filename=same,size=smaller

you could even express when to skip the delta comparisons completely.
e.g. if the destination file was created before the source file (a
situation that you encounter when syncing a directory with rotating
log files and a rotation has taken place at /src)

  --whole path=same,filename=same,ctime=older
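
To show how such rule strings could be handled internally, here is a small
Python sketch. Everything in it is hypothetical: --quick, --delta and
--whole are only the proposed spellings from this mail, not existing rsync
options, and parse_rule() and translate() are names invented for the
example:

```python
def parse_rule(spec):
    """Parse 'path=same,filename=fuzzy' into a dict of aspect -> quality."""
    rule = {}
    for pair in spec.split(","):
        aspect, sep, quality = pair.partition("=")
        if not sep or not aspect.strip() or not quality.strip():
            raise ValueError(f"malformed pair: {pair!r}")
        rule[aspect.strip()] = quality.strip()
    return rule


def translate(option):
    """Translate a summary option into explicit (kind, rule) entries,
    following the example translations given above."""
    quick = ("quick", parse_rule("path=same,filename=same,size=same,mtime=same"))
    delta = ("delta", parse_rule("path=same,filename=same"))
    table = {
        "default": [quick, delta],
        "--checksum": [
            ("quick", parse_rule("path=same,filename=same,checksum=same")),
            delta,
        ],
        "--fuzzy": [
            quick, delta,
            ("delta", parse_rule("path=same,filename=fuzzy")),
        ],
        "--detect-renamed": [
            quick, delta,
            ("delta", parse_rule("path=deleted,size=same,mtime=same")),
        ],
    }
    return table[option]
```

The point of the single table is that every summary option ends up as data
for one control logic instead of as a separate code path.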


Sure, this schema is more verbose than the current set of options, but people
would use it in scripts rather than on the command line, and there you want
your commands to be as verbose and explicit as possible. After all, you'll
want somebody else to understand your scripts without reading all the
commands' man pages, and you'll want the behavior to stay constant even when
the next major version of rsync changes the behavior of one of the summary
options.


cheers
-henrik
