Re: --fuzzy search over to-be-deleted files to catch moved files and directories

On Tue, 2009-11-10 at 17:55 +0100, H. Langos wrote:
> Now the question is, if fuzzy search could be extended to search for
> moved files across directory borders. If for example I move pictures
> around to reorganize them in a different directory hierarchy, or if I
> move whole directories around as I reorganize part of my music files.

Consider the --detect-renamed option provided by the maintained patch
"detect-renamed.diff". It will find moved files that match exactly
according to the "quick check" in effect (size + mtime or checksum). It
doesn't calculate name similarity like --fuzzy because that would be
prohibitively expensive in the current implementation.

--
Matt

--
Please use reply-all for most replies to avoid omitting the mailing list.
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
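
The matching Matt describes can be pictured with a small local sketch. This is not the code from "detect-renamed.diff"; it assumes both trees are on the same machine, uses (size, mtime) as a stand-in for the quick check, and the function names are made up for illustration.

```python
import os
from collections import defaultdict


def quick_check_key(path):
    """Return (size, mtime), used here as a stand-in for rsync's quick check."""
    st = os.stat(path)
    return (st.st_size, int(st.st_mtime))


def list_files(root):
    """Map each file's path relative to root to its quick-check key."""
    files = {}
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            files[os.path.relpath(full, root)] = quick_check_key(full)
    return files


def find_rename_candidates(src_root, dst_root):
    """Pair source-only files with destination-only files that share a key."""
    src, dst = list_files(src_root), list_files(dst_root)
    # Destination files with no counterpart in the source would be deleted.
    to_be_deleted = defaultdict(list)
    for rel, key in dst.items():
        if rel not in src:
            to_be_deleted[key].append(rel)
    # Source files missing from the destination may be renames of those.
    return {rel: sorted(to_be_deleted[key])
            for rel, key in src.items()
            if rel not in dst and key in to_be_deleted}
```

Because the whole destination is indexed by key, a match is found regardless of which directory the file moved to, which is the difference from --fuzzy that the reply points out.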
Re: --fuzzy search over to-be-deleted files to catch moved files and directories

Hi Matt,

Thank you very much for the quick response!

On Tue, Nov 10, 2009 at 12:13:09PM -0500, Matt McCutchen wrote:
> On Tue, 2009-11-10 at 17:55 +0100, H. Langos wrote:
> > Now the question is, if fuzzy search could be extended to search for
> > moved files across directory borders. If for example I move pictures
> > around to reorganize them in a different directory hierarchy, or if I
> > move whole directories around as I reorganize part of my music files.
>
> Consider the --detect-renamed option provided by the maintained patch
> "detect-renamed.diff".

That sounds just like the thing.
I applied the patch from git://git.samba.org/rsync-patches.git
that is tagged v3.0.6 (d64936b9..) and it builds nicely with the Debian
lenny source package of 3.0.6. Now I'll have to see how it works.

> It will find moved files that match exactly
> according to the "quick check" in effect (size + mtime or checksum).

That is basename+size+mtime or basename+checksum, right?

How does "--detect-renamed" interact with "--fuzzy" and "--delete-after"?

> It doesn't calculate name similarity like --fuzzy because that would
> be prohibitively expensive in the current implementation.

Why would it be so expensive? Only files of the same size should be
candidates to start with, right?

For small files (where most same-size collisions will occur) the gain of
fuzzy rename detection is probably not worth it. Normal move detection,
however, will be helpful when moving e.g. a kernel source tree around.

For big files there should be very few candidates to do a fuzzy comparison
against, so the cost should be rather low for bigger files.

The thing that I am worried about is the case of DVD backups. Here you get
a lot of big VTS_01_1.VOB files that all have the same size: 1073739776
bytes. So basename+filesize already match, and the only thing that would
avoid loss of data would be the mtime. Frankly, I wouldn't want to trust
the integrity of my DVD backups to a timestamp.

Is there a way to enforce checksum tests for moved files while keeping
size+mtime tests for files that didn't get moved/renamed?

cheers
-henrik
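
The policy asked about here (trust size+mtime for files that stayed put, but require a content check before accepting a moved file) could look roughly like this. It is a sketch under loud assumptions: both files are readable locally, SHA-256 stands in for whatever checksum rsync would use, and `file_digest` and `confirm_move` are hypothetical names, not rsync options or functions.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files (e.g. 1 GB VOBs) are not loaded at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def confirm_move(src_path, candidate_path):
    """Accept a size+mtime rename candidate only if the contents also match."""
    return file_digest(src_path) == file_digest(candidate_path)
```

Only files that were flagged as rename candidates would pay the hashing cost, so unchanged files in place keep the cheap size+mtime test.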
Re: --fuzzy search over to-be-deleted files to catch moved files and directories

On Wed, Nov 11, 2009 at 12:17:50PM +0100, H. Langos wrote:
> Hi Matt,
>
> Thank you very much for the quick response!
>
> On Tue, Nov 10, 2009 at 12:13:09PM -0500, Matt McCutchen wrote:
> > On Tue, 2009-11-10 at 17:55 +0100, H. Langos wrote:
> > > Now the question is, if fuzzy search could be extended to search for
> > > moved files across directory borders. If for example I move pictures
> > > around to reorganize them in a different directory hierarchy, or if I
> > > move whole directories around as I reorganize part of my music files.
> >
> > Consider the --detect-renamed option provided by the maintained patch
> > "detect-renamed.diff".
>
> That sounds just like the thing.
> I applied the patch from git://git.samba.org/rsync-patches.git
> that is tagged v3.0.6 (d64936b9..) and it builds nicely with the Debian
> lenny source package of 3.0.6. Now I'll have to see how it works.

It does work. The rename detection only shows up in the output if you
enable -vvv, though.

Unfortunately it still does not solve my problem. "--detect-renamed"
hardlinks when it moves deleted files to its temp directory. But then
it seems to copy the file to its new location.

This breaks the space-saving property of rsnapshot (hardlinking files that
didn't change since the last snapshot).

So in order to really work in my setup, one would have to check whether the
delta transfer changed any data at all. If it didn't, a hardlink would be
made instead of a copy.

> Is there a way to enforce checksum tests for moved files while keeping
> size+mtime tests for files that didn't get moved/renamed?

According to the patch's documentation, the delta-transfer algorithm will be
used for any detected rename. So my data should be safe :-)

-henrik
Re: --fuzzy search over to-be-deleted files to catch moved files and directories

Attempting to address each of your questions, here and then in your
other message...

On Wed, 2009-11-11 at 12:17 +0100, H. Langos wrote:
> > It will find moved files that match exactly
> > according to the "quick check" in effect (size + mtime or checksum).
>
> That is basename+size+mtime or basename+checksum, right?

No, a basename match is not a requirement (hence the ability to detect
renames), but it is a tie-breaker.

> How does "--detect-renamed" interact with "--fuzzy" and "--delete-after"?

--detect-renamed and --fuzzy are two different means of finding basis
files that overlap in some cases but do not really interact.
--detect-renamed considers the whole destination using the quick check,
while --fuzzy considers only the same destination subdir using
size+mtime or otherwise name similarity.

--delete-before and --delete-during may reduce the effectiveness of
--fuzzy, as stated in the man page description of --fuzzy, but they do
not affect --detect-renamed since --detect-renamed actually works during
the delete pass.

> > It doesn't calculate name similarity like --fuzzy because that would
> > be prohibitively expensive in the current implementation.

> Only files of the same size should be
> candidates to start with, right?

No, the name similarity calculation I'm talking about is the fallback to
select a similar basis file when no available destination file passes
the quick check, so it does not require a size match.

> Why would it be so expensive?

Wayne said so here:

https://bugzilla.samba.org/show_bug.cgi?id=3392#c11

I haven't done any tests myself.

--
Matt
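
The two selection rules in this reply can be sketched side by side. The helper names are hypothetical, and `difflib.SequenceMatcher` is only a stand-in for name similarity; rsync's own scoring for --fuzzy is not reproduced here.

```python
import difflib
import os


def pick_quick_check_match(target, matches):
    """Among files that already pass the quick check, prefer the same basename.

    The basename is only a tie-breaker: any quick-check match is acceptable.
    """
    if not matches:
        return None
    base = os.path.basename(target)
    same_name = [m for m in matches if os.path.basename(m) == base]
    return (same_name or matches)[0]


def pick_by_name_similarity(target, same_dir_files):
    """Fallback in the spirit of --fuzzy: most similar name, no size match required."""
    if not same_dir_files:
        return None
    base = os.path.basename(target)

    def score(path):
        return difflib.SequenceMatcher(None, base, os.path.basename(path)).ratio()

    return max(same_dir_files, key=score)
```

The first function needs only dictionary-style lookups, while the second compares the target name against every candidate, which hints at why extending it to the whole destination could get expensive.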
Re: --fuzzy search over to-be-deleted files to catch moved files and directories

On Wed, 2009-11-11 at 14:32 +0100, H. Langos wrote:
> Unfortunately it still does not solve my problem. "--detect-renamed"
> hardlinks when it moves deleted files to its temp directory. But then
> it seems to copy the file to its new location.
>
> This breaks the space-saving property of rsnapshot (hardlinking files that
> didn't change since the last snapshot).
>
> So in order to really work in my setup, one would have to check whether the
> delta transfer changed any data at all. If it didn't, a hardlink would be
> made instead of a copy.

Ah. The RFE for the check you describe is:

https://bugzilla.samba.org/show_bug.cgi?id=5583

A currently available alternative is the --detect-moved option added by
"detect-renamed-lax.diff", which will accept a file matching in basename
+ quick check directly, without running the delta-transfer algorithm.
This is risky, but not quite as risky as --detect-renamed-lax.

--
Matt
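
The check requested in that RFE amounts to "link instead of copy when nothing changed". A minimal local sketch follows; it assumes the rebuilt file and the basis file sit on the same filesystem, and `place_file` and its parameters are hypothetical names, not part of rsync or either patch.

```python
import filecmp
import os
import shutil


def place_file(rebuilt_path, basis_path, dest_path):
    """Put the transferred file at dest_path, sharing storage when possible.

    If the rebuilt file is byte-identical to the basis file, hard-link the
    basis to the new name so snapshot tools keep their space savings.
    Otherwise fall back to an ordinary copy.
    """
    if os.path.exists(basis_path) and filecmp.cmp(rebuilt_path, basis_path, shallow=False):
        os.link(basis_path, dest_path)
        return "linked"
    shutil.copy2(rebuilt_path, dest_path)
    return "copied"
```

The full comparison (`shallow=False`) reads both files, so this trades extra local I/O for the guarantee that a link is only made when the data really is unchanged.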
Re: --fuzzy search over to-be-deleted files to catch moved files and directories

Hi Matt,

Thank you very much for answering those questions and helping me to
understand rsync better!

On Thu, Nov 12, 2009 at 11:20:19PM -0500, Matt McCutchen wrote:
> Attempting to address each of your questions, here and then in your
> other message...
>
> On Wed, 2009-11-11 at 12:17 +0100, H. Langos wrote:
> > > It will find moved files that match exactly
> > > according to the "quick check" in effect (size + mtime or checksum).
> >
> > That is basename+size+mtime or basename+checksum, right?
>
> No, a basename match is not a requirement (hence the ability to detect
> renames), but it is a tie-breaker.

Ahh, ok, so here size+mtime or checksum select the base file.

And if that selection fails then "--fuzzy" search is applied but looks
only in the /dst directory for a suitable candidate.

(Or is the temporal order reversed?)

> > How does "--detect-renamed" interact with "--fuzzy" and "--delete-after"?
>
> --detect-renamed and --fuzzy are two different means of finding basis
> files that overlap in some cases but do not really interact.
> --detect-renamed considers the whole destination using the quick check,
> while --fuzzy considers only the same destination subdir using
> size+mtime or otherwise name similarity.
>
> --delete-before and --delete-during may reduce the effectiveness of
> --fuzzy, as stated in the man page description of --fuzzy, but they do
> not affect --detect-renamed since --detect-renamed actually works during
> the delete pass.

> > > It doesn't calculate name similarity like --fuzzy because that would
> > > be prohibitively expensive in the current implementation.
> > Only files of the same size should be
> > candidates to start with, right?
>
> No, the name similarity calculation I'm talking about is the fallback to
> select a similar basis file when no available destination file passes
> the quick check, so it does not require a size match.

Hmm, ok so fuzzy also finds files that are slightly different and have their
name slightly changed.

This sounds like it would be a good idea to (have the option to) include
the delete candidates directory .~tmp~ (or whatever else "--detect-renamed"
uses) in the --fuzzy search.

The real-world applications are obvious. Apart from software packages
as described in https://bugzilla.samba.org/show_bug.cgi?id=3392#c7
(thanks for the link!), which is a special case using rsync-friendly
gzip/zlib compression, there is the large area of media files.

Example:
For my photo collections it would speed things up in the case where I
move pictures to a different directory, rename them from DSC_01234.JPG to
20091113-174354_dsc01234.jpg (extracted timestamp from exif data) and add
author, license and some keywords to the exif tags.

This is not theory. In fact I do just those things with a script when
importing pictures from any of my cameras into the photo archive. I
rename them as shown above and then I move them to a directory structure
made of <year>/<month>/<day>/ . I don't change the exif tags yet, which
I wanted to add in the future.
But that would make the size+mtime/checksum test fail. Using "--fuzzy"
would help, but only if I'd do an rsync between the moving operation
and the tag changing operation.

It doesn't matter which operation I'd do first, but doing both together
would mean a completely new transfer to my backup location. :-/

The same thing goes for mp3 collections when you finally find the time to tag
your new music and move it to the right directory in your collection.

> > Why would it be so expensive?
>
> Wayne said so here:
>
> https://bugzilla.samba.org/show_bug.cgi?id=3392#c11

Well, I think I'll have to wait then ... or refrain from doing move
and change operations at the same time. :-)

Thank you very much for your help!

cheers
-henrik
Re: --fuzzy search over to-be-deleted files to catch moved files and directories

On Fri, 2009-11-13 at 18:58 +0100, H. Langos wrote:
> Ahh, ok, so here size+mtime or checksum select the base file.
>
> And if that selection fails then "--fuzzy" search is applied but looks
> only in the /dst directory for a suitable candidate.
>
> (Or is the temporal order reversed?)

Yes, that's about right.

> > > > [--detect-renamed] doesn't calculate name similarity like --fuzzy because that would
> > > > be prohibitively expensive in the current implementation.
> > > Only files of the same size should be
> > > candidates to start with, right?
> >
> > No, the name similarity calculation I'm talking about is the fallback to
> > select a similar basis file when no available destination file passes
> > the quick check, so it does not require a size match.
>
> Hmm, ok so fuzzy also finds files that are slightly different and have their
> name slightly changed.

There's no "slightly" on "different" there. Assuming --fuzzy doesn't
find a quick-check match (and it probably won't because --detect-renamed
has already searched the whole destination with the same criteria), the
choice of basis files is based exclusively on name similarity.

> This sounds like it would be a good idea to (have the option to) include
> the delete candidates directory .~tmp~ (or whatever else "--detect-renamed"
> uses) in the --fuzzy search.

I'm not clear on what you're proposing here. Could you provide an
example?

> In fact I do just those things with a script when
> importing pictures from any of my cameras into the photo archive. I
> rename them as shown above and then I move them to a directory structure
> made of <year>/<month>/<day>/ . I don't change the exif tags yet, which
> I wanted to add in the future.
> But that would make the size+mtime/checksum test fail. Using "--fuzzy"
> would help, but only if I'd do an rsync between the moving operation
> and the tag changing operation.
>
> It doesn't matter which operation I'd do first, but doing both together
> would mean a completely new transfer to my backup location. :-/

Right. Note that if you did an rsync between the moving and the tag
changing, you wouldn't need --fuzzy on the second rsync because the
files would already be in the right places.

Efficiently handling simultaneous renames and data changes is very hard
for a stateless tool like rsync. If I understand correctly that you're
moving files without changing their basenames, it would work in this
case to extend --detect-renamed to look for an exact basename match if
there is no quick-check match.

That would overlap even more with the current --fuzzy functionality.
There may be a better way to factor things.

--
Matt
Re: --fuzzy search over to-be-deleted files to catch moved files and directories

On Sat, Nov 21, 2009 at 09:08:05AM -0500, Matt McCutchen wrote:
> On Fri, 2009-11-13 at 18:58 +0100, H. Langos wrote:
> > > > > [--detect-renamed] doesn't calculate name similarity like --fuzzy because that would
> > > > > be prohibitively expensive in the current implementation.
> > > > Only files of the same size should be
> > > > candidates to start with, right?
> > >
> > > No, the name similarity calculation I'm talking about is the fallback to
> > > select a similar basis file when no available destination file passes
> > > the quick check, so it does not require a size match.
> >
> > Hmm, ok so fuzzy also finds files that are slightly different and have their
> > name slightly changed.
>
> There's no "slightly" on "different" there. Assuming --fuzzy doesn't
> find a quick-check match (and it probably won't because --detect-renamed
> has already searched the whole destination with the same criteria), the
> choice of basis files is based exclusively on name similarity.

Ok, I see. Does "--fuzzy" check if the file size is in the same order of
magnitude (or at most one order up/down)? Expensive fuzzy string
matching on filenames can probably safely be skipped if

  abs(round(log10(ssize))-round(log10(dsize))) > 1

> > This sounds like it would be a good idea to (have the option to) include
> > the delete candidates directory .~tmp~ (or whatever else "--detect-renamed"
> > uses) in the --fuzzy search.
>
> I'm not clear on what you're proposing here. Could you provide an
> example?

Ok, I start with this situation, where src and dst are in sync:

  src/new/foo.jpg
  src/new/bar.jpg
  src/2009/

  dst/new/foo.jpg
  dst/new/bar.jpg
  dst/2009/

Then I run my picture import script. The files are renamed, moved to
different directories and some bytes have been added. I end up with
something like this:

  src/new/
  src/2009/2009-11-23-foo.jpg
  src/2009/2009-11-23-bar.jpg

  dst/new/foo.jpg
  dst/new/bar.jpg
  dst/2009/

If I run rsync in this situation, then dst/new/foo.jpg and dst/new/bar.jpg
will end up on the destination's to-be-deleted list and --fuzzy would find
nothing in dst/2009/ that it could use as base for the "new" files
src/2009/2009-11-23-foo.jpg and src/2009/2009-11-23-bar.jpg.

What I propose is that, let's call it "--fuzzy-detect-renamed", should not
only look in the same directory but also in the to-be-deleted list that
"--detect-renamed" uses as temporary asylum for deletion/renaming
candidates.

Since foo.jpg and bar.jpg are on that to-be-deleted list, my expectation
of that new behavior is that foo.jpg would be taken as base for
2009-11-23-foo.jpg and bar.jpg would be taken as a base file for
2009-11-23-bar.jpg.

> > In fact I do just those things with a script when
> > importing pictures from any of my cameras into the photo archive. I
> > rename them as shown above and then I move them to a directory structure
> > made of <year>/<month>/<day>/ . I don't change the exif tags yet, which
> > I wanted to add in the future.
> > But that would make the size+mtime/checksum test fail. Using "--fuzzy"
> > would help, but only if I'd do an rsync between the moving operation
> > and the tag changing operation.
> >
> > It doesn't matter which operation I'd do first, but doing both together
> > would mean a completely new transfer to my backup location. :-/
>
> Right. Note that if you did an rsync between the moving and the tag
> changing, you wouldn't need --fuzzy on the second rsync because the
> files would already be in the right places.

Right.

> Efficiently handling simultaneous renames and data changes is very hard
> for a stateless tool like rsync. If I understand correctly that you're
> moving files without changing their basenames, it would work in this
> case to extend --detect-renamed to look for an exact basename match if
> there is no quick-check match.

I do change the basename too (e.g. I rename "img_1023.jpg" to
"2009-10-18_img_1023.jpg"), but in a way that fuzzy matching should be
able to catch.

> That would overlap even more with the current --fuzzy functionality.
> There may be a better way to factor things.

Right. There are a lot of options that change the way rsync looks for
quick-check or basefile candidates, and due to the organic growth of
features their behavior is not always what users expect.

Maybe it is time to think about a more consistent way to control the
search for a basefile and quick-check candidate.

My first idea would be to add a more explicit form of control, e.g.
lists of key-value pairs that say _what_ aspect of a file you want to
match and _how well_ it needs to match in order to pass the quick check
or to be used as a base for the delta transfer.

Existing options can easily be translated into that explicit form so that
internally there would only be one control logic.

Here are some examples of the current options translated into that new
schema (I hope I got them right, but keep in mind that this is just a
sketch):

The default behaviour of rsync is something like this:

  --quick path=same,filename=same,size=same,mtime=same
  --delta path=same,filename=same

When given the "--checksum" option it is:

  --quick path=same,filename=same,checksum=same
  --delta path=same,filename=same

With the current "--fuzzy" option it is:

  --quick path=same,filename=same,size=same,mtime=same
  --delta path=same,filename=same
  --delta path=same,filename=fuzzy

With the current "--detect-renamed" option it is:

  --quick path=same,filename=same,size=same,mtime=same
  --delta path=same,filename=same
  --delta path=deleted,size=same,mtime=same

This is more easily extensible, as new aspects can be added without
changing current behaviour, and new "qualities" of matching can be added
to express things like:

Explicit source files (regardless of the src filename):

  --delta path=some/arbitrary/path/,filename=foo.img

A "pool directory" of source files:

  --quick path=my/pool/path/,filename=same,mtime=same,size=same
  --delta path=my/pool/path/,filename=fuzzy,size=fuzzy

Only use files as base if they are smaller:

  --delta path=same,filename=same,size=smaller

You could even express when to skip the delta comparisons completely,
e.g. if the destination file was created before the source file (a
situation that you encounter when syncing a directory with rotating log
files and a rotation has taken place at /src):

  --whole path=same,filename=same,ctime=older

Sure, this schema is more verbose than the current set of options, but
people would use it in scripts rather than on the command line, and there
you want your commands to be as verbose and explicit as possible. After
all, you'll want somebody else to understand your scripts without reading
every command's man page, and you'll want the behavior to stay constant
even when the next major version of rsync changes the behavior of one of
the summary options.
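
To make the proposed schema concrete, here is a small sketch of how such rule strings could be parsed and how existing options could expand into them. None of this exists in rsync: `--quick` and `--delta` are the hypothetical options from the proposal above, and `parse_rule`, `PRESETS` and `expand` are invented names.

```python
def parse_rule(spec):
    """Parse 'path=same,filename=fuzzy' into a dict of aspect -> quality.

    Assumption: values contain no commas (a path with a comma would need
    some escaping scheme that the proposal does not define).
    """
    rule = {}
    for item in spec.split(","):
        aspect, sep, quality = item.partition("=")
        if not sep or not aspect.strip() or not quality.strip():
            raise ValueError(f"malformed rule item: {item!r}")
        rule[aspect.strip()] = quality.strip()
    return rule


# Translations of existing options, copied from the examples in the post.
_DEFAULT = [("quick", "path=same,filename=same,size=same,mtime=same"),
            ("delta", "path=same,filename=same")]
PRESETS = {
    "default": _DEFAULT,
    "--checksum": [("quick", "path=same,filename=same,checksum=same"),
                   ("delta", "path=same,filename=same")],
    "--fuzzy": _DEFAULT + [("delta", "path=same,filename=fuzzy")],
    "--detect-renamed": _DEFAULT + [("delta", "path=deleted,size=same,mtime=same")],
}


def expand(option):
    """Return the explicit (kind, rule) list that an existing option stands for."""
    return [(kind, parse_rule(spec)) for kind, spec in PRESETS[option]]
```

For example, `expand("--fuzzy")` yields the two default rules followed by a delta rule with `filename` set to `fuzzy`, which is the "one control logic" idea in miniature.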
cheers
-henrik
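
The "--fuzzy-detect-renamed" idea from this message, combined with the order-of-magnitude size filter, could be sketched as follows. This is only an illustration of the proposal, not anything rsync implements; `difflib.SequenceMatcher` is a stand-in for name similarity, and the `deleted` list of (path, size) pairs is an assumed representation of the to-be-deleted list.

```python
import difflib
import math
import os


def same_magnitude(ssize, dsize):
    """False when rounded log10 sizes differ by more than one (the post's test)."""
    if ssize <= 0 or dsize <= 0:
        return ssize == dsize  # log10 is undefined for empty files
    return abs(round(math.log10(ssize)) - round(math.log10(dsize))) <= 1


def pick_from_deleted(new_name, new_size, deleted):
    """Choose a basis for a new file from the to-be-deleted (path, size) list."""
    base = os.path.basename(new_name)
    best, best_score = None, 0.0
    for path, size in deleted:
        if not same_magnitude(new_size, size):
            continue  # skip the string comparison for implausible sizes
        score = difflib.SequenceMatcher(None, base, os.path.basename(path)).ratio()
        if score > best_score:
            best, best_score = path, score
    return best
```

With the foo.jpg/bar.jpg example above, the renamed 2009-11-23-foo.jpg would be paired with the to-be-deleted new/foo.jpg, since that name shares the longest common run of characters.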