Duplicate file location

Hopefully the obsessed photographers here will have at least as good an idea about this subject as the experts on any photo list :-) :

Summary: Seeking opinions re duplicate-file location and management software. Needs to work with essentially unlimited capacity and number of files and number of connected or disconnected drives. "Actually working" is highly desirable. Free would be nice but is not essential. Emphasis is on photo files.
_________________

I have a large photo collection scattered across many hard drives of various capacity and vintage. There are also DVD and CD backup copies although (wisely or not) in recent years I've tended to use multiple HDDs rather than DVDs for backup. The older the files the more likely that there are numerous "lost" copies.

It would be nice to be able to relate & control the number of copies of a given original file. The copies may be identical or have different compressions or in some cases reduced resolutions. Copies will almost invariably share the same basic file name and date and time but may have extended names - usually as suffixes (added on the end of the original name) but in some cases also as prefixes (a meaningful prefix added at the start of the name, and/or even a date-coded prefix).

All copies of a file that matter will share the original EXIF information. Original date/time is preserved to the maximum extent possible**. (Some copying or editing processes* destroy EXIF information but I try to avoid this wherever possible.)

(* eg Photoshop has a nasty habit of playing with the EXIF.)

(** Some programs attempt to apply the current date/time despite one's intentions and some (like eg Microsoft Windows itself !!!) tend to corrupt the date/time on transfer from eg storage media. It is not uncommon for Windows itself to be unable to properly handle the time/date format of files coming from eg cameras or flash cards and to move the time 12 hours or 1 day or swap day/month or play other games. In such cases the EXIF is usually untouched and can be used to subsequently restore the correct values. Microsoft have produced software to address this shortcoming in their own products but using it may require you to be embedded in their surrounding cotton wool.)

At various stages backups have been reasonably well controlled but occasionally Murphy sneaks in and ... . It is likely that there are at least two copies of every file available. In some cases there are many many more copies. (eg a wedding will have a master copy and a working copy and then subsets and versions and ...).

I'd like to tidy up a vast number of files covering many GB of capacity and numerous hard drives. Possibly also various DVDs. Some but not all the drives will be live at once. Drive sizes are typically in the 300GB - 1 TB range but may be down to ?30GB?.

Total originals are probably in the 100,000 - 1 million range, intended backups at least double the count, working copies probably add 50% more (only major 'events' get substantial work) and the serendipitous copying (disc upgrades etc) adds some more. A total file count under 10 million would probably suffice - and it may be much lower than that.

It is almost certain that no existing duplicate management system will meet my need and I'll have to write something myself. But, it would be nice to hear what people do to manage such large collections. "Doing it properly" (apart from doing it from the start) probably involves correlation of core file names, date/time

Russell

--
http://www.piclist.com PIC/SX FAQ & list archive
View/change your membership options at
http://mailman.mit.edu/mailman/listinfo/piclist

Re: Duplicate file location

On Oct 28, 2009, at 8:31 PM, Russell wrote:

> It is almost certain that no existing duplicate management system
> will meet my need and I'll have to write something myself. But, it
> would be nice to hear what people do to manage such large
> collections. "Doing it properly" (apart from doing it from the
> start) probably involves correlation of core file names, date/time

Yeah, I have often thought that photos (and probably videos too) have enough meta-content that it ought to be possible to put together some really nice utilities for non-visual tasks like collection merging, duplicate elimination, backup (duplicate creation, eh?)

Some of the organization packages do a pretty good job of this just by virtue of the way they organize their photo database (e.g. iPhoto puts things in directories by date of photo taken, which is pretty helpful for detecting duplicates. Two files of the same name and size that were taken on the same date most likely are duplicates...) Google's Picasa tools do something similar. But it ought to be possible to do tasks outside the fancy GUIs a lot more efficiently, and especially so if you add "intimate" knowledge of how the major photo programs organize their data.

The only thing I've heard of so far is some CLI utilities that will let you do things based on the EXIF and other JPEG photo info, and some people have written substantial scripts on top of these to automatically organize photos as they're taken.

Let us know what you find out.

BillW

Re: Duplicate file location

> Some of the
> organization packages do a pretty good job of this just by virtue of
> the way they organize their photo database (ie iphoto puts things in
> directories by date of photo taken, which is pretty helpful for
> detecting duplicates. Two files of the same name and size that were
> taken on the same date most likely are duplicates...)

If files are unmodified then matching is not hard. But there are lots of derivatives or lookalikes that arise along the way that it would be nice to associate together. I don't necessarily want to delete duplicates or derivatives - mainly I want to know that they exist and where they are.

Saving an image from an editor, even if notionally unmodified, will change the file size unless you specifically save the identical file. Even at very low loss JPG compression the file size reduces substantially from the camera JPG file size so this is useful when providing DVDs etc for people to view. eg a 4 MB camera JPG may reduce to around 1.5 MB at JPG 90. While you can possibly see the difference in quality it usually requires flicker comparison (alternating between notionally identical images). I have spent a lot of time cropping sections out of images and pasting them next to each other and poring over them at one image pixel per display pixel or above to see what differences occur. It is almost always extremely subtle at say JPG 90, and when a 12 MP image is viewed on a 1.5 MP or so LCD screen the differences are essentially or actually non-existent.

If using RAW files an editor may produce a reduced-resolution JPG from the embedded JPG and this will have the original file name unless care is taken to differentiate it. eg the camera may produce RAW [actual RAW file, 1.5 MP JPG] + 12 MP JPG.

When I edit files I have a semiformal naming convention which adds suffixes to the file name:

R         modified (usually just colour balance / contrast / saturation / brightness / gamma)
RC        modified and cropped
Rn, RCn   n = 1 2 3 ... for variants
...S      with sharpening (may just use R)
...usm    unsharp mask (but may just use S)
...X      experimental, ie expect funny results
...Z      probably rubbish
etc.

THEN I may add text after that. Just extracting the basic filename + date/time will help ignore the add-ons.

BUT, to keep life busier, I modify the file names :-(. Sony use DSCxxxxx.yyy. The xxxxx COULD HAVE given them a 100,000 file range, but they (stupidly) actually only implement DSC0xxxx.yyy. So, I change the 0 to an incrementing digit every time the camera rolls over 10,000 images. So DSC02345.JPG may become DSC32345.JPG. Which is fine as long as I don't copy the DSC0... files anywhere before renaming. On the camera cards they are still DSC0... so comparisons 'in camera' are complexified. What fun we have.

> Google's Picasa
> tools do something similar. But it ought to be possible to do tasks
> outside the fancy GUIs a lot more efficiently, and especially so if
> you add "intimate" knowledge of how the major photo programs organize
> their data.

I have so far avoided writing data to the available EXIF fields, but it looks like the day may be coming. This removes the very valuable ability to use 'visual inspection'.

> The only thing I've heard of so far is some CLI utilities that will
> let you do things based on the exif and other jpeg photo info, and
> some people have written substantial scripts on top of these to
> automatically organize photos as they're taken.

That may yet be necessary, alas.

> Let us know what you find out.

Will do.

I presently have "Duplicate Cleaner" running to see how it goes. On the onboard 1TB + 300 GB + backup 1 TB it reports 1023875 image files and after 0.4% processing complete has found 3728 "duplicate sets". If that is representative the final count will be about 100,000 "duplicate sets" on these 3 drives alone. As the 1TB external is a backup drive for some of the onboard photos that's not surprising - final figure may be higher depending on where it has looked so far.

Russell
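The "extract the basic filename and ignore the add-ons" idea from the message above might be sketched roughly as follows. This is only an illustration: it assumes the camera stem is always `DSC` followed by exactly five digits, with any prefixes, suffixes (R, RC, usm, ...) and free text attached around it. The function names `core_name`, `camera_counter` and `group_by_core` are made up for this sketch.

```python
import re
from collections import defaultdict
from pathlib import PurePath

# Assumption: the camera stem is "DSC" plus five digits. The first digit is
# the hand-edited rollover digit, the last four are the camera's own counter.
CORE_PATTERN = re.compile(r"DSC(\d)(\d{4})", re.IGNORECASE)


def core_name(filename):
    """Return the camera stem (e.g. 'DSC32345'), or None if there is none."""
    match = CORE_PATTERN.search(PurePath(filename).stem)
    if match is None:
        return None
    return "DSC" + match.group(1) + match.group(2)


def camera_counter(filename):
    """Return only the four-digit in-camera counter (e.g. '2345'), or None.

    Useful for relating a not-yet-renamed DSC0xxxx file on a card to a
    renamed DSCnxxxx file on disk.
    """
    match = CORE_PATTERN.search(PurePath(filename).stem)
    return None if match is None else match.group(2)


def group_by_core(filenames):
    """Group file names by camera stem, skipping names without one."""
    groups = defaultdict(list)
    for name in filenames:
        key = core_name(name)
        if key is not None:
            groups[key].append(name)
    return dict(groups)
```

The stem alone is not a safe identity, since two cameras can produce the same `DSCnnnnn`, so in practice it would be combined with the EXIF date/time as the message suggests.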

Re: Duplicate file location

Apptech wrote:

>> The only thing I've heard of so far is some CLI utilities that will
>> let you do things based on the exif and other jpeg photo info, and
>> some people have written substantial scripts on top of these to
>> automatically organize photos as they're taken.
>
> That may yet be necessary, alas.

Unless you find what you're looking for, the next step could be looking for a tool that extracts EXIF data into something like CSV (or another format that you can load into a database) and use database organization capabilities to group the files as you want them grouped.

Gerhard
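As a rough illustration of the CSV-into-a-database idea, the sketch below loads a small EXIF listing into an in-memory SQLite table and asks for every capture time that appears on more than one file. The column names (`path`, `taken`, `model`), the sample rows and the function names are all assumptions made for this example. A real EXIF export tool would have its own column layout.

```python
import csv
import io
import sqlite3

# Made-up sample data standing in for the output of an EXIF export tool.
SAMPLE_CSV = """\
path,taken,model
D:/photos/DSC32345.JPG,2009:10:25 12:34:56,CameraA
E:/backup/DSC32345R.JPG,2009:10:25 12:34:56,CameraA
D:/photos/DSC32346.JPG,2009:10:25 12:35:10,CameraA
"""


def load_exif_csv(csv_text):
    """Load CSV text into an in-memory SQLite table named 'photos'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE photos (path TEXT, taken TEXT, model TEXT)")
    rows = csv.DictReader(io.StringIO(csv_text))
    conn.executemany(
        "INSERT INTO photos VALUES (?, ?, ?)",
        [(row["path"], row["taken"], row["model"]) for row in rows],
    )
    return conn


def candidate_sets(conn):
    """Return (taken, model, [paths]) for capture times seen more than once."""
    keys = conn.execute(
        "SELECT taken, model FROM photos "
        "GROUP BY taken, model HAVING COUNT(*) > 1 "
        "ORDER BY taken, model"
    ).fetchall()
    result = []
    for taken, model in keys:
        paths = [
            row[0]
            for row in conn.execute(
                "SELECT path FROM photos WHERE taken = ? AND model = ? "
                "ORDER BY path",
                (taken, model),
            )
        ]
        result.append((taken, model, paths))
    return result
```

Because the table only holds metadata, rows from drives that are currently disconnected can stay in the database, which fits the "some but not all drives live at once" requirement.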

Re: Duplicate file location

There exists technology that will match photos based on their content... I seem to recall some sort of 'fuzzy checksum' type methodology somewhere. Searching Google for 'image comparison' turns up some interesting hits.

I also remember watching a presentation on one of the Dish Network research channels about some researchers doing some really cool stuff where they'd take photos from Flickr which were tagged with a given location, and it would build a 3D model of the site, and automatically figure out where each picture was taken... I guess not applicable to your application, but seriously cool that they figured out how to take a whole pile of images and stitch them together automatically to build the 3D model.
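One common form of 'fuzzy checksum' is an average hash, sketched below. It is not necessarily the method the poster had in mind. The sketch works on a plain 2D list of grayscale values so that it stays self-contained. Decoding a JPEG into such a grid is left out, and the image is assumed to be at least `size` pixels in each dimension.

```python
def average_hash(pixels, size=8):
    """Return a size*size-bit integer fingerprint of a grayscale image.

    pixels is a list of equal-length rows of brightness values. The image
    is reduced to a size x size grid of block averages, and each bit records
    whether a block is brighter than the mean of all blocks.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for cy in range(size):
        y0, y1 = cy * height // size, (cy + 1) * height // size
        for cx in range(size):
            x0, x1 = cx * width // size, (cx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")
```

A small Hamming distance suggests a recompressed or resized copy of the same picture. The cut-off has to be tuned by hand, and long runs of near-identical shots of the same scene will also hash close together.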

Re: Duplicate file location

Why don't you guys use CVS or SVN, so that you can commit files to the repository each time you modify a photo? You commit the original version first, then each time you adjust, crop, or otherwise change the picture you commit the changes, so you have the backup and the version history at the same time. After this you can back up the repository as many times as you want using a traditional backup system.

Tamas

On Wed, Oct 28, 2009 at 12:27 PM, Lee Jones <lee@...> wrote:
>> Hopefully the obsessed photographers here
>
> Ahhh, so OP is obsessed photographer not original poster. :-)
>
>
>> Summary: Seeking opinions re duplicate-file location and management
>> software. Needs to work with essentially unlimited capacity and number
>> of files and number of connected or disconnected drives. "Actually
>> working" is highly desirable. Free would be nice but is not essential.
>> Emphasis is on photo files.
>
> I'm very interested in solving that same problem. I have various
> scripts (messy, keyed to my idiosyncrasies) and ways of (trying to)
> ensure that I have at least 2 copies of each original photo on
> separate devices. I'm looking for more of an archive manager.
>
> I also have ideas & an outline for a program to deal with it but
> it's nowhere near completion (and unfunded development). Biggest
> stumbling block is coming up with a robust signature mechanism to
> identify nearly-identical or derivative works -- e.g. differentiate
> between a JPEG that's been recompressed (i.e. identical) versus
> one that has been edited (i.e. new work).
>
> Matching bit-wise identical files is not trivial (partly due to
> file sizes) but not too difficult either. It also has to take care
> of files that moved between directory tree A and directory tree B.
> I think I've solved both of these parts.
>
>
>> I have a large photo collection scattered across many hard drives of
>> various capacity and vintage. There are also DVD and CD backup copies
>> although (wisely or not) in recent years I've tended to use multiple
>> HDDs rather than DVDs for backup. The older the files the more likely
>> that there are numerous "lost" copies.
>
> Wow, that exactly describes my environment too. :-)
>
>
>> All copies of a file that matter will share the original EXIF
>> information. Original date/time is preserved to the maximum extent
>> possible**. (Some copying or editing processes* destroy EXIF
>
> Identical EXIF is likely but not always true (more below).
>
> I think in terms of 2 categories -- archive and work. Original
> images (now Raw, formerly JPEG) are always archive. (They _are_
> my digital negatives.) Proofs sent to clients are work files.
> Final, printer ready files are archive. Reduced resolution or
> alternate crops, usually client-approval samples, are usually
> work files but may be archive class.
>
>
>> It is not uncommon for Windows itself to not be able to properly
>> handle the time/date format of files coming from eg cameras or
>> flash cards and to move the time 12 hours or 1 day or swap
>> day/month or play other games. In such cases the EXIF is usually
>> untouched and [...] subsequently restore the correct values).
>
> Is the correct time correct? One niggling problem is air travel.
> Well, actually, it's my inability to always perfectly execute a
> pre-planned process.
>
> I keep the camera on the departure time zone through the flight
> (arbitrary decision, use as convention). Upon landing, I change
> the camera's time stamp to match current local time. I've found
> doing that makes it much easier to correlate photos with notes in
> my time planner, receipts, planned events, etc -- helps me to
> identify a photo's where, what, and (sometimes) why.
>
> Sometimes I forget to update one or all camera bodies on landing.
> Or I'm just too busy taking photos during deplaning and in the
> airport (or talking to security because I'm taking photos) -- and
> then I forget. Having to keep a "time adjustment" for a batch of
> pictures is another issue that I'd like to address.
>
> But if I edit the EXIF time stamps, then the "original" and the
> "adjusted original" photo are identical but not bit-wise identical.
> It goes back to a mechanism to compute a unique signature.
>
> In no case am I willing to adjust the EXIF time stamp before I
> upload the original capture files from the camera cards to other
> media (onto 2 devices or 1 device & a backup DVD). DVD backups,
> obviously, can't be "fixed" if I update the EXIF time stamps.
>
> And, of course, my photo rate goes up by a multiplier (circa 10X)
> when I'm traveling -- likely a direct correlation between number
> of time zones changed and number of extra photos taken.
>
>
>> Total originals are probably in the 100,000 - 1 million range,
>
> I'm at the lower part of that range. You're crazy... well, crazier. :-)
>
> Lee Jones

--
/* www.mcuhobby.com */ int main() { char *a,*s,*q; printf(s="/* www.mcuhobby.com */ int main() { char *a,*s,*q; printf(s=%s%s%s, q=%s%s%s%s,s,q,q,a=%s%s%s%s,q,q,q,a,a,q); }", q="\"",s,q,q,a="\\",q,q,q,a,a,q); }
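For the bit-wise identical case described in the quoted message, one common approach (not necessarily the one Lee Jones implemented) is to group files by size first and hash only the files whose size collides. Keying on content also copes with files that have moved or been renamed between directory trees. The sketch below uses SHA-256 as an assumed choice of hash, and `find_identical` is a made-up name.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large images never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical(roots):
    """Return lists of paths whose contents are byte-for-byte identical."""
    # Pass 1: cheap grouping by size; a unique size cannot have a duplicate.
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only the files that share a size with another file.
    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            by_digest[(size, file_digest(path))].append(path)

    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

This does nothing for the harder problem raised in the message, where a copy with an adjusted EXIF time stamp is the same picture but no longer the same bytes.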

Re: Duplicate file location

> There exists technology that will match photos based on their
> content... I seem to recall some sort of 'fuzzy checksum' type
> methodology somewhere. It looks like 'image comparison' on google
> reveals some interesting hits.

That would be good BUT I have in some cases got many very similar photos. eg I got up at 5:30 to walk up to the Long Ji rice terraces before sunrise, took many hundreds of photos, walked back down and had breakfast with my wife and then we went up again and I took hundreds more :-). I suspect that image matching software would be hard put to separate that lot out ;-).

> I also remember watching a presentation on one of the Dish Network
> research channels about some researchers doing some really cool stuff
> where they'd take photos from flikr which were tagged with a given
> location, and it would build a 3d model of the site, and automatically
> figure out where the picture was taken... I guess not applicable to your
> application, but seriously cool that they figured out how to take a
> whole pile of images and stich them together automatically to build the
> 3d model.

The 3D site builder would have a ball though. I stupidly did not GPS log the location and have not been able to locate it with certainty on Google Earth - resolution in that area is too low.

Russell

Re: Duplicate file location

Russell wrote:
> Hopefully the obsessed photographers here will have at least a good idea about this subject as the experts on any photo list :-) :
>
> Summary: Seeking opinions re duplicate-file location and management software. Needs to work with essentially unlimited capacity and number of files and number of connected or disconnected drives. "Actually working" is highly desirable. Free would be nice but is not essential. Emphasis is on photo files.
> _________________
>
> I have a large photo collection scattered across many hard drives of various capacity and vintage. .....

Hi Russell.

So, what works for me.... and it is somewhat recently implemented (in the past few years....).... like you I found I have stuff everywhere. It does not sound quite as bad as your situation, but only perhaps because my volume was somewhat less.

I have managed to use a few tools to get things nice and tidy again. Primary among them are Linux, jhead, rsync, and some custom scripts and Java programs.

jhead is a command-line tool that does a fair number of things, but all I really use it for is to:

1. alter the file name to represent the date/time the image was taken... for example, DSC4321.JPG becomes Img20091025.123456.78.jpg for a photo taken 78 hundredths of a second into the second at 2009/10/25 12:34:56. It does this by using the EXIF data.
2. I have also modified jhead slightly to be able to simultaneously rename the RAW file (if any exists) associated with the JPG to have the same name but a different extension.
3. jhead is also instructed to inspect the orientation data in the EXIF, and it will losslessly rotate JPG images if they were taken portrait style...

I have used the above to re-process all my previously messy files, but I also use it now as part of the routine workflow for loading pictures from my camera to my server.

I group the photos into 'batches' where, normally, a batch is what I download from each memory card.... each batch is in a different folder, and the folders are stored in a calendar year folder, and named sequentially/chronologically.... i.e. I have .../2009/Batch001/Img20090101.000501.20.jpg for that photo taken 5 minutes after the new year....

So, the new system is one where the photos are loaded off the camera and pre-processed to rotate, rename, and file them (and set all permissions to read-only). The new batch is then loaded into the Gallery2 web server (which I have also slightly modified to ignore processing RAW photos...), and it loads each photo, creates a couple of smaller image sizes for each photo, and symbolically links the 'full size' image to the main photo repository.

Then, I manipulate things in the Gallery2 web app to build photo albums of 'good' photos.

I have then written a Java program that you can call... it inspects these 'real' albums, and, from that, it can track back the real photos that the album uses, and it pulls out the RAW photo with the same name as the JPG in Gallery2. I can then re-touch and batch-process the RAW files for physical prints in a physical album, or for enlargements.

The actual folder containing the original folders is exported read-only using Samba to my other Windows machines. Being read-only is a good thing because otherwise Windows will mess with the files, etc.

If I retouch images, I normally don't pay special attention to the changed file, and delete it after it is used for enlargements/prints, etc. I can always re-retouch the pics from the originals. Some special re-works get re-filed with the originals manually, and then manually re-imported to Gallery2. For the most part my re-touching is pretty simple and easy to reproduce.

Once the photos were all consistently named, and (normally) stored in separate folders which I have named sequentially in each calendar year.... it became a whole lot easier to start filing the 'messy' photos in with the existing photos, and removing duplicates, etc.

Again, having established this system, I was then able to go back and re-process all the messy/scattered files: take each set of messy files, re-process them using jhead, etc, and see if there were duplicate names, etc. In some obvious cases I was able to completely discard wads of photos as duplicates, in other cases I had to do some manual merging, and in other cases I decided to skip the manual processing and add a separate 'batch' and live with duplicate sets of 'originals'.

Now that I have a system going, I am also able to keep all the photos, and only photos, in a 'valuable' folder. Because Gallery2 symbolically links to the photos, the Gallery2 re-sized files for web-viewing are not in the same place.

I then use rsync in an automated system to keep frequent (hourly) snapshot-like backups of all photos. The 'originals' are stored on two 1TB drives which are set up in a RAID0 (striped) array for performance reasons, but this significantly increases the risk of losing data to a drive failure. So, every hour all photos are re-synced on to a RAID1 (mirrored) set of 1.5TB drives. The sync is 'intelligent', in that it keeps a history of all file modifications (there are none for photos except new photos being added), but the same system is also used for other documents and settings which change. As a result of this system I have a copy of the state of every file for every hour in the past day, every day in the past week, every week in the past 3 months, every month in the past year, and every year since the system started..... Me deleting a file or making a stupid change should be easy to recover from. Even formatting my drive means at most an hour of work lost....

Once every month I connect an external 1.5TB drive and completely re-sync my complete historical backup. This disk is then stored safely offsite ... ;-)

I submitted patches for jhead but they were not added to the tool, but if you want I can happily send you the changes I have made (adding sub-second granularity to timestamps, processing RAW files along with the JPG, etc.), and I can share other details and code.

The system works for me in the sense that I now have about 78K photos occupying about 560Gig. For the record, I will never back up to CD/DVD/etc again. Hard-drives only for me. The same drives are used for other data like email, documents, music, etc. I have 800Gig of data actively processed through the system regularly using the same backup and storage system.

Because my data is stored on a central server I have also invested in Gigabit networks and, with the RAID0 system used for storage, it is actually faster for me to process data over the network than to process it locally. I do in fact max out the network at around 80MB/s when I am transferring files around for upload or editing.

Also, because the 'server' does a lot of photo processing when uploading photos, and because Gallery2 is somewhat intensive, I have a pretty beefy server (quad-core with 8Gig mem, etc.) and this helps a lot to make the web site 'slick'.

It almost embarrasses me that my server got out of control... 6TB of storage permanently attached, with 3TB on hand (2 external backup 1.5T drives)... with a beefy network, CPU, memory, etc. But, it does the job really really well, and I am 'comfortable' with my data in that it is as safe as I can afford at a relatively low price, all in all.... just over $1K CAD, and no cost in software....

I just wish I could find a good software solution for photo-editing and processing the Nikon RAW files I work off in Linux....

Anyway, I thought you may want to hear how I do it.

Rolf
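The naming and filing scheme described above can be illustrated with a few lines of Python. This is only a sketch of the resulting names, not of jhead or of the poster's patches. It assumes the last field of the name is hundredths of a second, which is what the two example names in the message suggest. `image_name` and `batch_path` are hypothetical helper names.

```python
from datetime import datetime


def image_name(taken, extension="jpg"):
    """Build a name like Img20091025.123456.78.jpg from a capture time.

    Assumption: the final numeric field is hundredths of a second.
    """
    hundredths = taken.microsecond // 10000
    return f"Img{taken:%Y%m%d.%H%M%S}.{hundredths:02d}.{extension}"


def batch_path(taken, batch_number, extension="jpg"):
    """Build the year/batch path, e.g. 2009/Batch001/Img....jpg."""
    return f"{taken.year}/Batch{batch_number:03d}/{image_name(taken, extension)}"
```

Passing a RAW extension to `image_name` gives the RAW file and its JPG the same stem, which mirrors the paired renaming described in point 2 of the message. Once every file is named from its own capture time, duplicates from different drives collide on name, which is what made the later merge manageable.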

Re: Duplicate file location

Russell,

have you seen these guys? <http://www.digikam.org/drupal/>

It is a Linux program, but there are people using it on Windows. It may not be everything you want, but I only produce a couple of hundred photos a month, and so far Digikam seems the best option for me. I've tried Picasa, Gallery2, and plain folder filling with rsync. The site includes a wiki and help files that could be searched for the features you need.

regards
--
...AndrewB

Re: Duplicate file location

:: <http://www.digikam.org/drupal/> it is a linux program, but
:: there are
:: people using it on Win.

I like Digikam, one of only two reasons I keep a Linux partition on my laptop. Even understands Canon raw files.

Colin
--
cdb, colin@... on 10/30/2009
Web presence: www.btech-online.co.uk
Hosted by: www.1and1.co.uk/?k_id=7988359