Duplicate file location

Duplicate file location

by Russell McMahon-4

Hopefully the obsessed photographers here will have at least as good an idea about this subject as the experts on any photo list :-) :

Summary:  Seeking opinions re duplicate-file location and management software. Needs to work with essentially unlimited capacity and number of files and number of connected or disconnected drives. "Actually working" is highly desirable. Free would be nice but is not essential. Emphasis is on photo files.
_________________

I have a large photo collection scattered across many hard drives of various capacity and vintage. There are also DVD and CD backup copies although (wisely or not) in recent years I've tended to use multiple HDDs rather than DVDs for backup. The older the files the more likely that there are numerous "lost" copies. It would be nice to be able to relate & control the number of copies of a given original file. The copies may be identical or have different compressions or in some cases reduced resolutions. Copies will almost invariably share the same basic file name and date and time but may have extended names - usually as suffixes (added on the end of the original name) but in some cases also as prefixes (a meaningful prefix added at the start of the name, and/or even a date-coded prefix). All copies of a file that matter will share the original EXIF information. Original date/time is preserved to the maximum extent possible**. (Some copying or editing processes* destroy EXIF information but I try to avoid this wherever possible.)

(* eg Photoshop has a nasty habit of playing with the EXIF.)

(** Some programs attempt to apply current date/time despite one's intentions and some (like eg Microsoft Windows itself !!!) tend to corrupt the date/time on transfer from eg storage media. It is not uncommon for Windows itself to not be able to properly handle the time/date format of files coming from eg cameras or flash cards and to move the time 12 hours or 1 day or swap day/month or play other games. In such cases the EXIF is usually untouched and can be used to subsequently restore the correct values. Microsoft have produced software to address this shortcoming in their own products but using it may require you to be embedded in their surrounding cotton wool.)

At various stages backups have been reasonably well controlled but occasionally Murphy sneaks in and ... . It is likely that there are at least two copies of every file available. In some cases there are many many more copies. (eg a wedding will have a master copy and a working copy and then subsets and versions and ...).

I'd like to tidy up a vast number of files covering many GB of capacity and numerous hard drives. Possibly also various DVDs.
Some but not all the drives will be live at once. Drive sizes are typically in the 300GB - 1 TB range but may be down to ?30GB?.

Total originals are probably in the 100,000 - 1 million range, intended backups at least double the count, working copies probably add 50% more (only major 'events' get substantial work) and the serendipitous copying (disc upgrades etc) add some more. A total file count under 10 million would probably suffice - and it may be much lower than that.

It is almost certain that no existing duplicate management system will meet my need and I'll have to write something myself. But, it would be nice to hear what people do to manage such large collections. "Doing it properly" (apart from doing it from the start) probably involves correlation of core file names, date/time



 Russell
--
http://www.piclist.com PIC/SX FAQ & list archive
View/change your membership options at
http://mailman.mit.edu/mailman/listinfo/piclist

Re: Duplicate file location

by William "Chops" Westfield


On Oct 28, 2009, at 8:31 PM, Russell wrote:

> It is almost certain that no existing duplicate management system  
> will meet my need and I'll have to write something myself. But, it  
> would be nice to hear what people do to manage such large  
> collections. "Doing it properly" (apart from doing it from the  
> start) probably involves correlation of core file names, date/time

Yeah, I have often thought that photos (and probably videos too) have  
enough meta-content that it ought to be possible to put together some  
really nice utilities for non-visual tasks like collection merging,  
duplicate elimination, backup (duplicate creation, eh?)  Some of the  
organization packages do a pretty good job of this just by virtue of  
the way they organize their photo database (ie iphoto puts things in  
directories by date of photo taken, which is pretty helpful for  
detecting duplicates. Two files of the same name and size that were  
taken on the same date most likely are duplicates...)  Google's Picasa  
tools do something similar.  But it ought to be possible to do tasks  
outside the fancy GUIs a lot more efficiently, and especially so if  
you add "intimate" knowledge of how the major photo programs organize  
their data.
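The "same name and size" heuristic above can be tried without any photo package. The following is only a minimal sketch in Python: the function name `likely_duplicates` is made up for illustration, and it leaves out the "taken on the same date" part, which would need the EXIF data.

```python
import os
from collections import defaultdict

def likely_duplicates(roots):
    """Group files under the given roots by (lower-cased name, size in bytes)."""
    groups = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    size = os.path.getsize(path)
                except OSError:
                    continue  # unreadable or vanished file: skip it
                groups[(name.lower(), size)].append(path)
    # Only keys seen more than once are duplicate candidates
    return {key: paths for key, paths in groups.items() if len(paths) > 1}
```

A match here is only a candidate: two different photos can share a name and size, so anything destructive should wait for a content comparison.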

The only thing I've heard of so far is some CLI utilities that will  
let you do things based on the exif and other jpeg photo info, and  
some people have written substantial scripts on top of these to  
automatically organize photos as they're taken.

Let us know what you find out.

BillW

Re: Duplicate file location

by Russell McMahon-4

> Some of the
> organization packages do a pretty good job of this just by virtue of
> the way they organize their photo database (ie iphoto puts things in
> directories by date of photo taken, which is pretty helpful for
> detecting duplicates. Two files of the same name and size that were
> taken on the same date most likely are duplicates...)

If files are unmodified then matching is not hard.
But there are lots of derivatives or lookalikes that arise along the way
that it would be nice to associate together. I don't necessarily want to
delete duplicates or derivatives - mainly I want to know that they exist and
where they are.

Saving an image from an editor, even if notionally unmodified, will change
file size unless you specifically save the identical file. Even at very low
loss JPG compression the file size reduces substantially from the camera JPG
file size so this is useful when providing DVDs etc for people to view. eg a
4 MB camera JPG may reduce to around 1.5 MB at JPG 90. While you can
possibly see the difference in quality it usually requires flicker
comparison (alternate between notionally identical images). I have spent
a lot of time cropping sections out of images and pasting them next to each
other and poring over them at pixel per display pixel resolution or above to
see what differences occur. It is almost always extremely subtle at say
JPG90 and when a 12 MP image is viewed on a 1.5 MP or so LCD screen the
differences are essentially or actually non-existent.

If using RAW files an editor may produce a reduced-resolution JPG from the
embedded JPG and this will have the original file name unless care is taken
to differentiate it.
eg camera may produce RAW[ActualRAW file, 1.5 MP JPG] + 12 MP JPG.

When I edit files I have a semiformal naming convention which adds suffixes
to the file name:
  R        modified (usually just colour balance / contrast / saturation /
           brightness / gamma)
  RC       modified and cropped
  Rn, RCn  n = 1 2 3 ... for variants
  ...S     with sharpening (may just use R)
  ...usm   unsharp mask (but may just use S)
  ...X     experimental ie expect funny results
  ...Z     probably rubbish
etc.
THEN I may add text after that.

Just extracting the basic filename + date/time will help ignore the add ons.

BUT, to keep life busier, I modify the file names :-(.
Sony use DSCxxxxx.yyy
The xxxxx COULD HAVE given them a 100,000 file range, but they (stupidly)
actually only implement DSC0xxxx.yyy
So, I change the 0 to an incrementing digit every time the camera rolls over
10,000 images.
So DSC02345.JPG may become DSC32345.JPG. Which is fine as long as I don't
copy the DSC0... files anywhere before renaming.
On the camera cards they are still DSC0... so comparisons 'in camera' are
complexified.
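Pulling the basic name out from under the suffixes, prefixes and the rollover digit might look something like the sketch below. It assumes the Sony-style DSC + 5 digits pattern described above; the function names `core_name` and `same_frame_number` are invented for illustration.

```python
import re

# "DSC", one rollover digit, then the 4-digit frame number; anything before or
# after it (date prefixes, R/RC/S suffixes, free text) is ignored.
_DSC = re.compile(r"DSC(\d)(\d{4})", re.IGNORECASE)

def core_name(filename):
    """Return (rollover_digit, frame_number) for a Sony-style name, or None."""
    m = _DSC.search(filename)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)

def same_frame_number(a, b):
    """True if two names share the 4-digit frame number, ignoring the rollover digit."""
    ca, cb = core_name(a), core_name(b)
    return ca is not None and cb is not None and ca[1] == cb[1]
```

For example, `core_name("DSC32345RC2 sunset.JPG")` gives `(3, "2345")`. The frame number repeats every 10,000 images, so it only narrows the search; it needs to be paired with the EXIF date/time before two files are treated as related.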

What fun we have.

> Google's Picasa
> tools do something similar.  But it ought to be possible to do tasks
> outside the fancy GUIs a lot more efficiently, and especially so if
> you add "intimate" knowledge of how the major photo programs organize
> their data.

I have so far avoided writing data to the available EXIF fields, but it
looks like the day may be coming. This removes the very valuable ability to
use 'visual inspection'.

> The only thing I've heard of so far is some CLI utilities that will
> let you do things based on the exif and other jpeg photo info, and
> some people have written substantial scripts on top of these to
> automatically organize photos as they're taken.

That may yet be necessary, alas.

> Let us know what you find out.

Will do.

I presently have "Duplicate Cleaner" running to see how it goes.
On the onboard 1TB + 300 GB + backup 1 TB it reports 1023875 image files and
after 0.4% processing complete has found 3728 "duplicate sets". If that is
representative the final count will be about 100,000 "duplicate sets" on
these 3 drives alone. As the 1TB external is a backup drive for some of the
onboard photos that's not surprising - final figure may be higher depending
on where it has looked so far.

 Russell

Re: Duplicate file location

by Gerhard Fiedler

Apptech wrote:

>> The only thing I've heard of so far is some CLI utilities that will
>> let you do things based on the exif and other jpeg photo info, and
>> some people have written substantial scripts on top of these to
>> automatically organize photos as they're taken.
>
> That may yet be necessary, alas.

Unless you find what you're looking for, the next step could be looking
for a tool that extracts EXIF data into something like CSV (or another
format that you can load into a database) and use database organization
capabilities to group the files as you want them grouped.
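As a rough illustration of the CSV-into-a-database idea, here is a sketch using Python's built-in csv and sqlite3 modules. The column names `SourceFile` and `DateTimeOriginal` are assumptions about what the extraction tool writes, and the table layout and function names are invented for the example.

```python
import csv
import io
import sqlite3

def load_exif_csv(csv_text):
    """Load rows with SourceFile and DateTimeOriginal columns into an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE photo (path TEXT, taken TEXT)")
    rows = csv.DictReader(io.StringIO(csv_text))
    db.executemany(
        "INSERT INTO photo (path, taken) VALUES (?, ?)",
        ((r["SourceFile"], r["DateTimeOriginal"]) for r in rows),
    )
    return db

def groups_by_capture_time(db):
    """Return {capture time: [paths]} for capture times shared by more than one file."""
    query = """
        SELECT taken, path FROM photo
        WHERE taken IN (SELECT taken FROM photo GROUP BY taken HAVING COUNT(*) > 1)
        ORDER BY taken, path
    """
    groups = {}
    for taken, path in db.execute(query):
        groups.setdefault(taken, []).append(path)
    return groups
```

Grouping on capture time alone will also lump together burst shots taken within the same second, so in practice the key would probably need the camera model or the core file name added.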

Gerhard
Re: Duplicate file location

by Forrest W Christian-2

There exists technology that will match photos based on their
content...  I seem to recall some sort of 'fuzzy checksum' type
methodology somewhere.   It looks like 'image comparison' on google
reveals some interesting hits.
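One simple technique of this kind is an "average hash"; whether that is the method half-remembered above is a guess. The sketch below works on an already decoded grayscale image given as a list of rows of brightness values (decoding the JPG is left out), and the function names are invented for the example.

```python
def average_hash(pixels, size=8):
    """Return a size*size-bit fingerprint of a grayscale image.

    pixels is a list of rows of brightness values; its width and height
    must each be at least `size`.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for cy in range(size):
        y0, y1 = cy * height // size, (cy + 1) * height // size
        for cx in range(size):
            x0, x1 = cx * width // size, (cx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))  # mean brightness of this cell
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)  # 1 = brighter than average
    return bits

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Recompressed or resized copies should land within a few bits of each other, but so may different frames of the same scene, so the distance threshold is a judgment call rather than something this sketch settles.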

I also remember watching a presentation on one of the Dish Network
research channels about some researchers doing some really cool stuff
where they'd take photos from Flickr which were tagged with a given
location, and it would build a 3d model of the site, and automatically
figure out where the picture was taken... I guess not applicable to your
application, but seriously cool that they figured out how to take a
whole pile of images and stitch them together automatically to build the
3d model.

Re: Duplicate file location

by Lee Jones

> Hopefully the obsessed photographers here

Ahhh, so OP is obsessed photographer not original poster. :-)


> Summary:  Seeking opinions re duplicate-file location and management
> software. Needs to work with essentially unlimited capacity and number
> of files and number of connected or disconnected drives. "Actually
> working" is highly desirable. Free would be nice but is not essential.
> Emphasis is on photo files.

I'm very interested in solving that same problem.  I have various
scripts (messy, keyed to my idiosyncrasies) and ways of (trying to)
ensure that I have at least 2 copies of each original photo on
separate devices.  I'm looking for more of an archive manager.

I also have ideas & an outline for a program to deal with it but
it's nowhere near completion (and unfunded development).  Biggest
stumbling block is coming up with a robust signature mechanism to
identify nearly-identical or derivative works -- e.g. differentiate
between a JPEG that's been recompressed (i.e. identical) versus
one that has been edited (i.e. new work).

Matching bit-wise identical files is not trivial (partly due to
file sizes) but not too difficult either.  It also has to take care
of files that moved between directory tree A and directory tree B.
I think I've solved both of these parts.
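A common way to handle the bit-wise identical case, regardless of which directory tree a file has moved to, is to group by size first and hash only the files whose sizes collide. The sketch below is one such approach in Python, not a description of the scripts mentioned above; the function names are invented.

```python
import hashlib
import os
from collections import defaultdict

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files need little memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def identical_sets(roots):
    """Return lists of paths that share both size and SHA-256 digest."""
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirs, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                by_size[os.path.getsize(path)].append(path)
    result = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have an identical twin
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        result.extend(sorted(p) for p in by_hash.values() if len(p) > 1)
    return result
```

Because names and locations are never compared, a file renamed or moved from tree A to tree B still matches. Storing the digests would also allow comparison against drives that are not currently connected.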


> I have a large photo collection scattered across many hard drives of
> various capacity and vintage. There are also DVD and CD backup copies
> although (wisely or not) in recent years I've tended to use multiple
> HDDs rather than DVDs for backup. The older the files the more likely
> that there are numerous "lost" copies.

Wow, that exactly describes my environment too. :-)


> All copies of a file that matter will share the original EXIF
> information. Original date/time is preserved to the maximum extent
> possible**. (Some copying or editing processes* destroy EXIF

Identical EXIF is likely but not always true (more below).

I think in terms of 2 categories -- archive and work.  Original
images (now Raw, formerly JPEG) are always archive.  (They _are_
my digital negatives.)  Proofs sent to clients are work files.
Final, printer ready files are archive.  Reduced resolution or
alternate crops, usually client-approval samples, are usually
work files but may be archive class.


> It is not uncommon for Windows itself to not be able to properly
> handle the time/date format of files coming from eg cameras or
> flash cards and to move the time 12 hours or 1 day or swap
> day/month or play other games. In such cases the EXIF is usually
> untouched and [...] subsequently restore the correct values).

Is the correct time correct?  One niggling problem is air travel.
Well, actually, it's my inability to always perfectly execute a
pre-planned process.

I keep the camera on the departure time zone through the flight
(arbitrary decision, use as convention).  Upon landing, I change
the camera's time stamp to match current local time.  I've found
doing that makes it much easier to correlate photos with notes in
my time planner, receipts, planned events, etc -- helps me to
identify a photo's where, what, and (sometimes) why.

Sometimes I forget to update one or all camera bodies on landing.
Or I'm just too busy taking photos during deplaning and in the
airport (or talking to security because I'm taking photos) -- and
then I forget.  Having to keep a "time adjustment" for a batch of
pictures is another issue that I'd like to address.

But if I edit the EXIF time stamps, then the "original" and the
"adjusted original" photo are identical but not bit-wise identical.
It goes back to a mechanism to compute a unique signature.
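One way around the bit-wise problem is to leave the files alone and keep the batch "time adjustment" beside them. The sketch below assumes the EXIF timestamps have already been read out as strings in the usual "YYYY:MM:DD HH:MM:SS" layout; the function names and the idea of a per-batch offset in hours are illustrative only.

```python
from datetime import datetime, timedelta

EXIF_FORMAT = "%Y:%m:%d %H:%M:%S"  # layout of the EXIF DateTimeOriginal string

def adjusted_time(exif_value, offset_hours):
    """Return the capture time shifted by a batch offset, as a datetime."""
    return datetime.strptime(exif_value, EXIF_FORMAT) + timedelta(hours=offset_hours)

def adjust_batch(exif_times, offset_hours):
    """Map each file name to its corrected time without touching the files."""
    return {name: adjusted_time(value, offset_hours)
            for name, value in exif_times.items()}
```

The corrected times could then live in a catalogue or sidecar file, so the originals on disk and on DVD stay bit-wise identical to each other.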

In no case am I willing to adjust the EXIF time stamp before I
upload the original capture files from the camera cards to other
media (onto 2 devices or 1 device & a backup DVD).  DVD backups,
obviously, can't be "fixed" if I update the EXIF time stamps.

And, of course, my photo rate goes up by a multiplier (circa 10X)
when I'm traveling -- likely a direct correlation between number
of time zones changed and number of extra photos taken.


> Total originals are probably in the 100,000 - 1 million range,

I'm at the lower part of that range.  You're crazy... well, crazier. :-)

                                                Lee Jones

Re: Duplicate file location

by Tamas Rudnai

Why don't you guys use CVS or SVN, so that you can commit files to the
repository each time you modify a photo? Commit the original version
first, then commit the changes each time you adjust / crop / whatever
the picture. That way you have the backup and the version system at the
same time. After this you can back up the repository as many times as
you want using a traditional backup system.

Tamas



--
/* www.mcuhobby.com */ int main() { char *a,*s,*q; printf(s="/*
www.mcuhobby.com */ int main() { char *a,*s,*q; printf(s=%s%s%s,
q=%s%s%s%s,s,q,q,a=%s%s%s%s,q,q,q,a,a,q); }",
q="\"",s,q,q,a="\\",q,q,q,a,a,q); }

Re: Duplicate file location

by Russell McMahon-4

> There exists technology that will match photos based on their
> content...  I seem to recall some sort of 'fuzzy checksum' type
> methodology somewhere.   It looks like 'image comparison' on google
> reveals some interesting hits.

That would be good BUT I have in some cases got many very similar photos. eg
I got up at 5:30 to walk up to the Long Ji rice terraces before sunrise,
took many hundreds of photos, walked back down and had breakfast with my
wife and then we went up again and I took hundreds more :-).  I suspect that
image matching software would be hard put to separate that lot out ;-).

> I also remember watching a presentation on one of the Dish Network
> research channels about some researchers doing some really cool stuff
> where they'd take photos from flikr which were tagged with a given
> location, and it would build a 3d model of the site, and automatically
> figure out where the picture was taken... I guess not applicable to your
> application, but seriously cool that they figured out how to take a
> whole pile of images and stich them together automatically to build the
> 3d model.

The 3D site builder would have a ball though. I stupidly did not GPS log the
location and have not been able to locate it with certainty on Google
Earth - resolution in that area is too low.



Russell

Re: Duplicate file location

by Rolf-20

Russell wrote:
> Hopefully the obsessed photographers here will have at least a good idea about this subject as the experts on any photo list :-) :
>
> Summary:  Seeking opinions re duplicate-file location and management software. Needs to work with essentially unlimited capacity and number of files and number of connected or disconnected drives. "Actually working" is highly desirable. Free would be nice but is not essential. Emphasis is on photo files.
> _________________
>
> I have a large photo collection scattered across many hard drives of various capacity and vintage.
.....

Hi Russell.

So, what works for me.... and it is somewhat recently implemented (in
the past few years....).... like you I found I have stuff everywhere. It
does not sound quite as bad as your situation, but only perhaps because
my volume was somewhat less.

I have managed to use a few tools to get things nice and tidy again.
Primary among them are Linux, jhead, rsync, and some custom scripts and
Java programs.

jhead is a command-line tool that does a fair number of things, but all I
really use it for is to:
1. alter the file name to represent the date/time the image was taken...
for example, DSC4321.JPG becomes Img20091025.123456.78.jpg for a photo
taken at 73 milliseconds in to the second at 2009/10/25 12:34:56. It
does this by using the EXIF data.
2. I have also modified jhead slightly to be able to simultaneously
rename the RAW file (if any exists) associated with the JPG to have the
same name but different extension.
3. jhead is also instructed to inspect the orientation data in the EXIF,
and it will losslessly rotate JPG images if they were taken portrait
style...
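The naming scheme in step 1 is easy to reproduce outside jhead. The following Python sketch only illustrates the pattern: it assumes the final two-digit field is hundredths of a second, and the function name `dated_name` is made up.

```python
from datetime import datetime

def dated_name(taken, extension="jpg"):
    """Build a name like Img20091025.123456.78.jpg from a capture time.

    The last field is hundredths of a second, taken from the microsecond part.
    """
    hundredths = taken.microsecond // 10000
    return f"Img{taken:%Y%m%d.%H%M%S}.{hundredths:02d}.{extension}"
```

Calling it twice with the same capture time and different extensions gives the JPG and its RAW file matching names, which is the effect described in step 2.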

I have used the above to re-process all my previously messy files, but I
also use it now as part of the routine workflow for loading pictures
from my camera to my server.

I group the photos in to 'batches' where, normally a batch is what I
download from each memory card.... each batch is in a different folder,
and the folders are stored in a calendar year folder, and named
sequentially/chronologically....

i.e. I have .../2009/Batch001/Img20090101.000501.20.jpg for that photo
taken 5 minutes after the new year....

So, the new system is one where the photos are loaded off the camera,
pre-processed to rotate, rename, file, (and set all permissions to
read-only).

The new batch is then loaded in to Gallery2 web server (which I have
also slightly modified to ignore processing RAW photos...), and it loads
each photo, creates a couple of smaller image sizes for each photo, and
symbolically links the 'full size' image to the main photo repository.

Then, I manipulate things in the Gallery2 web app to build photo albums
of 'good' photos. I have then written a Java program that you can
call... it inspects these 'real' albums, and, from that, it can track
back the real photos that the album uses, and it pulls out the RAW photo
with the same name as the JPG in Gallery2. I can then re-touch, and
batch-process the RAW files for physical prints in a physical album, or
for enlargements.

The actual folder containing the original folders is exported read-only
using Samba to my other Windows machines. Being read-only is a good
thing because otherwise Windows will mess with the files, etc.

If I retouch images, I normally don't pay special attention to the
changed file, and delete it after it is used for enlargements/prints,
etc. I can always re-retouch the pics from the originals. Some special
re-works get re-filed with the originals manually, and then manually
re-imported to Gallery2. For the most part my re-touching is pretty
simple and easy to reproduce.

Once the photos were all consistently named, and (normally) stored in
separate folders which I have named sequentially in each calendar
year.... it became a whole lot easier to start filing the 'messy' photos
in with the existing photos, and removing duplicates, etc.

Again, having established this system, I was able then to go back and
re-process all the messy/scattered files and take each set of messy
files, re-process them using jhead, etc, and see if there were duplicate
names, etc. In some obvious cases I was able to completely discard wads
of photos as duplicates, in other cases I had to do some manual merging,
and in other cases I decided to skip the manual processing and add a
separate 'batch' and live with duplicate sets of 'originals'.

Now that I have a system going, I am also able to keep all the photos,
and only photos in a 'valuable' folder. Because Gallery2 symbolically
links to the photos the Gallery2 re-sized files for web-viewing are not
in the same place.

I then use rsync in an automated system to keep frequent (hourly)
snapshot-like backups of all photos.

The 'originals' are stored on two 1TB drives which are set up in a RAID0
(striped) array for performance reasons, but this significantly
increases the risk of data loss, since losing either drive loses the
whole array.

So, every hour all photos are re-synced on to a RAID1 set of 1.5TB
drives (mirrored). The sync is 'intelligent', in that it keeps a history
of all file modifications (there are none for photos except new photos
being added), but the same system is also used for other documents and
settings which change. As a result of this system I have a copy of the
state of every file for every hour in the past day, every day in the
past week, every week in the past 3 months, every month in the past
year, and every year since the system started.....
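
The retention schedule above can be expressed as a simple bucketing rule. The sketch below is only an illustration, not the actual backup tool's logic: it assumes "3 months" means 90 days and "a year" means 365 days, and that the newest snapshot in each bucket is the one kept. The function name `snapshots_to_keep` is hypothetical.

```python
from datetime import datetime, timedelta


def snapshots_to_keep(timestamps, now):
    """Return the set of snapshot datetimes to keep under the tiered schedule.

    Tiers (assumed boundaries): hourly for 1 day, daily for 7 days,
    weekly for 90 days, monthly for 365 days, yearly beyond that.
    """
    keep = set()
    seen_buckets = set()
    for ts in sorted(timestamps, reverse=True):  # newest first
        age = now - ts
        if age <= timedelta(days=1):
            bucket = ("hour", ts.year, ts.month, ts.day, ts.hour)
        elif age <= timedelta(days=7):
            bucket = ("day", ts.date())
        elif age <= timedelta(days=90):
            iso = ts.isocalendar()
            bucket = ("week", iso[0], iso[1])  # ISO year and week number
        elif age <= timedelta(days=365):
            bucket = ("month", ts.year, ts.month)
        else:
            bucket = ("year", ts.year)
        # The first (newest) snapshot seen in each bucket is kept.
        if bucket not in seen_buckets:
            seen_buckets.add(bucket)
            keep.add(ts)
    return keep
```

Anything not in the returned set is a candidate for pruning, so the number of retained snapshots grows very slowly even with hourly runs.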

Me deleting a file or making a stupid change should be easy to recover
from. Even formatting my drive means at most an hour of work lost....

Once every month I connect an external 1.5TB drive and completely
re-sync my complete historical backup. This disk is then stored safely
offsite ... ;-)

I submitted patches for jhead but they were not added to the tool, but
if you want I can happily send you the changes I have made (adding
millisecond granularity to timestamps, processing RAW files along with
the JPG, etc.), and I can share other details and code.

The system works for me in the sense that I now have about 78K photos
occupying about 560Gig. For the record, I will never back up to
CD/DVD/etc again. Hard-drives only for me.

The same drives are used for other data like email, documents, music,
etc. I have 800Gig of data actively processed through the system
regularly using the same backup and storage system.

Because my data is stored on a central server I have also invested in
Gigabit networks and, with the RAID0 system used for storage, it is
actually faster for me to process data over the network than to process
it locally. I do in fact max out the network at around 80MB/s when I am
transferring files around for upload or editing.

Also, because the 'server' does a lot of photo processing when uploading
photos, and because gallery2 is somewhat intensive, I have a pretty
beefy server (quad-core with 8Gig mem, etc.) and this helps a lot to
make the web site 'slick'. It almost embarrasses me that my server got
out of control... 6TB of storage permanently attached, with 3TB on hand
(2 external backup 1.5T drives)... with a beefy network, CPU, memory,
etc. But, it does the job really really well, and I am 'comfortable'
with my data in that it is as safe as I can afford at a relatively low
price, all in all.... just over $1K CAD, and no cost in software....

I just wish I could find a good software solution for photo-editing and
processing the Nikon RAW files I work off in Linux....

Anyway, I thought you might want to hear how I do it.


Rolf
--
http://www.piclist.com PIC/SX FAQ & list archive
View/change your membership options at
http://mailman.mit.edu/mailman/listinfo/piclist

Re: Duplicate file location

by Andrew Burchill-2

Russell,
Have you seen these guys?
<http://www.digikam.org/drupal/>
It is a Linux program, but there are people using it on Windows.
It may not be everything you want, but I only produce a couple of hundred
photos a month, and so far Digikam seems the best option for me. I've tried
Picasa, Gallery2, and plain folder filing with rsync.
The site includes a wiki and help files that could be searched for features
you need.

regards
--
...AndrewB

Re: Duplicate file location

by CDB-3



:: <http://www.digikam.org/drupal/>   it is a linux program, but
:: there are
:: people using it on Win.

I like Digikam, one of only two reasons I keep a Linux partition on my
laptop. Even understands Canon raw files.

Colin
--
cdb, colin@... on 10/30/2009
 
Web presence: www.btech-online.co.uk  
 