Synchronize with batch size

Hi,

Let me first say great work on the JetS3t package: it has loads of functionality that lets me use S3 without much work. I wanted to raise something about the Synchronize program that I discovered while using it to back up my files to S3.

I have 10000+ files that need to be synchronized to S3, so I set up Synchronize to do this to a bucket in S3. When run, it first encrypts all 10000+ files (which takes ages and uses the same amount of disk space again). This is already listed in the documentation, so no surprise there, but I wanted to see if the situation could be improved.

I looked in the source code and modified Synchronize to upload the files in batches of n files. This solves the waiting issue, since files are uploaded as soon as n files are encrypted, but not the disk space problem, because the JVM's option of deleting the temporary file on JVM exit is used. I have looked for a way to clean up the files after uploading, but the temporary file is hard to get at because of all the nested stream objects.

I have included my patch here; it uses a property to allow the user to specify the upload batch size. I realize that the problem is smaller on the initial upload, as all files need to be uploaded, but I think running Synchronize on a large number of files is something to optimize for. What do you say?

Also, I'd like some pointers on how cleaning up the temporary files could be done. My only idea is to save the temporary file name in the s3object and then delete that file in a cleanup() method.

Again, thanks for the good work,
/Allan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@...
For additional commands, e-mail: dev-help@...
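
For readers following the thread, the batching idea described in this message can be sketched roughly as below. This is only an illustration and not the patch itself: the class name BatchedUploadSketch and the Transformer and Uploader interfaces are hypothetical stand-ins for the encrypt/compress step and the S3 upload step, and nothing here is JetS3t API. Each batch is transformed, uploaded, and then has its temporary files deleted before the next batch starts, so disk usage should be bounded by roughly one batch.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of batched transform-and-upload; not JetS3t code. */
class BatchedUploadSketch {

    /** Stand-in for the encryption/compression step; returns a temporary file. */
    interface Transformer {
        File transform(File source) throws IOException;
    }

    /** Stand-in for the upload step. */
    interface Uploader {
        void upload(File prepared) throws IOException;
    }

    /**
     * Processes the sources in batches of batchSize and returns the number
     * of batches run. Temporary files are removed after each batch instead
     * of waiting for JVM exit.
     */
    static int run(List<File> sources, int batchSize,
                   Transformer transformer, Uploader uploader) throws IOException {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        int batches = 0;
        for (int start = 0; start < sources.size(); start += batchSize) {
            int end = Math.min(start + batchSize, sources.size());
            List<File> prepared = new ArrayList<>();
            try {
                // Transform only this batch, so at most batchSize temp files exist.
                for (File source : sources.subList(start, end)) {
                    prepared.add(transformer.transform(source));
                }
                for (File file : prepared) {
                    uploader.upload(file);
                }
            } finally {
                // Clean up this batch's temp files even if an upload failed.
                for (File file : prepared) {
                    Files.deleteIfExists(file.toPath());
                }
            }
            batches++;
        }
        return batches;
    }
}
```

With 5 source files and a batch size of 2, run would process three batches (2, 2 and 1 files).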
Re: Synchronize with batch size

Hi Allan,

Thanks for your recommendations, and in particular the patch you provided.

I have just committed changes to Synchronize based on your patch, such that file uploads that involve data transformation (i.e. compression or encryption) are batched. This means that the transformation action and upload task will be performed in batches, with the size of the batches set by 'upload.transformed-files-batch-size' in synchronize.properties.

I have also fixed the problem where temporary files hang around until the whole Synchronize process is finished. They should now be deleted immediately after being uploaded. There wasn't really a nice way to do this because Java's temporary files aren't identified as such after creation, so I had to create a TempFile placeholder class to recognize when it is safe to delete the uploaded file.

If you could check out the latest code from CVS and test it out, that would be fantastic. I'm especially keen to get feedback and squash bugs because I'm preparing for the next major release of JetS3t (version 0.7.0).

Cheers,
James

---
http://www.jamesmurty.com

On Sun, Jan 4, 2009 at 3:17 AM, Allan Frank <allan-javanet@...> wrote:
> Hi
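
The placeholder-class approach mentioned above might look something like the following. This is a guess at the general technique, not the class that was committed: MarkedTempFile and deleteIfTemp are hypothetical names. The idea is that a java.io.File returned by File.createTempFile carries no flag saying it is temporary, so a subclass can act as a marker that the upload code checks before deleting anything.

```java
import java.io.File;
import java.io.IOException;

/**
 * Hypothetical marker subclass: a File that is known to be a temporary
 * copy and is therefore safe to delete after upload. Not JetS3t code.
 */
class MarkedTempFile extends File {
    private static final long serialVersionUID = 1L;

    MarkedTempFile(File created) {
        super(created.getPath());
    }

    /** Creates a temp file on disk and wraps it in the marker type. */
    static MarkedTempFile create(String prefix, String suffix) throws IOException {
        return new MarkedTempFile(File.createTempFile(prefix, suffix));
    }

    /**
     * Deletes the file only if it carries the marker type, so the user's
     * original (untransformed) files are never touched.
     * Returns true if a file was deleted.
     */
    static boolean deleteIfTemp(File uploaded) {
        return uploaded instanceof MarkedTempFile && uploaded.delete();
    }
}
```

Upload code could then call deleteIfTemp on every file it finishes with, without needing to know whether a given upload involved a transformation.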
Re: Synchronize with batch size / data integrity

On Mon, Jan 05, 2009 at 12:44:08AM +1100, James Murty wrote:
> I have just committed changes to Synchronize based on your patch, such that
> file uploads that involve data transformation (ie. compression or
> encryption) are batched. This means that the transformation action and
> upload task will be performed in batches, with the size of the batches set
> by 'upload.transformed-files-batch-size' in synchronize.properties.
>
> I have also fixed the problem where temporary files hang around until the
> whole Synchronize process is finished. They should now be deleted
> immediately after being uploaded. There wasn't really a nice way to do this
> because Java's temporary files aren't identified as such after creation, so
> I had to create a TempFile placeholder class to recognize when it is safe to
> delete the uploaded file.
>

Nice! I have been testing this out with batches of size 50 and 1000, and with the property not set, and it works perfectly. As soon as a file has been uploaded, its temporary file disappears. Larger batch sizes cause a longer wait before uploading starts, which is nicer for the initial run where you want to see if it works.

I was thinking of wrapping Synchronize in a script that would run it at night and kill it in the morning if it was not done. While my tests of Synchronize (with kill and Ctrl+C) have not shown any data corruption as a result of the Java process being killed, can data be partially uploaded if Synchronize is interrupted during its run? As I see it, only the original file's MD5 hash property is compared on the next run, so if the file is not complete it would never be discovered (until the restore, that is :))

Thanks,
/Allan
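
To make the concern about the hash comparison concrete, here is a minimal sketch of a check that compares a local file's MD5 against a hash stored with the remote object. It is an assumption about the general shape of such a check, not Synchronize's actual logic: Md5CompareSketch, md5Hex and needsUpload are hypothetical names, and the stored hash is assumed to be a hex string. Because the comparison uses the hash of the original file, it says nothing by itself about whether the uploaded bytes are complete.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch of an MD5-based "has this file changed?" check. */
class Md5CompareSketch {

    /** Streams the file through MD5 and returns the digest as lowercase hex. */
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md5.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /**
     * Returns true when there is no stored hash or when it differs from the
     * local file's hash. storedMd5Hex is assumed to be hex-encoded.
     */
    static boolean needsUpload(Path local, String storedMd5Hex)
            throws IOException, NoSuchAlgorithmException {
        return storedMd5Hex == null || !storedMd5Hex.equalsIgnoreCase(md5Hex(local));
    }
}
```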
Re: Synchronize with batch size / data integrity

Hi Allan,

I'm glad the Synchronize changes are working well for you.

Regarding partial uploads, due to the way S3 works there is no chance of partial uploads causing confusion. S3 will only store data when the entire upload is completed successfully; partial uploads are simply discarded. This is annoying if you need to upload very large files, but it means you can be sure of the integrity of any data that is uploaded and confirmed.

James

On Mon, Jan 12, 2009 at 9:27 PM, Allan Frank <allan-javanet@...> wrote:
> On Mon, Jan 05, 2009 at 12:44:08AM +1100, James Murty wrote: