Synchronize with batch size

Synchronize with batch size

by Allan Frank-2

Hi

Let me first say great work on the JetS3t package; loads of
functionality that allows me to use S3 without much work.

I wanted to discuss something about the Synchronize program that I
discovered while using it to back up my files to S3. I have 10000+ files
that need to be synchronized to S3, so I set up Synchronize to do this
to a bucket in S3. When run, it first encrypts all 10000+ files
(which takes ages and the same amount of disk space again). This is already
listed in the documentation, so no surprise there, but I wanted to see if
the situation could be improved. I looked in the source code and modified
Synchronize to upload the files in batches of n files. This solves the
waiting issue, since files are uploaded as soon as n files are encrypted,
but not the disk space problem, because the JVM's option of deleting the
temporary file on JVM exit is used. I have looked for a way to clean up the
files after uploading, but the temporary file is hard to get at because of
all the nested stream objects.

I have included my patch here. It uses a property to allow the user to
specify the upload batch size. I realize that the problem is smaller on
the initial upload as all files need to be uploaded, but I think running
Synchronize on a large number of files is something to optimize for. What
do you say?

Also, I'd like some pointers on how cleaning up the temporary files
could be done. My only idea is to save the temporary file name in
the s3object and then delete that file in a cleanup() method.

Again, thanks for the good work,

/Allan



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@...
For additional commands, e-mail: dev-help@...


Re: Synchronize with batch size

by James Murty-3

Hi Allan,

Thanks for your recommendations, and in particular the patch you provided.

I have just committed changes to Synchronize based on your patch, such that file uploads that involve data transformation (ie. compression or encryption) are batched. This means that the transformation action and upload task will be performed in batches, with the size of the batches set by 'upload.transformed-files-batch-size' in synchronize.properties.
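As a rough illustration of the batching idea (not the actual JetS3t code), the sketch below reads the batch size from a properties source and splits a file list into batches. Only the property name 'upload.transformed-files-batch-size' comes from this thread; the class name, the helper methods, and the fallback used when the property is missing are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class BatchSketch {
    // Property name as given in this thread; everything else is hypothetical.
    static final String BATCH_SIZE_PROPERTY = "upload.transformed-files-batch-size";

    /**
     * Reads the batch size. ASSUMPTION: when the property is not set, all files
     * go into a single batch. The real default is not stated in this thread.
     */
    static int batchSize(Properties props, int totalFiles) {
        String value = props.getProperty(BATCH_SIZE_PROPERTY);
        if (value == null) {
            return Math.max(1, totalFiles);
        }
        int size = Integer.parseInt(value.trim());
        if (size < 1) {
            throw new IllegalArgumentException("batch size must be at least 1");
        }
        return size;
    }

    /** Splits the items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the contents of synchronize.properties.
        Properties props = new Properties();
        props.load(new StringReader(BATCH_SIZE_PROPERTY + "=50\n"));

        List<String> files = new ArrayList<>();
        for (int i = 0; i < 120; i++) {
            files.add("file-" + i + ".dat");
        }

        for (List<String> batch : partition(files, batchSize(props, files.size()))) {
            // A real tool would transform (encrypt or compress) this batch,
            // upload it, and only then move on to the next batch.
            System.out.println("batch of " + batch.size() + " files");
        }
    }
}
```

With 120 files and a batch size of 50, the loop runs three times, so uploading can begin after the first 50 files are transformed rather than after all 120.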

I have also fixed the problem where temporary files hang around until the whole Synchronize process is finished. They should now be deleted immediately after being uploaded. There wasn't really a nice way to do this because Java's temporary files aren't identified as such after creation, so I had to create a TempFile placeholder class to recognize when it is safe to delete the uploaded file.
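To show the general shape of the marker-class approach, here is a minimal sketch. It is not the TempFile class committed to JetS3t; the names MarkedTempFile, createTransformedCopy, and cleanupAfterUpload are hypothetical and exist only for this example. The idea is that a java.io.File subclass carries the "this is temporary" information that a plain File loses after creation.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class TempFileSketch {
    /** Hypothetical marker type: instances are known to be safe to delete after upload. */
    static class MarkedTempFile extends File {
        private static final long serialVersionUID = 1L;

        MarkedTempFile(File file) {
            super(file.getPath());
        }
    }

    /** Writes already-transformed bytes to a temporary file and returns it marked as temporary. */
    static File createTransformedCopy(byte[] transformed) throws IOException {
        File raw = File.createTempFile("sync-sketch-", ".tmp");
        raw.deleteOnExit(); // safety net in case cleanup is never reached
        Files.write(raw.toPath(), transformed);
        return new MarkedTempFile(raw);
    }

    /**
     * Intended to be called once an upload has completed. Deletes the file only
     * if it carries the marker type, so a user's original file is never removed.
     * Returns true if a file was deleted.
     */
    static boolean cleanupAfterUpload(File uploaded) {
        return uploaded instanceof MarkedTempFile && uploaded.delete();
    }

    public static void main(String[] args) throws IOException {
        File temp = createTransformedCopy(new byte[] {1, 2, 3});
        System.out.println("exists before cleanup: " + temp.exists());
        System.out.println("deleted: " + cleanupAfterUpload(temp));
        System.out.println("exists after cleanup: " + temp.exists());
    }
}
```

The instanceof check is the whole point of the design: the upload code can be handed any File, and only files created through the temporary path are ever deleted.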

If you could check out the latest code from CVS and test it out, that would be fantastic. I'm especially keen to get feedback and squash bugs because I'm preparing for the next major release of JetS3t (version 0.7.0).

Cheers,
James

---
http://www.jamesmurty.com


On Sun, Jan 4, 2009 at 3:17 AM, Allan Frank <allan-javanet@...> wrote:
> [...]



Re: Synchronize with batch size / data integrity

by Allan Frank-2

On Mon, Jan 05, 2009 at 12:44:08AM +1100, James Murty wrote:

> I have just committed changes to Synchronize based on your patch, such that
> file uploads that involve data transformation (ie. compression or
> encryption) are batched. This means that the transformation action and
> upload task will be performed in batches, with the size of the batches set
> by 'upload.transformed-files-batch-size' in synchronize.properties.
>
> I have also fixed the problem where temporary files hang around until the
> whole Synchronize process is finished. They should now be deleted
> immediately after being uploaded. There wasn't really a nice way to do this
> because Java's temporary files aren't identified as such after creation, so
> I had to create a TempFile placeholder class to recognize when it is safe to
> delete the uploaded file.
>
Hi James

Nice!

I have been testing this out with batch sizes of 50 and 1000, and with the property not set, and it works perfectly. As soon as a file has been uploaded, its temporary file disappears. Larger batch sizes cause a longer wait before uploading starts, so a smaller batch size is nicer for the initial run, where you want to see quickly whether it works.

I was thinking of wrapping Synchronize in a script that would run it at night and kill it in the morning if it was not done. My tests of Synchronize (with kill and Ctrl+C) have not shown any data corruption as a result of the Java process being killed, but can data be partially uploaded if Synchronize is interrupted during its run?
As I see it, only the original file's MD5 hash property is compared on the next run, so an incomplete file would never be discovered (until the restore, that is :))

Thanks,

/Allan




Re: Synchronize with batch size / data integrity

by James Murty-2

Hi Allan,

I'm glad the Synchronize changes are working well for you.

Regarding partial uploads, due to the way S3 works there is no chance of partial uploads causing confusion. S3 will only store data when the entire upload is completed successfully; partial uploads are simply discarded. This is annoying if you need to upload very large files, but it means you can be sure of the integrity of any data that is uploaded and confirmed.
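For anyone who wants to double-check an individual object anyway, one option is to compare a locally computed MD5 hash with the ETag that S3 reports for the object. The sketch below is only an illustration under assumptions: it assumes the object was stored with a single plain PUT, in which case the ETag is the hex MD5 of the stored bytes, and the class and method names are hypothetical. For encrypted or compressed uploads, the bytes to hash are the transformed ones, not the original file.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5CheckSketch {
    /** Returns the lowercase hex MD5 digest of the given bytes. */
    static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    /**
     * Compares the bytes that were sent with an ETag value as returned by S3.
     * ASSUMPTION: single plain PUT, so the ETag is the hex MD5 of the content.
     * ETags are usually wrapped in double quotes, which are stripped here.
     */
    static boolean matchesETag(byte[] uploadedBytes, String eTag)
            throws NoSuchAlgorithmException {
        String normalized = eTag.replace("\"", "").trim().toLowerCase();
        return md5Hex(uploadedBytes).equals(normalized);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] content = "hello world".getBytes(StandardCharsets.UTF_8);
        // Stand-in for an ETag header value fetched from S3.
        String eTag = "\"5eb63bbbe01eeed093cb22bb8f5acdc3\"";
        System.out.println("matches: " + matchesETag(content, eTag));
    }
}
```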

James


On Mon, Jan 12, 2009 at 9:27 PM, Allan Frank <allan-javanet@...> wrote:
> [...]