log files...suppression

by doughenderson

Hello,
We have been working happily with JetS3t and Synchronize (primarily) for several months
now and find most things to be working just fine. There is one item I must do something
about, and I hope someone has been there, done that already. When we first started, our
synchronize jobs took an hour or two to complete. But as time went by and the log files
grew, the synchronize runs now take close to 5 hours to complete with over 14,000 files.

My question is, "Has anyone modified synchronize to ignore log files?" It seems to be
spending most of its time looking at them vs. synchronizing the files. If not, I will happily
go explore the source files and work on it. I promise to report back my findings, and
would take suggestions as to where to look to make the change.

Thanks,

Doug

Re: log files...suppression

by James Murty-3

Hi Doug,

If I understand correctly, your sync process is being delayed because your bucket contains a large number of S3 log files and Synchronize has to list all of these to decide what to upload?

The problem is that Synchronize needs to list and examine the contents of the target S3 bucket or path before it can determine what action to take, and listing large numbers of objects can take a long time. You could modify the source code to ignore certain files (e.g. log files) but it would still take time to obtain a listing of all these files even if their details are not examined.

I think the best solution would be to partition your uploads into different path locations in S3. If Synchronize is given a path within a bucket it will only list the contents at that path and will skip over everything else, without having to list or examine all the other objects in the bucket. If you can partition your bucket contents based on some kind of naming scheme, such as by year & month or customer name, you can divide up your bucket into manageable portions while still being able to find objects when you need them.

You can apply a path to your Synchronize uploads by adding a suffix to your bucket name in the command like this:

synchronize.sh UP bucket-name/2009/09 /path/to/files
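
If the uploads are scripted, the path suffix can be derived from a date rather than typed by hand. A minimal sketch, assuming a year/month layout like the one above; the helper names s3_month_path and sync_month_cmd are made up for illustration, and the script only prints the command instead of running it:

```shell
#!/bin/sh
# Hypothetical helper: build a year/month-partitioned S3 path.
# Usage: s3_month_path BUCKET YYYY-MM-DD
s3_month_path() {
    bucket=$1
    day=$2
    year=${day%%-*}      # text before the first "-"
    rest=${day#*-}       # text after the first "-"
    month=${rest%%-*}    # month part of what remains
    printf '%s/%s/%s\n' "$bucket" "$year" "$month"
}

# Hypothetical helper: print (not run) the Synchronize command for that path.
# Usage: sync_month_cmd BUCKET YYYY-MM-DD LOCAL_DIR
sync_month_cmd() {
    printf 'synchronize.sh UP %s %s\n' "$(s3_month_path "$1" "$2")" "$3"
}

sync_month_cmd bucket-name 2009-09-29 /path/to/files
```

Once the printed command looks right, the printf in sync_month_cmd can be swapped for the real call.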

Hope this helps,
James


On Tue, Sep 29, 2009 at 4:48 PM, doughenderson <dhenderson@...> wrote:

> My question is, "Has anyone modified synchronize to ignore log files?"

--
View this message in context: http://www.nabble.com/log-files...suppression-tp25672836p25672836.html
Sent from the JetS3t Development mailing list archive at Nabble.com.


Re: log files...suppression

by doughenderson

Hello James,

Yes, that is correct.
The unforeseen problem was that we would have so many
log files as to be unmanageable (in just a few months).
Our automation of all downloads in production is hard
to change (it's live), and a logging solution from a third party
puts them in the root. One thought we had, although half-baked,
was to rename the log files to zlog/... and use the batch method
you supplied, assuming that our product files would be updated
more quickly in batches and would also be processed before the
zlog files. Since the product names begin with 'P' and 'S', does
this seem possible? Perhaps I misunderstand the nature of the
synching and this won't work, but I don't mind if it takes 6 hours
as long as our files are found and processed in a few hours.

Thanks,

Doug

p.s. we will also investigate trimming the oldest log files


Re: log files...suppression

by James Murty-2

Hi Doug,

If the log files you want to sync to S3 have different names from others you don't need to sync (or that can be recognized as already synced), you can be quite selective about which files are uploaded by using wildcards to limit the files Synchronize pays attention to on the client side. For example:

synchronize.sh UP bucket-name P*.log

An even better solution might be possible if the log files include a date/timestamp component. Then you could focus on exactly the log files produced within a given time period and ignore everything else:

synchronize.sh UP bucket-name *2009-10-01*.log
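
In a scheduled job the date in the pattern could be filled in by the script. A rough sketch, assuming the log names contain a YYYY-MM-DD stamp; sync_day_cmd is a hypothetical helper that only prints the command, and placing --nodelete ahead of UP is an assumption about the option order:

```shell
#!/bin/sh
# Hypothetical helper: print the Synchronize command for one day's logs.
# Usage: sync_day_cmd BUCKET YYYY-MM-DD [LOG_DIR]
sync_day_cmd() {
    bucket=$1
    day=$2
    dir=${3:-.}
    # Let the shell expand the wildcard into the matching file names.
    set -- "$dir"/*"$day"*.log
    # With no match the pattern stays literal, so check the first entry.
    [ -e "$1" ] || { echo "no log files for $day" >&2; return 1; }
    # --nodelete because the local file selection is limited (assumed placement).
    printf 'synchronize.sh --nodelete UP %s' "$bucket"
    printf ' %s' "$@"
    printf '\n'
}
```

For today's logs this could be called as: sync_day_cmd bucket-name "$(date +%Y-%m-%d)" /var/logs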

Even if the files don't have timestamps, you could accomplish something similar by using the "find" command to locate files changed within the last day -- assuming you are on a Linux/Unix system that has this command available -- and then provide the resulting file listing to Synchronize:

find /var/logs -type f -ctime -1 | xargs sh synchronize.sh UP bucket-name

If you combine client-side filtering of files with some partitioning of your S3 bucket into different "subdirectory" paths you should be able to greatly speed up your synchronization process.

*** Just be sure to remember the --nodelete option whenever you use a Synchronize command that limits the file selection locally, otherwise Synchronize will remove files in S3 that were not included in the local file listing. ***
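
Putting the "find" approach and --nodelete together might look like the sketch below. It is only an illustration: sync_recent_cmd is a made-up name, -mtime (content modification time) stands in for "created" because find has no portable creation-time test, the position of --nodelete before UP is assumed, and the leading echo means the command is printed rather than run:

```shell
#!/bin/sh
# Hypothetical helper: print the Synchronize command for files under DIR
# that were modified within the last day.
# Usage: sync_recent_cmd BUCKET DIR
sync_recent_cmd() {
    bucket=$1
    dir=$2
    # -type f     : regular files only, no directories
    # -mtime -1   : modified less than 24 hours ago
    # -exec ... + : pass file names in batches, safe for names with spaces
    # Remove "echo" to actually run Synchronize.
    find "$dir" -type f -mtime -1 -exec \
        echo synchronize.sh --nodelete UP "$bucket" {} +
}
```

Example call: sync_recent_cmd bucket-name /var/logs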

James


On Thu, Oct 1, 2009 at 3:09 PM, doughenderson <dhenderson@...> wrote:

> Perhaps I misunderstand the nature of the synching and this won't work, but I don't mind if it takes 6 hours as long as our files are found and processed in a few hours.