Logfile Manipulation

I've got a large amount of data in the form of 3 apache and 3 varnish logfiles from 3 different machines. They are rotated at 0400. The logfiles are pretty big - maybe 6G per server, uncompressed.

I've got to produce a combined logfile for 0000-2359 for a given day, with a bit of filtering (removing lines based on text match, bit of substitution).

I've inherited a nasty shell script that does this but it is very slow and not clean to read or understand.

I'd like to reimplement this in python.

Initial questions:

* How does Python compare in performance to shell, awk etc in a big pipeline? The shell script kills the CPU
* What's the best way to extract the data for a given time, eg 0000 - 2359 yesterday?

Any advice or experiences?

S.

--
Stephen Nelson-Smith
Technical Director
Atalanta Systems Ltd
www.atalanta-systems.com
_______________________________________________
Tutor maillist - Tutor@...
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
|
|
Re: Logfile Manipulation

"Stephen Nelson-Smith" <sanelson@...> wrote

> * How does Python compare in performance to shell, awk etc in a big
> pipeline? The shell script kills the CPU

Python should be significantly faster than the typical shell script and it should consume less resources, although it will probably still use a fair bit of CPU unless you nice it.

> * What's the best way to extract the data for a given time, eg 0000 -
> 2359 yesterday?

I'm not familiar with Apache log files so I'll let somebody else answer, but I suspect you can either use string.split() or a re.findall(). You might even be able to use csv. Or if they are in XML you could use ElementTree. It all depends on the data!

--
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/
|
|
Re: Logfile Manipulation

On Mon, Nov 9, 2009 at 8:47 AM, Alan Gauld <alan.gauld@...> wrote:

> I'm not familiar with Apache log files so I'll let somebody else answer,
> but I suspect you can either use string.split() or a re.findall(). You might
> even be able to use csv. Or if they are in XML you could use ElementTree.
> It all depends on the data!

An apache logfile entry looks like this:

89.151.119.196 - - [04/Nov/2009:04:02:10 +0000] "GET /service.php?s=nav&arg[]=&arg[]=home&q=ubercrumb/node%2F20812 HTTP/1.1" 200 50 "-" "-"

I want to extract 24 hrs of data based on timestamps like this:

[04/Nov/2009:04:02:10 +0000]

I also need to do some filtering (eg I actually don't want anything with service.php), and I also have to do some substitutions - that's trivial other than not knowing the optimum place to do it. IE should I do multiple passes? Or should I try to do all the work at once, only viewing each line once?

Also what about reading from compressed files? The data comes in as 6 gzipped logfiles which expand to 6G in total.

S.
|
|
Re: Logfile Manipulation

> An apache logfile entry looks like this:
>
> 89.151.119.196 - - [04/Nov/2009:04:02:10 +0000] "GET
> /service.php?s=nav&arg[]=&arg[]=home&q=ubercrumb/node%2F20812
> HTTP/1.1" 200 50 "-" "-"
>
> I want to extract 24 hrs of data based timestamps like this:
>
> [04/Nov/2009:04:02:10 +0000]

OK. It looks like you could use a regex to extract the first thing you find between square brackets. Then convert that to a time.

> I also need to do some filtering (eg I actually don't want anything
> with service.php),

That's easy enough to detect.

> and I also have to do some substitutions - that's
> trivial other than not knowing the optimum place to do it?

Assuming they are trivial then...

> IE should I do multiple passes? Or should I try to do all the work at once,

I'd opt for doing it all in one pass. With such large files you really want to minimise the amount of time spent reading the file. Plus with such large files you will need/want to process them line by line anyway rather than reading the whole thing into memory.

> Also what about reading from compressed files?
> The data comes in as 6 gzipped logfiles

Python has a module for that but I've never used it.

BTW A quick google reveals that there are several packages for handling Apache log files. That is probably worth investigating before you start writing lots of code...

Examples:

Loghetti: an apache log file filter in Python - O'Reilly ONLamp Blog, 18 Mar 2008 ...
Loghetti: an apache log file filter in Python ... This causes loghetti to parse the query string, and return lines where the query parameter ...
www.oreillynet.com/.../blog/.../loghetti_an_apache_log_file_fi.html

Alan G.
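
The "regex, then convert to a time" suggestion above could be sketched roughly as follows. This is only an illustration in Python 3 (the thread itself predates it), and it assumes the timestamp is the first bracketed group on the line, in the format shown in the sample entry. The names STAMP_RE, parse_stamp and in_day are made up for this sketch. Note that %b depends on the locale using English month abbreviations.

```python
import re
from datetime import datetime, timezone

# Assumption: the timestamp is the first [...] group on the line, in the
# format shown earlier in the thread: [04/Nov/2009:04:02:10 +0000]
STAMP_RE = re.compile(r"\[([^\]]+)\]")


def parse_stamp(line):
    """Return a timezone-aware datetime for a log line, or None if absent."""
    match = STAMP_RE.search(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")


def in_day(line, year, month, day):
    """True if the line's timestamp falls on the given UTC calendar day."""
    stamp = parse_stamp(line)
    if stamp is None:
        return False
    utc = stamp.astimezone(timezone.utc)
    return (utc.year, utc.month, utc.day) == (year, month, day)


sample = ('89.151.119.196 - - [04/Nov/2009:04:02:10 +0000] '
          '"GET /service.php?s=nav HTTP/1.1" 200 50 "-" "-"')
print(parse_stamp(sample))
print(in_day(sample, 2009, 11, 4))
```

Parsing the offset with %z and normalising to UTC is one way to sidestep the time zone question; if every server is known to log in the same zone, a cheaper string comparison on the date prefix may be enough.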
|
|
|
|
|
Re: Logfile Manipulation

Hello,

: An apache logfile entry looks like this:
:
: 89.151.119.196 - - [04/Nov/2009:04:02:10 +0000] "GET
: /service.php?s=nav&arg[]=&arg[]=home&q=ubercrumb/node%2F20812
: HTTP/1.1" 200 50 "-" "-"
:
: I want to extract 24 hrs of data based timestamps like this:
:
: [04/Nov/2009:04:02:10 +0000]
:
: I also need to do some filtering (eg I actually don't want
: anything with service.php), and I also have to do some
: substitutions - that's trivial other than not knowing the optimum
: place to do it? IE should I do multiple passes?

I wouldn't. Then, you spend decompression CPU, line matching CPU and I/O several times. I'd do it all at once.

: Or should I try to do all the work at once, only viewing each
: line once? Also what about reading from compressed files? The
: data comes in as 6 gzipped logfiles which expand to 6G in total.

There are standard modules for handling compressed data (gzip and bz2). I'd imagine that the other pythonistas on this list will give you more detailed (and probably better) advice, but here's a sample of how to use the gzip module and how to skip the lines containing the '/service.php' string, and to extract an epoch timestamp from the datestamp field(s). You would pass the filenames to operate on as arguments to this script.

See optparse if you want fancier capabilities for option handling.

See re if you want to match multiple patterns to ignore.

See time (and datetime) for mangling time and date strings. Be forewarned, time zone issues will probably be a massive headache. Many others have been here before [0].

Look up itertools (and be prepared for some study) if you want the output from the log files from your different servers sorted in the output.

Note that the below snippet is a toy and makes no attempt to trap (try/except) any error conditions.

If you are looking for a weblog analytics package once you have reassembled the files into a whole, perhaps you could just start there (e.g. webalizer, analog are two old-school packages that come to mind for processing logging that has been produced in a Common Log Format).

I will echo Alan Gauld's sentiments of a few minutes ago and note that there are probably many different Apache log parsers out there which can accomplish what you hope to accomplish. On the other hand, you may be using this as an excuse to learn a bit of python.

Good luck,

-Martin

[0] http://seehuhn.de/blog/52

Sample:

  import sys, time, gzip

  files = sys.argv[1:]

  for file in files:
    print >>sys.stderr, "About to open %s" % ( file )
    f = gzip.open( file )
    for line in f:
      if line.find('/service.php') > 0:
        continue
      fields = line.split()
      # -- ignoring time zone; you are logging in UTC, right?
      # tz = fields[4]
      d = int( time.mktime( time.strptime(fields[3], "[%d/%b/%Y:%H:%M:%S") ) )
      print d, line,

--
Martin A. Brown
http://linux-ip.net/
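
For readers on Python 3, the single-pass idea in the sample above (decompress, filter and substitute while each line is seen once) might look something like this. It is a sketch only: the function name filtered_lines, the skip text default and the example substitution rule are assumptions for illustration, not anything specified in the thread.

```python
import gzip
import re


def filtered_lines(path, skip_text="/service.php", substitutions=()):
    """Yield lines from a gzipped log in one pass, filtered and rewritten.

    substitutions is a sequence of (compiled_regex, replacement) pairs;
    both the skip text and the substitutions are placeholders here.
    """
    # "rt" decodes to str; errors="replace" keeps odd bytes from aborting the run.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if skip_text in line:
                continue
            for pattern, replacement in substitutions:
                line = pattern.sub(replacement, line)
            yield line


# Hypothetical substitution rule: zero the last octet of the client address.
EXAMPLE_RULES = [(re.compile(r"^(\d+\.\d+\.\d+)\.\d+"), r"\1.0")]
```

Something like `for line in filtered_lines("access_log.gz", substitutions=EXAMPLE_RULES): sys.stdout.write(line)` would then stream the result without ever holding a whole file in memory, since the generator hands over one line at a time.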
|
|
Re: Logfile Manipulation

On Mon, Nov 9, 2009 at 4:36 AM, Stephen Nelson-Smith <sanelson@...> wrote:

>>>> I want to extract 24 hrs of data based timestamps like this:
>>>>
>>>> [04/Nov/2009:04:02:10 +0000]
>>>
>>> OK It looks like you could use a regex to extract the first
>>> thing you find between square brackets. Then convert that to a time.
>>
>> I'm currently thinking I can just use a string comparison after the
>> first entry for the day - that saves date arithmetic.

As long as the times are all in the same time zone.

>> How do I handle concurrency? I have 6 log files which I need to turn
>> into one time-sequenced log.
>>
>> I guess I need to switch between each log depending on whether the
>> next entry is the next chronological entry between all six. Then on a
>> per line basis I can also reject it if it matches the stuff I want to
>> throw out, and substitute it if I need to, then write out to the new
>> file.

If you create iterators from the files that yield (timestamp, entry) pairs, you can merge the iterators using one of these recipes:

http://code.activestate.com/recipes/491285/
http://code.activestate.com/recipes/535160/

Kent
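
As an alternative to the recipes linked above, the standard library's heapq.merge performs the same kind of lazy merge of sorted iterators. The following is a hedged Python 3 sketch (the key argument needs Python 3.5 or later); stamp_of and merge_logs are names invented here, the sample lines are made up, and like the earlier sample in the thread it ignores the time zone field.

```python
import heapq
from datetime import datetime


def stamp_of(line):
    """Parse the '[dd/Mon/yyyy:HH:MM:SS' field (4th whitespace field)."""
    return datetime.strptime(line.split()[3], "[%d/%b/%Y:%H:%M:%S")


def merge_logs(*sources):
    """Lazily merge already-sorted iterables of log lines by timestamp."""
    return heapq.merge(*sources, key=stamp_of)


# Three tiny stand-ins for the per-server log iterators.
a = ['h1 - - [05/Nov/2009:01:41:36 +0000] "GET /a"',
     'h1 - - [05/Nov/2009:01:41:39 +0000] "GET /b"']
b = ['h2 - - [05/Nov/2009:01:41:37 +0000] "GET /c"']
c = ['h3 - - [05/Nov/2009:01:41:38 +0000] "GET /d"']

for line in merge_logs(a, b, c):
    print(line)
```

The merge only ever holds one pending line per input, so memory use stays flat regardless of file size. It does rely on each input already being in timestamp order.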
|
|
Re: Logfile Manipulation

On Sun, Nov 8, 2009 at 11:41 PM, Stephen Nelson-Smith <sanelson@...> wrote:

> I've got a large amount of data in the form of 3 apache and 3 varnish
> [...]
> Any advice or experiences?

go here and download the pdf!
http://www.dabeaz.com/generators-uk/

Someone posted this the other day, and I went and read through it and played around a bit and it's exactly what you're looking for - plus it has one slide of python vs. awk.

I think you'll find the pdf highly useful and right on.

HTH,
Wayne
|
|
Re: Logfile Manipulation

Hi,

>> Any advice or experiences?
>
> go here and download the pdf!
> http://www.dabeaz.com/generators-uk/
> Someone posted this the other day, and I went and read through it and played
> around a bit and it's exactly what you're looking for - plus it has one
> slide of python vs. awk.
> I think you'll find the pdf highly useful and right on.

Looks like generators are a really good fit. My biggest question really is how to multiplex.

I have 6 logs per day, so I don't know which one will have the next consecutive entry.

I love the idea of making a big dictionary, but with 6G of data, that's going to run me out of memory, isn't it?

S.
|
|
Re: Logfile Manipulation

Stephen Nelson-Smith wrote:

> Looks like generators are a really good fit. My biggest question
> really is how to multiplex.
>
> I have 6 logs per day, so I don't know which one will have the
> next consecutive entry.
>
> I love the idea of making a big dictionary, but with 6G of data,
> that's going to run me out of memory, isn't it?

Perhaps "lookahead" generators could help? Though that would be getting into advanced territory:

http://stackoverflow.com/questions/1517862/using-lookahead-with-generators
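
To make the "lookahead" idea concrete, here is one possible shape for it in Python 3. It is a sketch under assumptions, not the approach from the linked question: Peekable and multiplex are names invented here, and each input is assumed to be sorted already.

```python
class Peekable:
    """Wrap an iterator so the next item can be inspected without consuming it."""

    _EMPTY = object()  # sentinel marking an exhausted source

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._head = next(self._it, self._EMPTY)

    def exhausted(self):
        return self._head is self._EMPTY

    def peek(self):
        if self.exhausted():
            raise IndexError("peek on an exhausted source")
        return self._head

    def pop(self):
        head = self.peek()
        self._head = next(self._it, self._EMPTY)
        return head


def multiplex(*iterables):
    """Yield items from several sorted iterables in overall sorted order."""
    sources = [Peekable(it) for it in iterables]
    while True:
        live = [s for s in sources if not s.exhausted()]
        if not live:
            return
        # Look at the head of every live source and consume the smallest.
        yield min(live, key=Peekable.peek).pop()


print(list(multiplex([1, 4, 7], [2, 5], [3, 6])))
```

Only one item per source is buffered at any time, which answers the memory worry about a 6G dictionary. With six inputs the linear scan for the minimum is cheap; a heap would matter only with many more sources.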
|
|
Re: Logfile Manipulation

Hi,

> If you create iterators from the files that yield (timestamp, entry)
> pairs, you can merge the iterators using one of these recipes:
> http://code.activestate.com/recipes/491285/
> http://code.activestate.com/recipes/535160/

Could you show me how I might do that?

So far I'm at the stage of being able to produce loglines:

#! /usr/bin/env python
import gzip

class LogFile:
    def __init__(self, filename, date):
        self.f=gzip.open(filename,"r")
        for logline in self.f:
            self.line=logline
            self.stamp=" ".join(self.line.split()[3:5])
            if self.stamp.startswith(date):
                break

    def getline(self):
        ret=self.line
        self.line=self.f.readline()
        self.stamp=" ".join(self.line.split()[3:5])
        return ret

logs=[LogFile("a/access_log-20091105.gz","[05/Nov/2009"),
      LogFile("b/access_log-20091105.gz","[05/Nov/2009"),
      LogFile("c/access_log-20091105.gz","[05/Nov/2009")]
while True:
    print [x.stamp for x in logs]
    nextline=min((x.stamp,x) for x in logs)
    print nextline[1].getline()

--
Stephen Nelson-Smith
Technical Director
Atalanta Systems Ltd
www.atalanta-systems.com
|
|
Re: Logfile Manipulation

And the problem I have with the code in my previous message is that I've discovered that the input logfiles aren't strictly ordered - ie there is variance by a second or so in some of the entries.

I can sort the biggest logfile (800M) using unix sort in about 1.5 mins on my workstation. That's not really fast enough, with potentially 12 other files....

Hrm...

S.

--
Stephen Nelson-Smith
Technical Director
Atalanta Systems Ltd
www.atalanta-systems.com
|
|
Re: Logfile Manipulation

On Mon, Nov 9, 2009 at 7:46 AM, Stephen Nelson-Smith <sanelson@...> wrote:

> And the problem I have with the below is that I've discovered that the
> input logfiles aren't strictly ordered - ie there is variance by a
> second or so in some of the entries.

Within a given set of 10 lines, is the first line and last line "in order" - i.e.

1 2 4 3 5 8 7 6 9 10

> I can sort the biggest logfile (800M) using unix sort in about 1.5
> mins on my workstation. That's not really fast enough, with
> potentially 12 other files....

If that's the case, then I'm pretty sure you can create sort of a queue system, and it should probably cut down on the sorting time. I don't know what the default python sorting algorithm is on a list, but AFAIK you'd be looking at a constant O(log 10) time on each insertion by doing something such as this:

    import itertools

    # logdata: any iterable of log entries (placeholder)
    log_generator = (d for d in logdata)
    mylist = list(itertools.islice(log_generator, 10))  # first ten values
    while mylist:
        mylist.sort()
        nextdata = mylist.pop(0)
        # Do something with nextdata
        try:
            mylist.append(next(log_generator))
        except StopIteration:
            pass  # input exhausted; keep draining the buffer
    print('done')

Or now that I look, python has a priority queue ( http://docs.python.org/library/heapq.html ) that you could use instead. Just push the next value into the queue and pop one out - you give it some initial qty - 10 or so, and then it will always give you the smallest value.

HTH,
Wayne
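
The priority-queue variant mentioned at the end of the message above could be sketched like this in Python 3. It assumes, as discussed in the thread, that no entry arrives more than a small number of positions late. The function name nearly_sorted and the window size are illustrative choices, and the sample values are just the seconds field of some made-up stamps.

```python
import heapq


def nearly_sorted(items, window=10):
    """Yield items in order, assuming none is more than `window` places late."""
    heap = []
    for item in items:
        heapq.heappush(heap, item)
        # Once the buffer is over-full, the smallest buffered item is safe to emit.
        if len(heap) > window:
            yield heapq.heappop(heap)
    # Input exhausted: drain whatever is still buffered, smallest first.
    while heap:
        yield heapq.heappop(heap)


seconds = [37, 37, 36, 37, 38, 37, 38, 36, 38, 39, 38, 39, 40, 41]
print(list(nearly_sorted(seconds, window=10)))
```

Because the heap never grows beyond a fixed size, the cost per line does not depend on the size of the file. If an entry turns up later than the window allows, it is emitted out of order, so the window needs to be chosen with some margin.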
|
|
Re: Logfile Manipulation

> I can sort the biggest logfile (800M) using unix sort in about 1.5
> mins on my workstation. That's not really fast enough, with
> potentially 12 other files....

You won't beat sort with Python. You have to be realistic, these are very big files!

Python should be faster overall but for specific tasks the Unix tools written in C will be faster.

But if you are merging multiple files into one then sorting them before processing will probably help. However if you expect to be pruning out more lines than you keep it might be easier just to throw all the data you want into a single file and then sort that at the end. It all depends on the data.

HTH,
Alan G
|
|
Re: Logfile Manipulation

On Mon, Nov 9, 2009 at 3:15 PM, Wayne Werner <waynejwerner@...> wrote:

> On Mon, Nov 9, 2009 at 7:46 AM, Stephen Nelson-Smith <sanelson@...>
> wrote:
>>
>> And the problem I have with the below is that I've discovered that the
>> input logfiles aren't strictly ordered - ie there is variance by a
>> second or so in some of the entries.
>
> Within a given set of 10 lines, is the first line and last line "in order" -

On average, in a sequence of 10 log lines, one will be out by one or two seconds.

Here's a random slice:

05/Nov/2009:01:41:37
05/Nov/2009:01:41:37
05/Nov/2009:01:41:37
05/Nov/2009:01:41:37
05/Nov/2009:01:41:36
05/Nov/2009:01:41:37
05/Nov/2009:01:41:37
05/Nov/2009:01:41:37
05/Nov/2009:01:41:37
05/Nov/2009:01:41:37
05/Nov/2009:01:41:37
05/Nov/2009:01:41:37
05/Nov/2009:01:41:36
05/Nov/2009:01:41:37
05/Nov/2009:01:41:37
05/Nov/2009:01:41:38
05/Nov/2009:01:41:38
05/Nov/2009:01:41:37
05/Nov/2009:01:41:38
05/Nov/2009:01:41:38
05/Nov/2009:01:41:38
05/Nov/2009:01:41:38
05/Nov/2009:01:41:37
05/Nov/2009:01:41:38
05/Nov/2009:01:41:36
05/Nov/2009:01:41:38
05/Nov/2009:01:41:38
05/Nov/2009:01:41:38
05/Nov/2009:01:41:38
05/Nov/2009:01:41:39
05/Nov/2009:01:41:38
05/Nov/2009:01:41:39
05/Nov/2009:01:41:39
05/Nov/2009:01:41:39
05/Nov/2009:01:41:39
05/Nov/2009:01:41:40
05/Nov/2009:01:41:40
05/Nov/2009:01:41:41

> I don't know
> what the default python sorting algorithm is on a list, but AFAIK you'd be
> looking at a constant O(log 10)

I'm not a mathematician - what does this mean, in layperson's terms?

> log_generator = (d for d in logdata)
> mylist = # first ten values

OK

> while True:
>     try:
>         mylist.sort()

OK - sort the first 10 values.

>         nextdata = mylist.pop(0)

So the first value...

>         mylist.append(log_generator.next())

Right, this will add another one value?

>     except StopIteration:
>         print 'done'

> Or now that I look, python has a priority queue (
> http://docs.python.org/library/heapq.html ) that you could use instead.
> Just push the next value into the queue and pop one out - you give it some
> initial qty - 10 or so, and then it will always give you the smallest value.

That sounds very cool - and I see that one of the activestate recipes Kent suggested uses heapq too. I'll have a play.

S.

--
Stephen Nelson-Smith
Technical Director
Atalanta Systems Ltd
www.atalanta-systems.com
|
|
Re: Logfile Manipulation

> > what the default python sorting algorithm is on a list, but AFAIK you'd be
> > looking at a constant O(log 10)
>
> I'm not a mathematician - what does this mean, in layperson's terms?

O(log n) is a way of expressing the efficiency of an algorithm. Its execution time is proportional to (in the Order of) the logarithm of the number of items, n. That is, the time to process 100 items will be about twice the time to process 10 (rather than 10 times), since log(100) is 2 and log(10) is 1.

These measures are indicative only, but the bottom line is that it will scale to high volumes better than a linear algorithm would.

HTH,
Alan G.
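
A small numeric illustration of the logarithm point may help. It only evaluates the formula and measures nothing; relative_cost is a made-up helper that compares the logarithmic and linear growth of a cost when the number of items goes from a baseline of 10 to n.

```python
import math


def relative_cost(n, baseline=10):
    """Return (logarithmic, linear) growth factors when going from baseline to n items."""
    return math.log10(n) / math.log10(baseline), n / baseline


for n in (10, 100, 1000, 1000000):
    log_factor, linear_factor = relative_cost(n)
    print(n, round(log_factor, 2), round(linear_factor, 2))
```

One caveat on the quoted phrase: with a buffer fixed at ten entries, "O(log 10)" is simply a constant, so the per-line cost of that approach does not grow with the size of the logfile at all.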