|
View:
New views
8 Messages
—
Rating Filter:
Alert me
|
|
|
help with Haskell performance
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@... http://www.haskell.org/mailman/listinfo/haskell-cafe |
|
|
|
Re: help with Haskell performance2009/11/8 Gokul P. Nair <gpnair78@...>
> > Hi all, > > The task I'm trying to accomplish: > > Given a log file containing several lines of white space delimited entries like this: > > [Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist > [Sat Oct 24 08:12:37 2009] [error] GET /url2 HTTP/1.0]: Requested URI does not exist > [Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist > [Sat Oct 24 12:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist > > filter lines that match the string " 08:", extract the 6th, 7th and 8th words from that line, group all lines that have the the same resulting string, do a count on them and sort the result in descending order of counts and print it out. So in the example above we'd end up with an output like this: > > ("GET /url1 HTTP/1.1]:", 2) > ("GET /url2 HTTP/1.0]:", 1) > > Seems pretty straightforward, so I wrote a simple perl script to achieve this task (see the bottom of this email). > > The input file is 335 MB in size and contains about 2 million log line entires in it. The perl script does a pretty decent job and finishes in about 3 seconds. > > Now the interesting part. I decided to implememt this in Haskell (being my favorite language and all) and ended up with the following code: > > --- begin haskell code --- > > import Text.Regex.Posix ( (=~) ) First, you are using Text.Regex.Posix which is dog slow. Use regex-tdfa or regex-pcre. Second, you are using it on a String! Third, you are unpacking a lazy bytestring for that! This alone is more than enough to make things ridiculously slow. > > import qualified Data.List as List > import qualified Data.Map as Map > import qualified Data.ByteString.Lazy.Char8 as LB > > main = do > contents <- LB.readFile "log_file" > putStr . unlines . map ( show . (\(x, y) -> ((LB.unpack x), y)) ) . This piece of code also does unpacking. If the output is small, it's ok, otherwise assemble output in the form of a bytestring and print it with B.putStr. > -- create a Map grouping & counting matching tokens and sort based on the counts > List.sortBy (\(_, x) (_, y) -> y `compare` x) . Map.toList . Map.fromListWith (+) . filtertokens . The lambda can be replaced by "flip (comparing snd)" > LB.lines $ contents Here, you should not use Map.fromListWith (+) because Map is not strict in its entries and you end up having big fat thunks there. You should use Map.fromListWith plus where x `plus` y = x `seq` y `seq` x+y. > where filtertokens = foldr (\x acc -> if (f x) then ((g x) : acc) else acc) [] > -- filter lines starting with " 08:" > where f = (=~ " 08:") . LB.unpack > -- extract tokens 6, 7 & 8 and create an association list like so ("GET /url2 HTTP/1.0]:", 1) > g line = flip (,) 1 . LB.unwords . map (xs !!) $ [6, 7, 8] where xs = LB.words line You are using random access on a list three times in a row. map (xs!!) [6,7,8] is much faster when implemented as "take 3 (drop 6 xs)". So.. Well. Try these suggestions, show the resulting performance and then we'll see. > > --- end haskell code --- > > This haskell implementation takes a whopping 27 seconds to complete! About 9 times slower than the perl version! I'm using ghc 6.10.4, compiling with -O2 and even went to the extent of fusing an adjacent map and filter using a foldr like so: map f (filter g) => foldr ( if g x then f x ... ), fusing adjacents maps etc. Still the same result. > > I really hope I'm missing some obvious optimization that's making it so slow compared to the perl version, hence this email soliciting feedback. > > Thanks in advance. > > P.S. For reference, here's my corresponding perl implementation: > > --- start perl code --- > > #!/usr/bin/perl > use strict; > use warnings FATAL => 'all'; > > my %urls; > open(FILE, '<', $ARGV[0]); > while(<FILE>) { > if (/ 08:/) { > my @words = split; > my $key = join(" ", ($words[6], $words[7], $words[8])); > if (exists $urls{$key}) { $urls{$key}++ } > else { $urls{$key} = 1 } > } > } > for (sort { $urls{$b} <=> $urls{$a} } keys %urls) { print "($_, $urls{$_})\n" } > > --- end perl code --- > > > > _______________________________________________ > Haskell-Cafe mailing list > Haskell-Cafe@... > http://www.haskell.org/mailman/listinfo/haskell-cafe > -- Eugene Kirpichov Web IR developer, market.yandex.ru _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@... http://www.haskell.org/mailman/listinfo/haskell-cafe |
|
|
|
Re: help with Haskell performanceOn Sun, Nov 08, 2009 at 12:15:57AM +0300, Eugene Kirpichov wrote:
> Here, you should not use Map.fromListWith (+) because Map is not > strict in its entries and you end up having big fat thunks there. > You should use Map.fromListWith plus where x `plus` y = x `seq` y `seq` x+y. fromListWith is implemented with insertWithKey; is defining plus like above enough? I would guess that we need fromListWith' implemented with insertWithKey'. The problem is that it will build a thunk (x `plus` y) and your seq's only say that whenever you evaluate that thunk you'll evaluate the arguments as well. If the thunk itself isn't needed, it won't be evaluated. -- Felipe. _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@... http://www.haskell.org/mailman/listinfo/haskell-cafe |
|
|
|
Re: help with Haskell performancegpnair78:
> I really hope I'm missing some obvious optimization that's making it so slow > compared to the perl version, hence this email soliciting feedback. Here's my first attempt. 1.5s on a 2M line log file in the format you give. General notes: * unpack is almost always wrong. * list indexing with !! is almost always wrong. * words/lines are often wrong for parsing large files (they build large list structures). * toList/fromList probably aren't the best strategy * sortBy (comparing snd) * use insertWith' Spefically, avoid constructing intermediate lists, when you can process the entire file in a single pass. Use O(1) bytestring substring operations like take and drop. Compiling: $ ghc -O2 /tmp/B.hs --make Running: $ time /tmp/B ("GET /url1 HTTP/1.1]",1000000) ("GET /url2 HTTP/1.0]",500000) /tmp/B 1.38s user 0.21s system 99% cpu 1.595 total And the code: {-# LANGUAGE OverloadedStrings #-} {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Char8 as L import qualified Data.Map as M main = do l <- L.readFile "/tmp/x" mapM_ print . M.toList $ go l M.empty where go !s !acc | L.null s = acc | " 08:" `L.isPrefixOf` s1 = go s4 acc' | otherwise = go s4 acc where s1 = L.drop 11 s -- drop prefix to timestamp -- now extract the key (_,s2) = L.breakSubstring "GET" s1 (k,s3) = L.break ((==) ':') s2 -- drop the rest of the line s4 = L.tail (L.dropWhile ((/=) '\n') s3) acc' = M.insertWith' (+) k 1 acc ------------------------------------------------------------------------ _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@... http://www.haskell.org/mailman/listinfo/haskell-cafe |
|
|
|
Re: help with Haskell performanceAh, you're right. Then we need a foldl' insertWith with a strict plus.
2009/11/8 Felipe Lessa <felipe.lessa@...>: > On Sun, Nov 08, 2009 at 12:15:57AM +0300, Eugene Kirpichov wrote: >> Here, you should not use Map.fromListWith (+) because Map is not >> strict in its entries and you end up having big fat thunks there. >> You should use Map.fromListWith plus where x `plus` y = x `seq` y `seq` x+y. > > fromListWith is implemented with insertWithKey; is defining plus > like above enough? I would guess that we need fromListWith' > implemented with insertWithKey'. > > The problem is that it will build a thunk (x `plus` y) and your > seq's only say that whenever you evaluate that thunk you'll > evaluate the arguments as well. If the thunk itself isn't > needed, it won't be evaluated. > > -- > Felipe. > _______________________________________________ > Haskell-Cafe mailing list > Haskell-Cafe@... > http://www.haskell.org/mailman/listinfo/haskell-cafe > -- Eugene Kirpichov Web IR developer, market.yandex.ru _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@... http://www.haskell.org/mailman/listinfo/haskell-cafe |
|
|
|
Re: help with Haskell performance2009/11/7 Eugene Kirpichov <ekirpichov@...>:
> Ah, you're right. Then we need a foldl' insertWith with a strict plus. We "only" need a foldl' insertWith' : (+) is already strict for all the numeric types in the Prelude. -- Jedaï _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@... http://www.haskell.org/mailman/listinfo/haskell-cafe |
|
|
|
Re: help with Haskell performance
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@... http://www.haskell.org/mailman/listinfo/haskell-cafe |
|
|
|
Re: help with Haskell performanceOn Tue, Nov 10, 2009 at 8:20 PM, Gokul P. Nair <gpnair78@...> wrote:
Take a look at Data.ByteString.Lazy.Char8.putStrLn. That prints a lazy ByteString without unpacking it, and without the internal markers. Sincerely, Brad _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@... http://www.haskell.org/mailman/listinfo/haskell-cafe |
| Free embeddable forum powered by Nabble | Forum Help |