help with Haskell performance

View: New views
8 Messages — Rating Filter:   Alert me  

help with Haskell performance

by Gokul P. Nair :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi all,

The task I'm trying to accomplish:

Given a log file containing several lines of white space delimited entries like this:

[Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist
[Sat Oct 24 08:12:37 2009] [error] GET /url2 HTTP/1.0]: Requested URI does not exist
[Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist
[Sat Oct 24 12:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist

filter lines that match the string " 08:", extract the 6th, 7th and 8th words from that line, group all lines that have the the same resulting string, do a count on them and sort the result in descending order of counts and print it out. So in the example above we'd end up with an output like this:

("GET /url1 HTTP/1.1]:", 2)
("GET /url2 HTTP/1.0]:", 1)

Seems pretty straightforward, so I wrote a simple perl script to achieve this task (see the bottom of this email).

The input file is 335 MB in size and contains about 2 million log line entires in it. The perl script does a pretty decent job and finishes in about 3 seconds.

Now the interesting part. I decided to implememt this in Haskell (being my favorite language and all) and ended up with the following code:

--- begin haskell code ---

import Text.Regex.Posix ( (=~) )
import qualified Data.List as List  
import qualified Data.Map as Map
import qualified Data.ByteString.Lazy.Char8 as LB

main = do
  contents <- LB.readFile "log_file"
  putStr . unlines . map ( show . (\(x, y) -> ((LB.unpack x), y)) ) .
    -- create a Map grouping & counting matching tokens and sort based on the counts
    List.sortBy (\(_, x) (_, y) -> y `compare` x) . Map.toList . Map.fromListWith (+) . filtertokens .
    LB.lines $ contents
  where filtertokens = foldr (\x acc -> if (f x) then ((g x) : acc) else acc) []
          -- filter lines starting with " 08:"
          where f = (=~ " 08:") . LB.unpack
                -- extract tokens 6, 7 & 8 and create an association list like so ("GET /url2 HTTP/1.0]:", 1)
                g line = flip (,) 1 . LB.unwords . map (xs !!) $ [6, 7, 8] where xs = LB.words line

--- end haskell code ---

This haskell implementation takes a whopping 27 seconds to complete! About 9 times slower than the perl version! I'm using ghc 6.10.4, compiling with -O2 and even went to the extent of fusing an adjacent map and filter using a foldr like so: map f (filter g) => foldr ( if g x then f x ... ), fusing adjacents maps etc. Still the same result.

I really hope I'm missing some obvious optimization that's making it so slow compared to the perl version, hence this email soliciting feedback.

Thanks in advance.

P.S. For reference, here's my corresponding perl implementation:

--- start perl code ---

#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';

my %urls;
open(FILE, '<', $ARGV[0]);
while(<FILE>) {
    if (/ 08:/) {
        my @words = split;
        my $key = join(" ", ($words[6], $words[7], $words[8]));
        if (exists $urls{$key}) { $urls{$key}++ }
        else { $urls{$key} = 1 }
    }
}
for (sort { $urls{$b} <=> $urls{$a} } keys %urls) { print "($_, $urls{$_})\n" }

--- end perl code ---



_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: help with Haskell performance

by jkff :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

2009/11/8 Gokul P. Nair <gpnair78@...>

>
> Hi all,
>
> The task I'm trying to accomplish:
>
> Given a log file containing several lines of white space delimited entries like this:
>
> [Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist
> [Sat Oct 24 08:12:37 2009] [error] GET /url2 HTTP/1.0]: Requested URI does not exist
> [Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist
> [Sat Oct 24 12:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist
>
> filter lines that match the string " 08:", extract the 6th, 7th and 8th words from that line, group all lines that have the the same resulting string, do a count on them and sort the result in descending order of counts and print it out. So in the example above we'd end up with an output like this:
>
> ("GET /url1 HTTP/1.1]:", 2)
> ("GET /url2 HTTP/1.0]:", 1)
>
> Seems pretty straightforward, so I wrote a simple perl script to achieve this task (see the bottom of this email).
>
> The input file is 335 MB in size and contains about 2 million log line entires in it. The perl script does a pretty decent job and finishes in about 3 seconds.
>
> Now the interesting part. I decided to implememt this in Haskell (being my favorite language and all) and ended up with the following code:
>
> --- begin haskell code ---
>
> import Text.Regex.Posix ( (=~) )

First, you are using Text.Regex.Posix which is dog slow. Use
regex-tdfa or regex-pcre.
Second, you are using it on a String! Third, you are unpacking a lazy
bytestring for that! This alone is more than enough to make things
ridiculously slow.

>
> import qualified Data.List as List
> import qualified Data.Map as Map
> import qualified Data.ByteString.Lazy.Char8 as LB
>
> main = do
>   contents <- LB.readFile "log_file"
>   putStr . unlines . map ( show . (\(x, y) -> ((LB.unpack x), y)) ) .
This piece of code also does unpacking. If the output is small, it's
ok, otherwise assemble output in the form of a bytestring and print it
with B.putStr.

>     -- create a Map grouping & counting matching tokens and sort based on the counts
>     List.sortBy (\(_, x) (_, y) -> y `compare` x) . Map.toList . Map.fromListWith (+) . filtertokens .
The lambda can be replaced by "flip (comparing snd)"

>     LB.lines $ contents
Here, you should not use Map.fromListWith (+) because Map is not
strict in its entries and you end up having big fat thunks there.
You should use Map.fromListWith plus where x `plus` y = x `seq` y `seq` x+y.

>   where filtertokens = foldr (\x acc -> if (f x) then ((g x) : acc) else acc) []
>           -- filter lines starting with " 08:"
>           where f = (=~ " 08:") . LB.unpack
>                 -- extract tokens 6, 7 & 8 and create an association list like so ("GET /url2 HTTP/1.0]:", 1)
>                 g line = flip (,) 1 . LB.unwords . map (xs !!) $ [6, 7, 8] where xs = LB.words line
You are using random access on a list three times in a row.
map (xs!!) [6,7,8] is much faster when implemented as "take 3 (drop 6 xs)".


So.. Well. Try these suggestions, show the resulting performance and
then we'll see.

>
> --- end haskell code ---
>
> This haskell implementation takes a whopping 27 seconds to complete! About 9 times slower than the perl version! I'm using ghc 6.10.4, compiling with -O2 and even went to the extent of fusing an adjacent map and filter using a foldr like so: map f (filter g) => foldr ( if g x then f x ... ), fusing adjacents maps etc. Still the same result.
>
> I really hope I'm missing some obvious optimization that's making it so slow compared to the perl version, hence this email soliciting feedback.
>
> Thanks in advance.
>
> P.S. For reference, here's my corresponding perl implementation:
>
> --- start perl code ---
>
> #!/usr/bin/perl
> use strict;
> use warnings FATAL => 'all';
>
> my %urls;
> open(FILE, '<', $ARGV[0]);
> while(<FILE>) {
>     if (/ 08:/) {
>         my @words = split;
>         my $key = join(" ", ($words[6], $words[7], $words[8]));
>         if (exists $urls{$key}) { $urls{$key}++ }
>         else { $urls{$key} = 1 }
>     }
> }
> for (sort { $urls{$b} <=> $urls{$a} } keys %urls) { print "($_, $urls{$_})\n" }
>
> --- end perl code ---
>
>
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe@...
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>



--
Eugene Kirpichov
Web IR developer, market.yandex.ru
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: help with Haskell performance

by Felipe Lessa :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Sun, Nov 08, 2009 at 12:15:57AM +0300, Eugene Kirpichov wrote:
> Here, you should not use Map.fromListWith (+) because Map is not
> strict in its entries and you end up having big fat thunks there.
> You should use Map.fromListWith plus where x `plus` y = x `seq` y `seq` x+y.

fromListWith is implemented with insertWithKey; is defining plus
like above enough?  I would guess that we need fromListWith'
implemented with insertWithKey'.

The problem is that it will build a thunk (x `plus` y) and your
seq's only say that whenever you evaluate that thunk you'll
evaluate the arguments as well.  If the thunk itself isn't
needed, it won't be evaluated.

--
Felipe.
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: help with Haskell performance

by Don Stewart-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

gpnair78:
> I really hope I'm missing some obvious optimization that's making it so slow
> compared to the perl version, hence this email soliciting feedback.

Here's my first attempt. 1.5s on a 2M line log file in the format you give.

General notes:

    * unpack is almost always wrong.
    * list indexing with !! is almost always wrong.
    * words/lines are often wrong for parsing large files (they build large list structures).
    * toList/fromList probably aren't the best strategy
    * sortBy (comparing snd)
    * use insertWith'

Spefically, avoid constructing intermediate lists, when you can process the
entire file in a single pass. Use O(1) bytestring substring operations like
take and drop.

Compiling:

    $ ghc -O2 /tmp/B.hs --make

Running:

    $ time /tmp/B
    ("GET /url1 HTTP/1.1]",1000000)
    ("GET /url2 HTTP/1.0]",500000)
    /tmp/B  1.38s user 0.21s system 99% cpu 1.595 total

And the code:


    {-# LANGUAGE OverloadedStrings #-}
    {-# LANGUAGE BangPatterns      #-}

    import qualified Data.ByteString.Char8 as L
    import qualified Data.Map as M

    main = do
        l <- L.readFile "/tmp/x"
        mapM_ print . M.toList $ go l M.empty
      where
        go !s !acc
            | L.null s                 = acc
            | " 08:" `L.isPrefixOf` s1 = go s4 acc'
            | otherwise                = go s4 acc
                where
                    s1     = L.drop 11 s    -- drop prefix to timestamp

                    -- now extract the key
                    (_,s2) = L.breakSubstring "GET" s1
                    (k,s3) = L.break ((==) ':') s2

                    -- drop the rest of the line
                    s4     = L.tail (L.dropWhile ((/=) '\n') s3)

                    acc' = M.insertWith' (+) k 1 acc

------------------------------------------------------------------------
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: help with Haskell performance

by jkff :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Ah, you're right. Then we need a foldl' insertWith with a strict plus.

2009/11/8 Felipe Lessa <felipe.lessa@...>:

> On Sun, Nov 08, 2009 at 12:15:57AM +0300, Eugene Kirpichov wrote:
>> Here, you should not use Map.fromListWith (+) because Map is not
>> strict in its entries and you end up having big fat thunks there.
>> You should use Map.fromListWith plus where x `plus` y = x `seq` y `seq` x+y.
>
> fromListWith is implemented with insertWithKey; is defining plus
> like above enough?  I would guess that we need fromListWith'
> implemented with insertWithKey'.
>
> The problem is that it will build a thunk (x `plus` y) and your
> seq's only say that whenever you evaluate that thunk you'll
> evaluate the arguments as well.  If the thunk itself isn't
> needed, it won't be evaluated.
>
> --
> Felipe.
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe@...
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>



--
Eugene Kirpichov
Web IR developer, market.yandex.ru
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: help with Haskell performance

by Bugzilla from chaddai.fouche@gmail.com :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

2009/11/7 Eugene Kirpichov <ekirpichov@...>:
> Ah, you're right. Then we need a foldl' insertWith with a strict plus.

We "only" need a foldl' insertWith' : (+) is already strict for all
the numeric types in the Prelude.

--
Jedaï
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: help with Haskell performance

by Gokul P. Nair :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

--- On Sat, 11/7/09, Don Stewart <dons@...> wrote:

> General notes:
>
>  * unpack is almost always wrong.
>  * list indexing with !! is almost always wrong.
>  * words/lines are often wrong for parsing large files (they build large list structures).
>  * toList/fromList probably aren't the best strategy
>  * sortBy (comparing snd)
>  * use insertWith'
> Spefically, avoid constructing intermediate lists, when you can process the
> entire file in a single pass. Use O(1) bytestring substring operations like
> take and drop.
 
Thanks all for the valuable feedback. Switching from Regex.Posix to Regex.PCRE alone reduced the running time to about 6 secs and a few other optimizations suggested on this thread brought it down to about 5 secs ;)
 
I then set out to profile the code out of curiosity to see where the bulk of the time was being spent and sure enough the culprit turned out to be "unpack". My question therefore is, given a list L1 of type [(ByteString, Int)], how do I print it out so as to eliminate the "chunk, empty" markers associated with a bytestring? The suggestions posted here are along the lines of "mapM_ print L1" but that's far from desirable especially because the generated output is for perusal by non-technical users etc.
 
Thanks.


_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: help with Haskell performance

by Brad Larsen :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Tue, Nov 10, 2009 at 8:20 PM, Gokul P. Nair <gpnair78@...> wrote:
--- On Sat, 11/7/09, Don Stewart <dons@...> wrote:
> General notes:
>
>  * unpack is almost always wrong.
>  * list indexing with !! is almost always wrong.
>  * words/lines are often wrong for parsing large files (they build large list structures).
>  * toList/fromList probably aren't the best strategy
>  * sortBy (comparing snd)
>  * use insertWith'
> Spefically, avoid constructing intermediate lists, when you can process the
> entire file in a single pass. Use O(1) bytestring substring operations like
> take and drop.
 
Thanks all for the valuable feedback. Switching from Regex.Posix to Regex.PCRE alone reduced the running time to about 6 secs and a few other optimizations suggested on this thread brought it down to about 5 secs ;)
 
I then set out to profile the code out of curiosity to see where the bulk of the time was being spent and sure enough the culprit turned out to be "unpack". My question therefore is, given a list L1 of type [(ByteString, Int)], how do I print it out so as to eliminate the "chunk, empty" markers associated with a bytestring? The suggestions posted here are along the lines of "mapM_ print L1" but that's far from desirable especially because the generated output is for perusal by non-technical users etc.
 
Thanks.


Take a look at Data.ByteString.Lazy.Char8.putStrLn.  That prints a lazy ByteString without unpacking it, and without the internal markers.

Sincerely,
Brad


_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe