Why does this program eat RAM?

View: New views
14 Messages — Rating Filter:   Alert me  

Why does this program eat RAM?

by John Goerzen-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

I have the below program, and I'm trying to run it on an input of about
90MB.  It eats RAM like crazy, and I can't figure out why.

I do know that the problem is not my custwords function (as you can see,
I replaced the call to it with a call to the standard words function on
the last line).  It seems to be wordfreq, but I don't know why.

For information about the problem, see
http://changelog.complete.org/posts/535-A-Haskell-solution-to-Lars-Problem.html

Here's the code:

import System.Environment
import Data.List
import Data.Char
import qualified Data.Map as Map

custwords = filter (/= "") . lines . map (conv . toLower)
    where iswordchar x = isAlphaNum x && isAscii x
          conv x = if iswordchar x then x else '\n'

wordfreq inp = Map.toList $ foldl' updatemap (Map.empty::Map.Map String Int) inp
    where updatemap nm word = Map.insertWith updatefunc word 1 nm
          updatefunc _ x = x + 1

freqsort (w1, c1) (w2, c2) = if c1 == c2
                                 then compare w1 w2
                                 else compare c2 c1

showit (word, count) = show count ++ " " ++ word
main = do args <- getArgs
          interact $ unlines . map showit . take (read . head $ args) .
                     sortBy freqsort . wordfreq . words


_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: Why does this program eat RAM?

by Lennart Augustsson :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

The x+1 looks suspicious.

On Sep 4, 2006, at 23:03 , John Goerzen wrote:

> I have the below program, and I'm trying to run it on an input of  
> about
> 90MB.  It eats RAM like crazy, and I can't figure out why.
>
> I do know that the problem is not my custwords function (as you can  
> see,
> I replaced the call to it with a call to the standard words  
> function on
> the last line).  It seems to be wordfreq, but I don't know why.
>
> For information about the problem, see
> http://changelog.complete.org/posts/535-A-Haskell-solution-to-Lars- 
> Problem.html
>
> Here's the code:
>
> import System.Environment
> import Data.List
> import Data.Char
> import qualified Data.Map as Map
>
> custwords = filter (/= "") . lines . map (conv . toLower)
>     where iswordchar x = isAlphaNum x && isAscii x
>           conv x = if iswordchar x then x else '\n'
>
> wordfreq inp = Map.toList $ foldl' updatemap (Map.empty::Map.Map  
> String Int) inp
>     where updatemap nm word = Map.insertWith updatefunc word 1 nm
>           updatefunc _ x = x + 1
>
> freqsort (w1, c1) (w2, c2) = if c1 == c2
>                                  then compare w1 w2
>                                  else compare c2 c1
>
> showit (word, count) = show count ++ " " ++ word
> main = do args <- getArgs
>           interact $ unlines . map showit . take (read . head $  
> args) .
>                      sortBy freqsort . wordfreq . words
>
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe@...
> http://www.haskell.org/mailman/listinfo/haskell-cafe

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: Why does this program eat RAM?

by Jeremy Shaw :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

At Tue, 5 Sep 2006 03:03:51 +0000 (UTC),
John Goerzen wrote:
>
> I have the below program, and I'm trying to run it on an input of about
> 90MB.  It eats RAM like crazy, and I can't figure out why.

I have not looked in detail at your code -- but it could simply be the
fact that String requires gobs of memory to store a string. If you
forced all 90MB into memory at once, I would expect it to take almost
of gig of RAM. (Around a 10-11 fold increase in size).

I suspect this line could be forcing the whole thing into memory:

> wordfreq = map (\x -> (head x, length x)) . group . sort

because sort can not return the first element until it has looked at
all the elements in the list to determine which one should be first.

If you fold a Data.Map or associative list over the word-list, then
you could probably get the lazy behaviour you expect.

j.
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: Why does this program eat RAM?

by Donald Bruce Stewart :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

jeremy.shaw:

> At Tue, 5 Sep 2006 03:03:51 +0000 (UTC),
> John Goerzen wrote:
> >
> > I have the below program, and I'm trying to run it on an input of about
> > 90MB.  It eats RAM like crazy, and I can't figure out why.
>
> I have not looked in detail at your code -- but it could simply be the
> fact that String requires gobs of memory to store a string. If you
> forced all 90MB into memory at once, I would expect it to take almost
> of gig of RAM. (Around a 10-11 fold increase in size).
>
> I suspect this line could be forcing the whole thing into memory:
>
> > wordfreq = map (\x -> (head x, length x)) . group . sort
>
> because sort can not return the first element until it has looked at
> all the elements in the list to determine which one should be first.
>
> If you fold a Data.Map or associative list over the word-list, then
> you could probably get the lazy behaviour you expect.

A quick hack up to use Data.ByteString uses a lot less ram, though
profiling still shows 95% of time spent in the building the Map.

    import System.Environment
    import Data.Char
    import Data.List
    import qualified Data.Map as Map

    import qualified Data.ByteString.Char8 as B
    import Data.ByteString (ByteString)

    wordfreq inp = Map.toList $ foldl' k m inp
        where
          m     = Map.empty :: Map.Map ByteString Int
          k n w = Map.insertWith f w 1 n
          f _ x = let y = x + 1 in y `seq` y

    freqsort (w1, c1) (w2, c2) | c1 == c2  = compare w1 w2
                               | otherwise = compare c2 c1

    showit (w, c) = B.join (B.singleton ' ') [B.pack(show c), w]

    main :: IO ()
    main = do args <- getArgs
              B.interact $ B.unlines . map showit . take (read . head $ args)
                         . sortBy freqsort . wordfreq . B.words

-- Don
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: Why does this program eat RAM?

by Jeremy Shaw :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

At Mon, 04 Sep 2006 22:05:57 -0700,
Jeremy Shaw wrote:
>
> At Tue, 5 Sep 2006 03:03:51 +0000 (UTC),
> John Goerzen wrote:
> >
> > I have the below program, and I'm trying to run it on an input of about
> > 90MB.  It eats RAM like crazy, and I can't figure out why.

> If you fold a Data.Map or associative list over the word-list, then
> you could probably get the lazy behaviour you expect.

Oops, I got distracted and looked at the version on the top of the
site, not your email :)

Anyway, I will still consider the possibility that the hold String is
being forced into memory.

j.

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: Why does this program eat RAM?

by Bertram Felgenhauer-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Donald Bruce Stewart wrote:
> A quick hack up to use Data.ByteString uses a lot less ram, though
> profiling still shows 95% of time spent in the building the Map.

Nice!

>           k n w = Map.insertWith f w 1 n
>           f _ x = let y = x + 1 in y `seq` y

y `seq` y is semantically equivalent to y though. The strictness
would have to be built into insertWith to make this work.

regards,

Bertram
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re[2]: Why does this program eat RAM?

by Bulat Ziganshin-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hello Bertram,

Tuesday, September 5, 2006, 12:24:57 PM, you wrote:

>> A quick hack up to use Data.ByteString uses a lot less ram, though
>> profiling still shows 95% of time spent in the building the Map.

Data.HashTable may be a faster alternative for Map (if ordering isn't
required)


--
Best regards,
 Bulat                            mailto:Bulat.Ziganshin@...

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: Why does this program eat RAM?

by Udo Stenzel :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

John Goerzen wrote:
> I have the below program, and I'm trying to run it on an input of about
> 90MB.  It eats RAM like crazy, and I can't figure out why.
>
> wordfreq inp = Map.toList $ foldl' updatemap (Map.empty::Map.Map String Int) inp
>     where updatemap nm word = Map.insertWith updatefunc word 1 nm
>           updatefunc _ x = x + 1

The culprit is insertWith, it inserts unevaluated thunks into your map
where you want a simple value.  To avoid a space leak, you want a strict
update function (yours is strict enough) and insertWith must be strict
in the newly inserted value (the result of applying updatefunc).  Since
you cannot influence the strictness of insertWith, no matter how many
seqs you sprinkle through your code, you need insertWith', which is
missing.  You can simulate it, however:

insertWith' f k v m = case Map.lookup k m of
        Nothing -> Map.insert k v m
        Just w  -> (Map.insert k $! f w v) m

IMHO all accumulating functions, especially foldl, State.update,
Map.insertWith, accumArray, absolutely need a strict version, because
the strictness cannot be recovered by the library's user.  If the
clutter of too many primed names is unbearable, leave out the _lazy_
version.  It's useless IME and lazyness can be recovered if the need
arises.


Udo.
--
Wo die Macht geistlos ist, ist der Geist machtlos.
        -- aus einem Gipfelbuch


_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

signature.asc (196 bytes) Download Attachment

Re: Why does this program eat RAM?

by ChrisK-5 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Bulat Ziganshin wrote:

> Hello Bertram,
>
> Tuesday, September 5, 2006, 12:24:57 PM, you wrote:
>
>>> A quick hack up to use Data.ByteString uses a lot less ram, though
>>> profiling still shows 95% of time spent in the building the Map.
>
> Data.HashTable may be a faster alternative for Map (if ordering isn't
> required)
>
>

I found Data.HashTable a bit slow (ghc 6.4).  Perhaps HsJudy (see
http://cmarcelo.blogspot.com/ and http://judy.sourceforge.net/ and
http://www.mail-archive.com/haskell@.../msg18766.html )
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: Re[2]: Why does this program eat RAM?

by Udo Stenzel :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Bulat Ziganshin wrote:
> Data.HashTable may be a faster alternative for Map (if ordering isn't
> required)

Or it may not.  Finding a good hash function for the words John is
counting, is a challenge itself.  Finding a good one that doesn't look
at each character at least once, might be outright impossible.  That
means, a hash table cannot do significantly less work than the
appropriate data structure, which is a trie, aka Data.StringMap.


Udo.
--
Q:      Why do mountain climbers rope themselves together?
A:      To prevent the sensible ones from going home.


_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

signature.asc (196 bytes) Download Attachment

Re: Why does this program eat RAM?

by John Goerzen-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On 2006-09-05, Udo Stenzel <u.stenzel@...> wrote:
> The culprit is insertWith, it inserts unevaluated thunks into your map

This turned out to be the answer -- thanks!

I posted a new version of the code here:

http://changelog.complete.org/posts/536-Another-Haskell-Solution-to-Lars-Problem.html

This particular test set was a few hundred copies of the GPL.  So the
Map was fairly small, since each word occured many, many times -- but
there weren't a whole lot of words.  So the problem was not the
inefficiency of Data.Map, nor the inefficiency of Strings (though that
inefficiency does explain why the Python solution is faster, I'm sure).

-- John


_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: Re[2]: Why does this program eat RAM?

by John Goerzen-3 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On 2006-09-05, Bulat Ziganshin <bulat.ziganshin@...> wrote:
> Hello Bertram,
>
> Tuesday, September 5, 2006, 12:24:57 PM, you wrote:
>
>>> A quick hack up to use Data.ByteString uses a lot less ram, though
>>> profiling still shows 95% of time spent in the building the Map.
>
> Data.HashTable may be a faster alternative for Map (if ordering isn't
> required)

Indeed ordering wasn't required, but as HashTable lives in the IO monad,
it's not, well, very "Haskellish".

-- John

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: Why does this program eat RAM?

by Ross Paterson :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Tue, Sep 05, 2006 at 12:55:48PM +0200, Udo Stenzel wrote:
> IMHO all accumulating functions, especially foldl, State.update,
> Map.insertWith, accumArray, absolutely need a strict version, because
> the strictness cannot be recovered by the library's user.

We already have foldl'.  Here's a strict version of fmap:

import Control.Applicative
import Data.Traversable

newtype Strict a = Strict a
getStrict (Strict x) = x

instance Functor Strict where
        fmap f (Strict x) = Strict (f x)

-- doesn't quite satisfy the Applicative laws
instance Applicative Strict where
        pure x = Strict x
        Strict f <*> Strict x = Strict (f $! x)

fmap' :: Traversable f => (a -> b) -> f a -> f b
fmap' f t = getStrict (traverse (Strict . f) t)

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: Why does this program eat RAM?

by Jan-Willem Maessen :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Sep 5, 2006, at 7:05 AM, Chris Kuklewicz wrote:

> Bulat Ziganshin wrote:
>> Hello Bertram,
>> Tuesday, September 5, 2006, 12:24:57 PM, you wrote:
>>>> A quick hack up to use Data.ByteString uses a lot less ram, though
>>>> profiling still shows 95% of time spent in the building the Map.
>> Data.HashTable may be a faster alternative for Map (if ordering isn't
>> required)
>
> I found Data.HashTable a bit slow (ghc 6.4).  Perhaps HsJudy (see  
> http://cmarcelo.blogspot.com/ and http://judy.sourceforge.net/ and  
> http://www.mail-archive.com/haskell@.../msg18766.html )
I'd urge programmers to give the version of Data.HashTable in 6.6 a  
try.  It uses a simple multiplicative hash function (a la Knuth)  
which seems to be dramatically better in practice.  It also uses a  
rather simpler hash table implementation which seems to perform  
slightly better in practice (if this isn't true for your application  
I'm keen to know).

As Udo Stenzel points out, we still need to examine the entire string  
in order to hash, and some problems may do better with something like  
a StringMap---I understand many information retrieval applications  
use Trie-like data structures for exactly this reason.

-Jan-Willem Maessen

> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe@...
> http://www.haskell.org/mailman/listinfo/haskell-cafe



_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@...
http://www.haskell.org/mailman/listinfo/haskell-cafe

smime.p7s (3K) Download Attachment