
I have the below program, and I'm trying to run it on an input of about 90MB. It eats RAM like crazy, and I can't figure out why. I do know that the problem is not my custwords function (as you can see, I replaced the call to it with a call to the standard words function on the last line). It seems to be wordfreq, but I don't know why. For information about the problem, see http://changelog.complete.org/posts/535-A-Haskell-solution-to-Lars-Problem.h... Here's the code: import System.Environment import Data.List import Data.Char import qualified Data.Map as Map custwords = filter (/= "") . lines . map (conv . toLower) where iswordchar x = isAlphaNum x && isAscii x conv x = if iswordchar x then x else '\n' wordfreq inp = Map.toList $ foldl' updatemap (Map.empty::Map.Map String Int) inp where updatemap nm word = Map.insertWith updatefunc word 1 nm updatefunc _ x = x + 1 freqsort (w1, c1) (w2, c2) = if c1 == c2 then compare w1 w2 else compare c2 c1 showit (word, count) = show count ++ " " ++ word main = do args <- getArgs interact $ unlines . map showit . take (read . head $ args) . sortBy freqsort . wordfreq . words

The x+1 looks suspicious. On Sep 4, 2006, at 23:03 , John Goerzen wrote:
I have the below program, and I'm trying to run it on an input of about 90MB. It eats RAM like crazy, and I can't figure out why.
I do know that the problem is not my custwords function (as you can see, I replaced the call to it with a call to the standard words function on the last line). It seems to be wordfreq, but I don't know why.
For information about the problem, see http://changelog.complete.org/posts/535-A-Haskell-solution-to-Lars- Problem.html
Here's the code:
import System.Environment import Data.List import Data.Char import qualified Data.Map as Map
custwords = filter (/= "") . lines . map (conv . toLower) where iswordchar x = isAlphaNum x && isAscii x conv x = if iswordchar x then x else '\n'
wordfreq inp = Map.toList $ foldl' updatemap (Map.empty::Map.Map String Int) inp where updatemap nm word = Map.insertWith updatefunc word 1 nm updatefunc _ x = x + 1
freqsort (w1, c1) (w2, c2) = if c1 == c2 then compare w1 w2 else compare c2 c1
showit (word, count) = show count ++ " " ++ word main = do args <- getArgs interact $ unlines . map showit . take (read . head $ args) . sortBy freqsort . wordfreq . words
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

At Tue, 5 Sep 2006 03:03:51 +0000 (UTC), John Goerzen wrote:
I have the below program, and I'm trying to run it on an input of about 90MB. It eats RAM like crazy, and I can't figure out why.
I have not looked in detail at your code -- but it could simply be the fact that String requires gobs of memory to store a string. If you forced all 90MB into memory at once, I would expect it to take almost of gig of RAM. (Around a 10-11 fold increase in size). I suspect this line could be forcing the whole thing into memory:
wordfreq = map (\x -> (head x, length x)) . group . sort
because sort can not return the first element until it has looked at all the elements in the list to determine which one should be first. If you fold a Data.Map or associative list over the word-list, then you could probably get the lazy behaviour you expect. j.

jeremy.shaw:
At Tue, 5 Sep 2006 03:03:51 +0000 (UTC), John Goerzen wrote:
I have the below program, and I'm trying to run it on an input of about 90MB. It eats RAM like crazy, and I can't figure out why.
I have not looked in detail at your code -- but it could simply be the fact that String requires gobs of memory to store a string. If you forced all 90MB into memory at once, I would expect it to take almost of gig of RAM. (Around a 10-11 fold increase in size).
I suspect this line could be forcing the whole thing into memory:
wordfreq = map (\x -> (head x, length x)) . group . sort
because sort can not return the first element until it has looked at all the elements in the list to determine which one should be first.
If you fold a Data.Map or associative list over the word-list, then you could probably get the lazy behaviour you expect.
A quick hack up to use Data.ByteString uses a lot less ram, though profiling still shows 95% of time spent in the building the Map. import System.Environment import Data.Char import Data.List import qualified Data.Map as Map import qualified Data.ByteString.Char8 as B import Data.ByteString (ByteString) wordfreq inp = Map.toList $ foldl' k m inp where m = Map.empty :: Map.Map ByteString Int k n w = Map.insertWith f w 1 n f _ x = let y = x + 1 in y `seq` y freqsort (w1, c1) (w2, c2) | c1 == c2 = compare w1 w2 | otherwise = compare c2 c1 showit (w, c) = B.join (B.singleton ' ') [B.pack(show c), w] main :: IO () main = do args <- getArgs B.interact $ B.unlines . map showit . take (read . head $ args) . sortBy freqsort . wordfreq . B.words -- Don

Donald Bruce Stewart wrote:
A quick hack up to use Data.ByteString uses a lot less ram, though profiling still shows 95% of time spent in the building the Map.
Nice!
k n w = Map.insertWith f w 1 n f _ x = let y = x + 1 in y `seq` y
y `seq` y is semantically equivalent to y though. The strictness would have to be built into insertWith to make this work. regards, Bertram

Hello Bertram, Tuesday, September 5, 2006, 12:24:57 PM, you wrote:
A quick hack up to use Data.ByteString uses a lot less ram, though profiling still shows 95% of time spent in the building the Map.
Data.HashTable may be a faster alternative for Map (if ordering isn't required) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Bertram,
Tuesday, September 5, 2006, 12:24:57 PM, you wrote:
A quick hack up to use Data.ByteString uses a lot less ram, though profiling still shows 95% of time spent in the building the Map.
Data.HashTable may be a faster alternative for Map (if ordering isn't required)
I found Data.HashTable a bit slow (ghc 6.4). Perhaps HsJudy (see http://cmarcelo.blogspot.com/ and http://judy.sourceforge.net/ and http://www.mail-archive.com/haskell@haskell.org/msg18766.html )

On Sep 5, 2006, at 7:05 AM, Chris Kuklewicz wrote:
Bulat Ziganshin wrote:
Hello Bertram, Tuesday, September 5, 2006, 12:24:57 PM, you wrote:
A quick hack up to use Data.ByteString uses a lot less ram, though profiling still shows 95% of time spent in the building the Map. Data.HashTable may be a faster alternative for Map (if ordering isn't required)
I found Data.HashTable a bit slow (ghc 6.4). Perhaps HsJudy (see http://cmarcelo.blogspot.com/ and http://judy.sourceforge.net/ and http://www.mail-archive.com/haskell@haskell.org/msg18766.html )
I'd urge programmers to give the version of Data.HashTable in 6.6 a try. It uses a simple multiplicative hash function (a la Knuth) which seems to be dramatically better in practice. It also uses a rather simpler hash table implementation which seems to perform slightly better in practice (if this isn't true for your application I'm keen to know). As Udo Stenzel points out, we still need to examine the entire string in order to hash, and some problems may do better with something like a StringMap---I understand many information retrieval applications use Trie-like data structures for exactly this reason. -Jan-Willem Maessen
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Bulat Ziganshin wrote:
Data.HashTable may be a faster alternative for Map (if ordering isn't required)
Or it may not. Finding a good hash function for the words John is counting, is a challenge itself. Finding a good one that doesn't look at each character at least once, might be outright impossible. That means, a hash table cannot do significantly less work than the appropriate data structure, which is a trie, aka Data.StringMap. Udo. -- Q: Why do mountain climbers rope themselves together? A: To prevent the sensible ones from going home.

On 2006-09-05, Bulat Ziganshin
Hello Bertram,
Tuesday, September 5, 2006, 12:24:57 PM, you wrote:
A quick hack up to use Data.ByteString uses a lot less ram, though profiling still shows 95% of time spent in the building the Map.
Data.HashTable may be a faster alternative for Map (if ordering isn't required)
Indeed ordering wasn't required, but as HashTable lives in the IO monad, it's not, well, very "Haskellish". -- John

At Mon, 04 Sep 2006 22:05:57 -0700, Jeremy Shaw wrote:
At Tue, 5 Sep 2006 03:03:51 +0000 (UTC), John Goerzen wrote:
I have the below program, and I'm trying to run it on an input of about 90MB. It eats RAM like crazy, and I can't figure out why.
If you fold a Data.Map or associative list over the word-list, then you could probably get the lazy behaviour you expect.
Oops, I got distracted and looked at the version on the top of the site, not your email :) Anyway, I will still consider the possibility that the hold String is being forced into memory. j.

John Goerzen wrote:
I have the below program, and I'm trying to run it on an input of about 90MB. It eats RAM like crazy, and I can't figure out why.
wordfreq inp = Map.toList $ foldl' updatemap (Map.empty::Map.Map String Int) inp where updatemap nm word = Map.insertWith updatefunc word 1 nm updatefunc _ x = x + 1
The culprit is insertWith, it inserts unevaluated thunks into your map where you want a simple value. To avoid a space leak, you want a strict update function (yours is strict enough) and insertWith must be strict in the newly inserted value (the result of applying updatefunc). Since you cannot influence the strictness of insertWith, no matter how many seqs you sprinkle through your code, you need insertWith', which is missing. You can simulate it, however: insertWith' f k v m = case Map.lookup k m of Nothing -> Map.insert k v m Just w -> (Map.insert k $! f w v) m IMHO all accumulating functions, especially foldl, State.update, Map.insertWith, accumArray, absolutely need a strict version, because the strictness cannot be recovered by the library's user. If the clutter of too many primed names is unbearable, leave out the _lazy_ version. It's useless IME and lazyness can be recovered if the need arises. Udo. -- Wo die Macht geistlos ist, ist der Geist machtlos. -- aus einem Gipfelbuch

On 2006-09-05, Udo Stenzel
The culprit is insertWith, it inserts unevaluated thunks into your map
This turned out to be the answer -- thanks! I posted a new version of the code here: http://changelog.complete.org/posts/536-Another-Haskell-Solution-to-Lars-Pro... This particular test set was a few hundred copies of the GPL. So the Map was fairly small, since each word occured many, many times -- but there weren't a whole lot of words. So the problem was not the inefficiency of Data.Map, nor the inefficiency of Strings (though that inefficiency does explain why the Python solution is faster, I'm sure). -- John

On Tue, Sep 05, 2006 at 12:55:48PM +0200, Udo Stenzel wrote:
IMHO all accumulating functions, especially foldl, State.update, Map.insertWith, accumArray, absolutely need a strict version, because the strictness cannot be recovered by the library's user.
We already have foldl'. Here's a strict version of fmap: import Control.Applicative import Data.Traversable newtype Strict a = Strict a getStrict (Strict x) = x instance Functor Strict where fmap f (Strict x) = Strict (f x) -- doesn't quite satisfy the Applicative laws instance Applicative Strict where pure x = Strict x Strict f <*> Strict x = Strict (f $! x) fmap' :: Traversable f => (a -> b) -> f a -> f b fmap' f t = getStrict (traverse (Strict . f) t)
participants (10)
-
Bertram Felgenhauer
-
Bulat Ziganshin
-
Chris Kuklewicz
-
dons@cse.unsw.edu.au
-
Jan-Willem Maessen
-
Jeremy Shaw
-
John Goerzen
-
Lennart Augustsson
-
Ross Paterson
-
Udo Stenzel