
Hello list,

I'm currently writing a small linguistic corpus analyzer. My input file is only 25MB, but profiling shows that the total allocation over the program's runtime is several GB. That's a little too much. Adding to that, the program is abysmally slow, so I suspect a space leak somewhere.

I'm using the Data.ByteString.Lazy.Char8 module in order to work efficiently with lazy IO, and I must admit that I'm very inexperienced at predicting the runtime and space behaviour of lazy IO :-( It worked well in the past, but I'm stuck now.

The program can be found here: http://hpaste.org/fastcgi/hpaste.fcgi/view?id=11863#a11863

The important bits are as follows:
mf :: [C.ByteString] -> StdWord
mf [] = Word [] C.empty
mf s  = Word (tail s) (head s)
f' = mf . reverse . C.words
main :: IO ()
main = do
  corpus_name <- liftM head getArgs
  corpus <- liftM (Corpus . (map f') . C.lines) $ C.readFile corpus_name
  print $ length (content corpus)
  let interesting = filterForInterestingTags interestingTags corpus
  print $ show (freqMap interesting)
freqMap also uses foldr' to fold the corpus it is given into a Data.Map, but according to the profiling output, that's not where the bad stuff happens. Here are the interesting parts of the profiling output:
analyse +RTS -p -RTS corpus/bnc-samp.tt
total time  =          30.46 secs  (1523 ticks @ 20 ms)
total alloc = 14,871,619,904 bytes (excludes profiling overheads)
COST CENTRE                 MODULE  no.    entries   %time %alloc   %time %alloc

MAIN                        MAIN    1            0     0.0    0.0   100.0  100.0
 main                       Main    221          0     0.0    0.0     0.0    0.0
 CAF                        Main    206         14     0.1    0.1   100.0  100.0
  showsPrec_aSF             Main    223          1     0.0    0.0     0.0    0.0
  interestingTags           Main    218          1     0.0    0.0     0.0    0.0
  f'                        Main    216          1     0.0    0.0     0.0    0.0
   mf                       Main    217          1     0.0    0.0     0.0    0.0
  main                      Main    212          1     2.4    4.4    99.9   99.9
   f'                       Main    219          0    89.0   91.5    90.2   92.7
    mf                      Main    220    2427450     1.2    1.2     1.2    1.2
   filterForInterestingTags Main    215          1     5.4    0.1     5.4    0.1
   freqMap                  Main    214          1     0.6    0.5     2.0    2.7
    compare_aRL             Main    222    1141996     1.4    2.2     1.4    2.2
Obviously the main cost centre seems to be f' (which I've factored out in order to view it separately as a cost centre). It's not the call to reverse that makes it slow (removing it doesn't affect the run time). Interestingly, the program prints the first statement (the length) very quickly, then takes a lot of time.

I've removed the call to freqMap, and it seems that GHC is smart enough to drop the call to f' completely, because only the length ever gets evaluated. But freqMap also needs the processing from f', and that's where the bad stuff starts happening.

Should I look into DeepSeq? I've tried adding strictness to the functions by hand, and also to the data structures, but that didn't seem to help so far. So I'm looking for solutions to make this faster. I'm guessing that a smart `seq` somewhere might help, but I don't know where :-\

As a side note: how can I find out what the run-time data structures look like in memory? I.e. I want to know *what* exactly stays around as thunks so that I can focus better on where to add strictness or redirect the program flow. Ideally, only a very small part of the file should ever be in memory, with processing happening incrementally!

Thanks for any pointers!
Aleks
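For illustration, here is one place strictness annotations can go. This is a sketch only: it uses String instead of the paste's ByteString, and the shape of StdWord is assumed from the snippet above, not taken from the paste. The strict fields (!) mean that forcing a Word to WHNF also forces its tag list and its form, so the unevaluated (tail s) and (head s) thunks from mf cannot pile up behind the constructor.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Assumed shape of the paste's StdWord, with strict fields added.
-- Forcing a Word now also forces both components to WHNF.
data StdWord = Word ![String] !String
  deriving (Eq, Show)

-- Pattern matching (x:xs) instead of head/tail also avoids building
-- the head/tail thunks in the first place.
mf :: [String] -> StdWord
mf []     = Word [] ""
mf (x:xs) = Word xs x

f' :: String -> StdWord
f' = mf . reverse . words
```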

Excerpts from Aleksandar Dimitrov's message of Thu Nov 05 12:01:11 +0100 2009:
> Hello list,
>
> I'm currently writing a small linguistic corpus analyzer. My input file is only 25MB, but profiling shows that the total allocation over the program's runtime is several GB. That's a little too much. Adding to that, the program is abysmally slow, so I suspect a space leak somewhere.
>
> I'm using the Data.ByteString.Lazy.Char8 module in order to work efficiently with lazy IO, and I must admit that I'm very inexperienced at predicting the runtime and space behaviour of lazy IO :-( It worked well in the past, but I'm stuck now.
>
> The program can be found here: http://hpaste.org/fastcgi/hpaste.fcgi/view?id=11863#a11863
I've added a revision of your code and some highlights. The main one is to suggest foldl' instead of foldr'. However, without the input text, helping you improve this further is harder.

-- 
Nicolas Pouillard
http://nicolaspouillard.fr
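To illustrate the foldl' suggestion, here is a minimal sketch of a strictly built frequency map (not the paste's actual freqMap, whose definition isn't shown in this thread), using the strict API of Data.Map:

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- foldl' forces the accumulator at every step, and Data.Map.Strict
-- forces each count as it is inserted, so the fold runs without
-- leaving a chain of (+) thunks inside the map. foldr' associates
-- the other way and must reach the end of the list before it can
-- start combining, which is the wrong shape for an accumulator.
freqMap :: Ord a => [a] -> M.Map a Int
freqMap = foldl' (\m w -> M.insertWith (+) w 1 m) M.empty
```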

Aleksandar Dimitrov wrote:
> The important bits are as follows:
>
> mf :: [C.ByteString] -> StdWord
> mf [] = Word [] C.empty
> mf s  = Word (tail s) (head s)
>
> f' = mf . reverse . C.words
>
> main :: IO ()
> main = do
>   corpus_name <- liftM head getArgs
>   corpus <- liftM (Corpus . (map f') . C.lines) $ C.readFile corpus_name
>   print $ length (content corpus)
>   let interesting = filterForInterestingTags interestingTags corpus
>   print $ show (freqMap interesting)
>
> [...]
>
> Ideally, only a very small part of the file should ever be in memory, with processing happening incrementally!
The `print $ length (content corpus)` statement seems contradictory to your goal? After all, the whole file is read into the corpus variable in order to calculate its length.

Regards,
apfelmus

-- 
http://apfelmus.nfshost.com
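A sketch of what "incremental" could look like here: a single left fold over the lines that carries both the line count and the frequency map in one strict accumulator, so nothing retains a reference to the head of the lazily read list. (Names are illustrative; the tag filtering from the original program is omitted.)

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy.Char8 as C
import qualified Data.Map.Strict as M
import Data.List (foldl')

-- One traversal, one strict pair: the line count and the frequency
-- map are built in the same pass, so the program never needs the
-- whole list of lines at once and already-consumed chunks can be
-- collected while the fold is still running.
countAndFreq :: C.ByteString -> (Int, M.Map C.ByteString Int)
countAndFreq = foldl' step (0, M.empty) . C.lines
  where
    step (!n, !m) line = (n + 1, M.insertWith (+) line 1 m)
```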
participants (3)
- Aleksandar Dimitrov
- Heinrich Apfelmus
- Nicolas Pouillard