
Hello list,

I'm currently writing a small linguistic corpus analyzer. My input file is only 25MB, but profiling shows that the total allocation over the program's runtime is several GB. That's a little too much. Adding to that, the program is abysmally slow, so I suspect a space leak somewhere.

I'm using the Data.ByteString.Lazy.Char8 module in order to work efficiently with lazy IO, and I must admit that I'm very inexperienced at predicting the runtime and space behaviour of lazy IO :-( It worked well in the past, but I'm stuck now.

The program can be found here: http://hpaste.org/fastcgi/hpaste.fcgi/view?id=11863#a11863

The important bits are as follows:
mf :: [C.ByteString] -> StdWord
mf [] = Word [] C.empty
mf s  = Word (tail s) (head s)
f' = mf . reverse . C.words
main :: IO ()
main = do
  corpus_name <- liftM head getArgs
  corpus <- liftM (Corpus . (map f') . C.lines) $ C.readFile corpus_name
  print $ length (content corpus)
  let interesting = filterForInterestingTags interestingTags corpus
  print $ show (freqMap interesting)
freqMap also uses foldr' to fold the corpus it is given into a Data.Map, but according to the profiling output, that's not where the bad stuff happens. Here are the interesting parts of the profiling output:
analyse +RTS -p -RTS corpus/bnc-samp.tt
total time  =          30.46 secs  (1523 ticks @ 20 ms)
total alloc = 14,871,619,904 bytes (excludes profiling overheads)
COST CENTRE                 MODULE  no.    entries   %time %alloc   %time %alloc

MAIN                        MAIN    1            0     0.0    0.0   100.0  100.0
 main                       Main    221          0     0.0    0.0     0.0    0.0
 CAF                        Main    206         14     0.1    0.1   100.0  100.0
  showsPrec_aSF             Main    223          1     0.0    0.0     0.0    0.0
  interestingTags           Main    218          1     0.0    0.0     0.0    0.0
  f'                        Main    216          1     0.0    0.0     0.0    0.0
   mf                       Main    217          1     0.0    0.0     0.0    0.0
  main                      Main    212          1     2.4    4.4    99.9   99.9
   f'                       Main    219          0    89.0   91.5    90.2   92.7
    mf                      Main    220    2427450     1.2    1.2     1.2    1.2
   filterForInterestingTags Main    215          1     5.4    0.1     5.4    0.1
   freqMap                  Main    214          1     0.6    0.5     2.0    2.7
    compare_aRL             Main    222    1141996     1.4    2.2     1.4    2.2
Obviously the main cost centre seems to be f' (which I've factored out in order to view it separately as a cost centre). It's not the call to reverse that makes it slow (removing it doesn't affect the run time). Interestingly, the program prints the first statement (the length) very quickly, then takes a lot of time.

I've removed the call to freqMap, and it seems that GHC is smart enough to drop the call to f' completely, because only the length ever gets evaluated. But freqMap also needs the processing from f', and that's where the bad stuff starts happening.

Should I look into DeepSeq? I've tried adding strictness to the functions by hand, and also to the data structures, but that didn't seem to help so far. So I'm looking for solutions to make this faster. I'm guessing that a smart `seq` somewhere might help, but I don't know where :-\

As a side note: how can I find out what the run-time data structures look like in memory? I.e. I want to know *what* exactly stays around as thunks so that I can focus better on where to add strictness or redirect the program flow. Ideally, only a very small part of the file should ever be in memory, with processing happening incrementally!

Thanks for any pointers!
Aleks
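For illustration, here is one place strictness annotations can go. This is a sketch only: it uses String instead of the paste's ByteString, and the shape of StdWord is assumed from the snippet above, not taken from the paste. The strict fields (!) mean that forcing a Word to WHNF also forces its tag list and its form, so the unevaluated (tail s) and (head s) thunks from mf cannot pile up behind the constructor.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Assumed shape of the paste's StdWord, with strict fields added.
-- Forcing a Word now also forces both components to WHNF.
data StdWord = Word ![String] !String
  deriving (Eq, Show)

-- Pattern matching (x:xs) instead of head/tail also avoids building
-- the head/tail thunks in the first place.
mf :: [String] -> StdWord
mf []     = Word [] ""
mf (x:xs) = Word xs x

f' :: String -> StdWord
f' = mf . reverse . words
```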

Excerpts from Aleksandar Dimitrov's message of Thu Nov 05 12:01:11 +0100 2009:
> Hello list,
>
> I'm currently writing a small linguistic corpus analyzer. My input file is only 25MB, but profiling shows that the total allocation over the program's runtime is several GB. That's a little too much. Adding to that, the program is abysmally slow, so I suspect a space leak somewhere.
>
> I'm using the Data.ByteString.Lazy.Char8 module in order to work efficiently with lazy IO, and I must admit that I'm very inexperienced at predicting the runtime and space behaviour of lazy IO :-( It worked well in the past, but I'm stuck now.
>
> The program can be found here: http://hpaste.org/fastcgi/hpaste.fcgi/view?id=11863#a11863
I've added a revision of your code and some highlights. The main one is to suggest foldl' instead of foldr'. However, without the input text, helping you improve this further is harder.

-- 
Nicolas Pouillard
http://nicolaspouillard.fr
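To illustrate the foldl' suggestion, here is a minimal sketch of a strictly built frequency map (not the paste's actual freqMap, whose definition isn't shown in this thread), using the strict API of Data.Map:

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- foldl' forces the accumulator at every step, and Data.Map.Strict
-- forces each count as it is inserted, so the fold runs without
-- leaving a chain of (+) thunks inside the map. foldr' associates
-- the other way and must reach the end of the list before it can
-- start combining, which is the wrong shape for an accumulator.
freqMap :: Ord a => [a] -> M.Map a Int
freqMap = foldl' (\m w -> M.insertWith (+) w 1 m) M.empty
```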

Aleksandar Dimitrov wrote:
> The important bits are as follows:
>
> mf :: [C.ByteString] -> StdWord
> mf [] = Word [] C.empty
> mf s  = Word (tail s) (head s)
>
> f' = mf . reverse . C.words
>
> main :: IO ()
> main = do
>   corpus_name <- liftM head getArgs
>   corpus <- liftM (Corpus . (map f') . C.lines) $ C.readFile corpus_name
>   print $ length (content corpus)
>   let interesting = filterForInterestingTags interestingTags corpus
>   print $ show (freqMap interesting)
>
> [...]
>
> Ideally, only a very small part of the file should ever be in memory, with processing happening incrementally!
The `print $ length (content corpus)` statement seems contradictory to your goal? After all, the whole file is read into the corpus variable in order to calculate its length.

Regards,
apfelmus

-- 
http://apfelmus.nfshost.com
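A sketch of what "incremental" could look like here: a single left fold over the lines that carries both the line count and the frequency map in one strict accumulator, so nothing retains a reference to the head of the lazily read list. (Names are illustrative; the tag filtering from the original program is omitted.)

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy.Char8 as C
import qualified Data.Map.Strict as M
import Data.List (foldl')

-- One traversal, one strict pair: the line count and the frequency
-- map are built in the same pass, so the program never needs the
-- whole list of lines at once and already-consumed chunks can be
-- collected while the fold is still running.
countAndFreq :: C.ByteString -> (Int, M.Map C.ByteString Int)
countAndFreq = foldl' step (0, M.empty) . C.lines
  where
    step (!n, !m) line = (n + 1, M.insertWith (+) line 1 m)
```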
participants (3)
- Aleksandar Dimitrov
- Heinrich Apfelmus
- Nicolas Pouillard