How on Earth Do You Reason about Space?

Dear Cafe,

(Excuse the probably very ranty email; I am, unfortunately, at my wits' end, and I hope that as fellow programmers, you will understand that this is among the most dreadful situations for our kind to be in.)

Say, we have an input file that contains a word per line. I want to find all unigrams (unique words) in that file, and associate with them the amount of times they occurred in the file. This would allow me, for example, to make a list of word frequencies in a given text. Simple enough task. Here's an implementation using iteratees (lazy IO is evil) and unordered-containers' Data.HashMap.Strict, which enforces WHNF in values and keys:
import qualified Data.ByteString.Char8 as S
import qualified Data.Iteratee as I
import Data.Iteratee.IO
import qualified Data.HashMap.Strict as T

import Data.Iteratee.Char
import Data.List (foldl')
import Data.Char (toLower)
import Data.Ord (comparing)
import Data.List (sortBy)
import System.Environment (getArgs)

type Wordcounts = T.HashMap S.ByteString Int

f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
f' = I.joinI $ enumLinesBS (I.liftI $ step T.empty)
  where step t (I.Chunk str) = I.liftI (step $! foldl' maybeIncrement t str)
        step t stream        = I.idone t stream
        maybeIncrement t s
          | s == S.empty = t
          | otherwise    = {-# SCC "m-I-other" #-} T.insertWith (+) s 1 t

main :: IO ()
main = getArgs >>= fileDriverVBuf 65536 f' . head >>= print . prettyList
  where prettyList = -- sortBy (comparing snd) . T.toList
                     T.keys
Some lines are empty, and I don't want them to be recorded, so that's why
maybeIncrement is necessary.
hpaste of this code: http://hpaste.org/47300/spaceleak (ignore convert, that's
yet another issue.)
Now, here's some observations: on a 75M input file (minuscule, compared to what
I actually need) this program will eat 30M of heap space (says profiling) and
return in 14 secs.
I have two problems with that: a) that's too much heap space, b) the actual memory
residency is *much* worse.
ad b) average memory residency is at 38MB (this is OK, given heap consumption)
but max residency is at 130MB, which is unacceptable to me (remember that I need
to run this on files *much* bigger than just 75M.)

On Tue, May 31, 2011 at 06:10:00PM +0200, Aleksandar Dimitrov wrote:
ad a) heap consumption is too high for two reasons: firstly, the actual data I care about is much smaller than the data on the heap. Secondly, about half the heap space is in LAG state. Here are profiles that will illustrate this: http://imgur.com/wBWmJ&XN1mW<YNR.
Pardon me, I forgot to actually add the retainer profile: - http://i.imgur.com/xLO13 - http://imgur.com/wBWmJ&XN1mW<YNR&xLO13l?full is the full gallery.

ad a) heap consumption is too high for two reasons: firstly, the actual data I care about is much smaller than the data on the heap. Secondly, about half the heap space is in LAG state. Here are profiles that will illustrate this: http://imgur.com/wBWmJ&XN1mW<YNR.
- The first image shows 50% of the heap space being gobbled up with data that shouldn't be there anymore (LAG)
- The second image shows the types that are in LAG state: ByteString and HashMap. So, it seems I'm keeping around hash maps?

In Lag/Drag/Void/Use profiling, Lag is actually heap cells that are created too _early_. (Drag are those that are kept for longer than necessary.) Lots of Lag generally means your program is too strict - it is forcing structure long before it needs to. To fix it, you need to make things lazier. My first suspicion would fall on ByteString.

Regards,
Malcolm
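
(For readers who want to reproduce biographical and retainer profiles like the ones linked above: they come from GHC's profiling RTS. Assuming the program from the first mail is saved as wordcount.hs (the file name is made up here), a typical session of that era looks roughly like the following; -auto-all and -caf-all add cost centres, and hp2ps renders the .hp output:

ghc -O2 -prof -auto-all -caf-all -rtsopts -o wordcount wordcount.hs
./wordcount corpus.txt +RTS -hb    -- biographical (Lag/Drag/Void/Use) heap profile
./wordcount corpus.txt +RTS -hr    -- retainer profile
hp2ps -c wordcount.hp              -- turn the .hp file into a PostScript graph

Flag names are from GHC's user's guide of the time; newer GHCs spell -auto-all as -fprof-auto.)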

In Lag/Drag/Void/Use profiling, Lag is actually heap cells that are created too _early_. (Drag are those that are kept for longer than necessary.) Lots of Lag generally means your program is too strict - it is forcing structure long before it needs to. To fix it, you need to make things lazier. My first suspicion would fall on ByteString.
Indeed, thank you, I mixed those up.

I cannot use lazy byte strings here, because of the way Data.Iteratee.Char's enumLinesBS works (it takes strict byte strings.) The only other strictness in there is ($!) and foldl'. The latter is necessary for the program to even run (i.e. not run out of stack space.) The strict application in step's argument seems necessary, since without it, the program consumes 1200 MB of RAM (on my 75MB test data,) and takes very very long.

The hb profile indicates that a lot of data is allocated up front, and then gradually eliminated. Interestingly, removing ($!) here seemed to *introduce* unnecessary strictness. Here's the hb profile without ($!): http://imgur.com/Ex7Pd

I don't understand what is happening here :-\ I only just started using iteratees.

Regards,
Aleks

Hi Aleksandar,
On Tue, May 31, 2011 at 6:10 PM, Aleksandar Dimitrov
Say, we have an input file that contains a word per line. I want to find all unigrams (unique words) in that file, and associate with them the amount of times they occurred in the file. This would allow me, for example, to make a list of word frequencies in a given text.
Here's how I would do it:

{-# LANGUAGE BangPatterns #-}
module Ngram (countUnigrams) where

import qualified Data.ByteString as S
import qualified Data.HashMap.Strict as M
import System.IO

foldLines :: (a -> S.ByteString -> a) -> a -> Handle -> IO a
foldLines f z0 h = go z0
  where
    go !z = do
      eof <- hIsEOF h
      if eof
        then return z
        else do
          line <- S.hGetLine h
          go $ f z line
{-# INLINE foldLines #-}

-- Example use
countUnigrams :: IO (M.HashMap S.ByteString Int)
countUnigrams = foldLines (\ m s -> M.insertWith (+) s 1 m) M.empty stdin
RANT
I have tried and tried again to avoid writing programs in Haskell that would leak space like BP likes to leak oil. However, I have yet to produce a single instance of a program that would do anything at all and at the same time consume less memory than there is actual data in the input file.
It is very disconcerting to me that I seem to be unable, even after quite some practice, to identify space leaks in trivial programs like the above. I know of no good resource to educate myself in that regard. I have read the GHC manual, RWH's chapter on profiling, also "Inside T5"'s recent series on the Haskell heap, but no dice. Even if I can clearly see the exact line where at least some of the leaking happens (as I can in this case,) it seems impossible for me to prevent it.
*thank you very much* for reading this far. This is probably a mostly useless email anyhow, I just had to get it off my chest. Maybe, just maybe, someone among you will have a crucial insight that will save Haskell for me :-) But currently, I see no justification to not start my next project in Lua, Python or Java. Sure, Haskell's code is pretty, and it's fun, but if I can't actually *run* it, why bother? (Yes, this isn't the first time I've run into this problem …)
We definitely need more accessible material on how to reliably write fast Haskell code. There are those among us who can, but it shouldn't be necessary to learn it the way they did (i.e. by lots of tinkering, learning from the elders, etc). I'd like to write a 60 (or so) page tutorial on the subject, but haven't found the time. In addition to RWH, perhaps the slides from the talk on high-performance Haskell I gave could be useful: http://blog.johantibell.com/2010/09/slides-from-my-high-performance-haskell....

Cheers,
Johan

Hi Johan,
Here's how I would do it:
I implemented your method, with these minimal changes (i.e. just using a main driver in the same file.)
countUnigrams :: Handle -> IO (M.Map S.ByteString Int)
countUnigrams = foldLines (\ m s -> M.insertWith (+) s 1 m) M.empty

main :: IO ()
main = do
  (f:_) <- getArgs
  openFile f ReadMode >>= countUnigrams >>= print . M.toList
It seems to perform about 3x worse than the iteratee method in terms of time, and worse in terms of space :-( On Brandon's War & Peace example, hGetLine uses 1.565 seconds for the small file, whereas my iteratee method uses 1.085s for the small file, and around 2 minutes for the large file. For the large file, the code above starts consuming around 2.5GB of RAM, so it clearly has a space leak somewhere. Where, I don't know.

If you want to try it out, here's a short command line to make a test corpus the way Brandon made one:

+++
wget 'http://www.gutenberg.org/files/2600/2600.zip'; unzip 2600.zip; touch wnp100.txt; for i in {1..100}; do echo -n "$i "; cat 2600.txt >> wnp100.txt; done; echo "Done."
+++

Note that, as I detailed in my prior email to Brandon, even if you do end up with a (supposedly) non-leaking program for this example corpus, that doesn't mean it'll scale well to real world data.

I also tried sprinkling strictness annotations throughout your above code, but I failed to produce good results :-(
We definitely need more accessible material on how to reliably write fast Haskell code. There are those among us who can, but it shouldn't be necessary to learn it the way they did (i.e. by lots of tinkering, learning from the elders, etc). I'd like to write a 60 (or so) page tutorial on the subject, but haven't found the time.
I'd be an eager reader :-) Please do announce it on -cafe or the "usual places" should you ever come around to writing it! I, unfortunately, don't really have any contact to "the elders," apart from what I read on their respective blogs…
In addition to RWH, perhaps the slides from the talk on high-performance Haskell I gave could be useful:
http://blog.johantibell.com/2010/09/slides-from-my-high-performance-haskell....
Thanks, I'll give it a look later tomorrow! Regards, Aleks PS: Sorry I didn't answer you in #haskell, I ended up having to go afk for a short while. Thanks for all your help!

Hi Aleks,
On Wed, Jun 1, 2011 at 12:14 AM, Aleksandar Dimitrov
I implemented your method, with these minimal changes (i.e. just using a main driver in the same file.)
countUnigrams :: Handle -> IO (M.Map S.ByteString Int)
countUnigrams = foldLines (\ m s -> M.insertWith (+) s 1 m) M.empty

main :: IO ()
main = do
  (f:_) <- getArgs
  openFile f ReadMode >>= countUnigrams >>= print . M.toList
It seems to perform about 3x worse than the iteratee method in terms of time, and worse in terms of space :-( On Brandon's War & Peace example, hGetLine uses 1.565 seconds for the small file, whereas my iteratee method uses 1.085s for the small file, and around 2 minutes for the large file.
That's curious. I chatted with Duncan Coutts today and he mentioned that hGetLine can be a bit slow, as it needs to take a lock on every read and causes some copying, which could explain why it's slower than iteratee, which works in blocks. However, I don't understand why it uses more memory. The ByteStrings returned by hGetLine should have an underlying storage of the same size as the ByteString itself (as reported by length). You can try to verify this by calling 'copy' on the ByteString before inserting it. It looks like hGetLine needs some love.
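
For concreteness, a minimal sketch of that experiment, reusing foldLines from the Ngram module in the earlier mail (S.copy is Data.ByteString's copy; it allocates a fresh buffer holding only the line's bytes):

-- To be added to the Ngram module above, so foldLines, S = Data.ByteString
-- and M = Data.HashMap.Strict are in scope. The only change from
-- countUnigrams is copying each line before using it as a key, so the key
-- cannot keep a larger shared buffer from the underlying read alive.
countUnigramsCopy :: Handle -> IO (M.HashMap S.ByteString Int)
countUnigramsCopy = foldLines (\m s -> M.insertWith (+) (S.copy s) 1 m) M.empty

If this version shows noticeably lower residency than the original, the extra memory was ByteString keys retaining larger buffers.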
I also tried sprinkling strictness annotations throughout your above code, but I failed to produce good results :-(
The strictness of the code I gave should be correct. The problem should be elsewhere.
I, unfortunately, don't really have any contact to "the elders," apart from what I read on their respective blogs…
You and everyone else. :) I just spent enough time talking to people on IRC, reading good code, blogs and mailing list posts. I think Bryan described the process pretty well in his CUFP keynote: http://www.serpentine.com/blog/2009/09/23/video-of-my-cufp-keynote/ Cheers, Johan

Since frequency counts are an important use of map-like data structures, I did a brief test of the available options. First using regular strings for input, and Data.Map.fromListWith - i.e. the operational bit being:

freqs :: [String] -> M.Map String Int
freqs = M.fromListWith (+) . map (,1)

This runs on a 14M corpus consisting of the King James Bible, the collected works of Shakespeare, and War and Peace.

./freqstr1 +RTS -s
  5,093,386,368 bytes allocated in the heap
  2,682,667,904 bytes copied during GC
    261,110,368 bytes maximum residency (20 sample(s))
      9,018,000 bytes maximum slop
            623 MB total memory in use (10 MB lost due to fragmentation)
./freqstr1 +RTS -s  21.43s user 0.78s system 99% cpu 22.285 total

Kinda expensive, 250MB to store word frequencies of 14MB text. Now, changing to

freqs :: [String] -> M.Map String Int
freqs = foldl' (\m w -> M.insertWith' (+) w 1 m) M.empty

i.e. using strict insertion, avoiding the buildup of lazy thunks for the counts.

./freqstr2 +RTS -s   -- strings, using strict insertion
  4,754,110,096 bytes allocated in the heap
  2,089,527,240 bytes copied during GC
     27,039,112 bytes maximum residency (66 sample(s))
        613,192 bytes maximum slop
             80 MB total memory in use (2 MB lost due to fragmentation)
./freqstr2 +RTS -s  17.48s user 0.13s system 99% cpu 17.665 total

This reduced maximum memory consumption to one tenth, still bigger than the input corpus, but clearly not too bad. A bit faster, too, in spite of probably doing more work.

Using ByteStrings instead, first fromListWith:

./freq +RTS -s
(Just 77432,113931)
  3,880,059,568 bytes allocated in the heap
  1,467,507,808 bytes copied during GC
    174,573,752 bytes maximum residency (14 sample(s))
      8,222,600 bytes maximum slop
            385 MB total memory in use (6 MB lost due to fragmentation)
./freq +RTS -s  14.26s user 0.49s system 99% cpu 14.798 total

About half the memory of Strings, and 25% faster.

With strict insertion:

./freq2 +RTS -s   -- map using strict insertion
  3,761,614,312 bytes allocated in the heap
    849,806,000 bytes copied during GC
     23,950,328 bytes maximum residency (35 sample(s))
      2,376,904 bytes maximum slop
             58 MB total memory in use (1 MB lost due to fragmentation)
./freq2 +RTS -s  11.14s user 0.13s system 99% cpu 11.295 total

Parallel to the String case, this is a lot more frugal with memory, and 30% faster.

Now, I tried Data.HashMap from the hashmap library:

./freqH1 +RTS -s   -- hashmap using fromListWith
  4,552,922,784 bytes allocated in the heap
  2,990,287,536 bytes copied during GC
    401,247,912 bytes maximum residency (14 sample(s))
     42,098,016 bytes maximum slop
            957 MB total memory in use (15 MB lost due to fragmentation)
./freqH1 +RTS -s  15.68s user 1.53s system 99% cpu 17.277 total

./freqH2 +RTS -s   -- hashmap using foldl' insertWith
  4,518,146,968 bytes allocated in the heap
  2,986,973,352 bytes copied during GC
    394,502,832 bytes maximum residency (14 sample(s))
     41,020,040 bytes maximum slop
            957 MB total memory in use (15 MB lost due to fragmentation)
./freqH2 +RTS -s  15.86s user 1.62s system 99% cpu 17.537 total

HashMap doesn't provide a strict insertWith, so this is similar to the lazy insertions above. A bit worse, actually, probably due to the overhead of hashing.

Then, I discovered that Johan's hashmap is a different library, and thought I'd try that too for completeness.

./freqHS +RTS -s   -- hashmap strict (unordered-containers)
  2,628,628,752 bytes allocated in the heap
    945,571,872 bytes copied during GC
     26,635,744 bytes maximum residency (32 sample(s))
      2,433,504 bytes maximum slop
             66 MB total memory in use (1 MB lost due to fragmentation)
./freqHS +RTS -s  6.90s user 0.16s system 99% cpu 7.082 total

Memory residency like the other strict versions, but really fast, probably due to faster comparisons of hash values vs comparisons of strings.

Conclusion: make sure you are using a strict map, and if your keys are strings or otherwise have expensive comparisons, unordered-containers is the library for you.

-k

PS: I also tried mapping 'copy' on the input words to avoid storing large slices of the input, but it only worsened things:

./freqHS3 +RTS -s
(Just 77432,113931)
  3,109,585,024 bytes allocated in the heap
    936,724,184 bytes copied during GC
     87,831,888 bytes maximum residency (19 sample(s))
      8,835,440 bytes maximum slop
            164 MB total memory in use (3 MB lost due to fragmentation)
./freqHS3 +RTS -s  12.71s user 0.31s system 99% cpu 13.060 total

Perhaps if you managed to only copy new words it would look better?

PPS: I tried to be careful juggling the results around, but there's always the possibility of a mistake. Caveat lector! (Or should that be 'cave scriptor'?)

PPPS: There are some small interface annoyances around, it'd be nice if I could experiment with this by only changing the imports. Would it be possible to agree on an interface to map-like structures?

-- 
If I haven't seen further, it is by standing in the footprints of giants
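
Regarding the PPPS: as a sketch of what such a shared interface could look like (a hypothetical type class and names, not an existing library API), here is just enough to write the frequency-count fold once and instantiate it with either containers or unordered-containers. It uses today's Data.Map.Strict; with 2011-era containers you would use Data.Map and insertWith' instead.

{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances #-}
module FreqInterface where

import           Data.List (foldl')
import qualified Data.Map.Strict as M       -- containers
import qualified Data.HashMap.Strict as H   -- unordered-containers
import           Data.Hashable (Hashable)

-- Hypothetical interface: just enough operations for a frequency count.
class MapLike m k where
  emptyM      :: m k Int
  insertWithM :: (Int -> Int -> Int) -> k -> Int -> m k Int -> m k Int
  toListM     :: m k Int -> [(k, Int)]

instance Ord k => MapLike M.Map k where
  emptyM      = M.empty
  insertWithM = M.insertWith
  toListM     = M.toList

instance (Eq k, Hashable k) => MapLike H.HashMap k where
  emptyM      = H.empty
  insertWithM = H.insertWith
  toListM     = H.toList

-- The benchmarked fold, written once; the map implementation is chosen
-- by the caller's type annotation.
freqsWith :: MapLike m k => [k] -> m k Int
freqsWith = foldl' (\m w -> insertWithM (+) w 1 m) emptyM

A call site picks the implementation with an annotation, e.g. freqsWith ws :: M.Map String Int, or the HashMap instance for ByteString keys (hashable provides that instance).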

By the way, what is the advantage of using iteratees here? For my testing, I just used:

main = printit . freqs . B.words =<< B.readFile "words"

(where 'printit' writes some data to stdout just to make sure stuff is evaluated, and you've already seen some 'freqs' examples)

I have a bunch of old code, parsers etc, which are based on the 'readFile' paradigm:

type Str = Data.ByteString.Lazy.Char8.ByteString -- usually

decodeFoo :: Str -> Foo
encodeFoo :: Foo -> Str

readFoo = decodeFoo . readFile
writeFoo f = writeFile f . encodeFoo
hReadFoo = decodeFoo . hRead
  : (etc)

This works pretty well, as long as Foo is strict enough that you don't retain all or huge parts of input, and as long as you can process input in a forward, linear fashion. And, like my frequency count above, I can't really see how this can be made much simpler.

I haven't used iteratees or enumerators in anger, but it appears to me that they are most useful when the input is unpredictable or needs to be controlled in some way - for instance, when recv() can return blocks of data that may be too little or too much.

Would there be any great advantage to rewriting my stuff to use iterators? Or at least, use iterators for new stuff? As I see it, iterators are complex and the dust is only starting to settle on implementations and interfaces, and will introduce more dependencies. So my instinct is to stick with the worse-is-better approach, but I'm willing to be educated.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants

Hi Ketil,
By the way, what is the advantage of using iteratees here? For my testing, I just used:
My initial move to iteratees was more a clutch call I made when I was still using bytestring-trie, and was having immense memory consumption problems. bytestring-trie uses strict byte strings as an index, and since I was getting only lazy byte strings, the only way to make them strict would be to use (S.concat . L.toChunks) (L and S being the lazy/strict byte string imports,) which felt *wrong*.

In short, I thought iteratee would give me enough magic fairy dust to actually have decent control over how much data I'm holding in RAM at any given point — that was not the case, since I didn't know about the pointer mechanics of strict ByteStrings and hence was oblivious to the bad impact that would have on garbage collection performance.

Even so, I think I can still justify using iteratees in the current design:

a) I don't like lazy IO (conceptually,)
b) I'm gonna write a left-fold somewhere anyway, might as well use a decent infrastructure for it
c) I can strictly control the chunk size, and I'm not going to have any bad effects with accidental eager evaluation somewhere down the pipe.

c) being the only "legitimate" reason (though the reason for a) is c) ) — adjusting the chunk size might actually yield noticeable performance differences when reading through files that are well into the realm of gigabytes. And the chunk size "limit" will protect me from an accidental strict fold or so that would leave me with a 4GB file in memory.

About a): Lazy IO just doesn't "feel" right for me. I want my pure computations to actually be pure. If I put a ' on one of my functions *within* my pure code, this might have *side effects* — now, instead of reading in only part of the file, this will demand the *whole* file, and that is *quite* a side effect! So, suddenly I have to worry about side effects in my pure code. Ugh.

That's why I'm going to continue using iteratees. I don't know if that's the right justification, but it's a "hey, it works for me!" justification I can comfortably live with. Besides, I don't think the iteratee interface is all that opaque. I found arrows in HXT, for example, much more difficult to deal with conceptually. (That said, I'm still using HDBC over Takusen, because the latter's API just didn't make sense to me.)

Regards,
Aleks
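
(To illustrate the chunk-size point with the driver from the first mail: bumping the buffer from 64K to, say, an arbitrary 4MB is just

main = getArgs >>= fileDriverVBuf (4*1024*1024) f' . head >>= print . T.keys

with everything else unchanged.)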

On 6/2/11 8:59 AM, Aleksandar Dimitrov wrote:
Hi Ketil,
By the way, what is the advantage of using iteratees here? For my testing, I just used:
My initial move to iteratees was more a clutch call I made when I was still using bytestring-trie, and was having immense memory consumption problems.
bytestring-trie also (intentionally) uses ByteString slicing in order to minimize copying. It does so semi-intelligently--- ensuring that of the two sharing options it chooses the one with a shorter "spur". About half the time that spur will be used by continuing down the trie, but you can still end up with unwanted overhead especially if you're intermittently removing keys from the trie. I've been meaning to add functions to remove spurs (because sometimes the memory is more important than the running time) and meaning to add various other upgrades. I've just been too busy with other code for the last year or so. -- Live well, ~wren

At Thu, 02 Jun 2011 13:52:52 +0200, Ketil Malde wrote:
I have a bunch of old code, parsers etc, which are based on the 'readFile' paradigm:
type Str = Data.ByteString.Lazy.Char8.ByteString -- usually
decodeFoo :: Str -> Foo
encodeFoo :: Foo -> Str

readFoo = decodeFoo . readFile
writeFoo f = writeFile f . encodeFoo
hReadFoo = decodeFoo . hRead
  : (etc)
This works pretty well, as long as Foo is strict enough that you don't retain all or huge parts of input, and as long as you can process input in a forward, linear fashion. And, like my frequency count above, I can't really see how this can be made much simpler.
This is fine if you never have parse errors and always read to the end of the file. Otherwise, the code above is incorrect and ends up leaking file descriptors. In general, it is very hard to write parsers that parse every possible input and never fail. Thus, for anything other than a toy program, your code actually has to be:

readFoo path = bracket (openFile path ReadMode) hClose $
                 hGetContents >=> (\s -> return $! decodeFoo s)

Which is still not guaranteed to work if Foo contains thunks, so then you end up having to write:

readFoo path = bracket (openFile path ReadMode) hClose $ \h -> do
  s <- hGetContents h
  let foo = decodeFoo s
  deepseq foo $ return foo

Or, finally, what a lot of code falls back to, inserting gratuitous calls to length:

readFoo path = bracket (openFile path ReadMode) hClose $ \h -> do
  s <- hGetContents h
  length s `seq` return (decodeFoo s)

The equivalent code with the iterIO package would be:

readFoo path = enumFile path |$ fooI

which seems a lot simpler to me...
Would there be any great advantage to rewriting my stuff to use iterators? Or at least, use iterators for new stuff?
In addition to avoiding edge cases like leaked file descriptors and memory, one of the things I discovered in implementing iterIO is that it's really handy to have your I/O functions be the same as your parsing combinators. So iteratees might actually admit a far simpler implementation of decodeFoo/fooI.

More specifically, imagine that you have decodeFoo, and now want to implement decodeBar where a Bar includes some Foos. Unfortunately, having an implementation of decodeFoo in-hand doesn't help you implement decodeBar. You'd have to re-write your function to return residual input, maybe something like:

decodeFooReal :: String -> (Foo, String)

decodeFoo :: String -> Foo
decodeFoo = fst . decodeFooReal

and now you implement decodeBar in terms of decodeFooReal, but you have to pass around residual input explicitly, handle parsing failures explicitly, etc.
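
To make the residual-input point concrete, here is roughly what the explicit threading looks like (Foo, Bar and the decoder bodies below are placeholders following the sketch above, not code from any real library):

data Foo = Foo
data Bar = Bar Foo Foo

-- Placeholder "real" decoder that returns the residual input.
decodeFooReal :: String -> (Foo, String)
decodeFooReal s = (Foo, s)

-- Without a monad, every caller threads the leftovers by hand
-- (and would also have to propagate failures by hand):
decodeBarReal :: String -> (Bar, String)
decodeBarReal s0 =
  let (foo1, s1) = decodeFooReal s0
      (foo2, s2) = decodeFooReal s1
  in  (Bar foo1 foo2, s2)

-- In an iteratee/parser monad the same composition is just
--   barI = Bar <$> fooI <*> fooI
-- with residual input and failure handled by the monad.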
As I see it, iterators are complex and the dust is only starting to settle on implementations and interfaces, and will introduce more dependencies. So my instinct is to stick with the worse-is-better approach, but I'm willing to be educated.
I fully agree with the point about dependencies and waiting for the dust to settle, though I hope a lot of that changes in a year or so. However, iterIO should already significantly reduce the complexity. David

dm-list-haskell-cafe@scs.stanford.edu writes:
leaking file descriptors
...until they are garbage collected. I tend to consider the OS fd limitation an OS design error - I've no idea why there should be some arbitrary limit on open files, as long as there is plenty of memory around to store them. But, well, yes, it is a real concern.
parsers that parse every possible input and never fail.
I guess I need to look into how iteratees handle parse failure. Generally, for me a parse failure means program failure - either the data is corrupt, or the program is incorrect.
Thus, for anything other than a toy program, your code actually has to be:
readFoo path = bracket (openFile path ReadMode) hClose $
                 hGetContents >=> (\s -> return $! decodeFoo s)
No, I can't do that in general, because I want to process a Foo (which typically is or contains a list of records) incrementally. I can't assume the file or its data are small enough to fit in memory. It is important that readFoo returns a structure that can be consumed lazily - or perhaps it can be iteratee all the way up.
Which is still not guaranteed to work if Foo contains thunks, so then you end up having to write:
readFoo path = bracket (openFile path ReadMode) hClose $ \h -> do
  s <- hGetContents h
  let foo = decodeFoo s
  deepseq foo $ return foo
I think this - or rather, having Foo's records be strict - is a good idea anyway. The previous discussion about frequency counts seems to indicate that this goes equally well for iteratees. Thanks for the elaborate answer. -k -- If I haven't seen further, it is by standing in the footprints of giants

On Fri, Jun 3, 2011 at 03:53, Ketil Malde
dm-list-haskell-cafe@scs.stanford.edu writes:
leaking file descriptors
...until they are garbage collected. I tend to consider the OS fd limitation an OS design error - I've no idea why there should be some arbitrary limit on open files, as long as there is plenty of memory around to store them. But, well, yes, it is a real concern.
In the case of Unix, available memory was indeed the motivating factor. The DEC minicomputers it was developed on didn't have a whole lot of memory, plus older Unix preallocated the per-process file structures as part of the (struct proc) for speed (again, old slow systems). The modern reason for limits is mostly to avoid runaway processes. Usually the hard limit is set pretty high but the soft limit is lower.

Hello Aleksandar, It is possible that the iteratees library is space leaking; I recall some recent discussion to this effect. Your example seems simple enough that you might recompile with a version of iteratees that has -auto-all enabled. Unfortunately, it's not really a safe bet to assume your libraries are leak free, and if you've pinpointed it down to a single line, and there doesn't seem a way to squash the leak, I'd bet it's the library's fault. Edward

On Tue, May 31, 2011 at 02:13:14PM -0400, Edward Z. Yang wrote:
It is possible that the iteratees library is space leaking; I recall some recent discussion to this effect. Your example seems simple enough that you might recompile with a version of iteratees that has -auto-all enabled.
If I understand you correctly, you imply that I should try compiling iteratee with profiling, no? I did install the iteratee library with profiling support (I have the cabal profiling flag globally set in my cabal config,) but my profiles so far seem to be blaming LAGging ByteStrings and HashMaps. I, unfortunately, do not know how I would test iteratee itself for a space leak here.
Unfortunately, it's not really a safe bet to assume your libraries are leak free, and if you've pinpointed it down to a single line, and there doesn't seem a way to squash the leak, I'd bet it's the library's fault.
Since my knowledge of Haskell, and, in particular, high-performance Haskell, is very lacking, my current m.o. is to blame myself :-) It might be iteratee, but unfortunately, I have not found something that gives me better performance than iteratee yet. Regards, Aleks

I can't reproduce heap usage growing with the size of the input file.

I made a word list from Project Gutenberg's copy of "War and Peace" by

tr -sc '[[:alpha:]]' '\n' < pg2600.txt > words.txt

Using 1, 25, or 1000 repetitions of this ~3MB wordlist shows about 100MB of address space used according to top, and no more than 5MB or so of haskell heap used according to the memory profile, with a flat memory profile.

Is your memory usage growing with the size of the input file, or the size of the histogram?

I was worried data sharing might mean your keys retain entire 64K chunks of the input. However, it seems enumLines depends on the StringLike ByteString instance, which just converts to and from String. That can't be efficient, but I suppose it avoids excessive sharing.

The other thing that occurs to me is that the total size of your keys would also be approximately the size of the input file if you were using plain text without each word split onto a separate line.

Brandon

On Tue, May 31, 2011 at 11:43:27AM -0700, Brandon Moore wrote:
I can't reproduce heap usage growing with the size of the input file.
I made a word list from Project Gutenberg's copy of "War and Peace" by
tr -sc '[[:alpha:]]' '\n' < pg2600.txt > words.txt
Using 1, 25, or 1000 repetitions of this ~3MB wordlist shows about 100MB of address space used according to top, and no more than 5MB or so of haskell heap used according to the memory profile, with a flat memory profile.
This will lead to very small variance in your data. The effect I'm experiencing is so small in this case, that it's barely observable (but it is, see below.)
Is your memory usage growing with the size of the input file, or the size of the histogram?
While the histogram is naturally growing with the size of the input file, memory
usage seems to be proportional mainly to the histogram. It is clear that, due to
the effect of the Long Tail, the histogram is going to constantly grow in a real
setting, as opposed to just replicating the same data. In your test case, the
histogram is *not* growing with the size of the input file.
The memory usage is proportional to the histogram, which is proportional to the
file size. That is not the problem. The problem is that, compared to the size
of the histogram, the memory consumption is *inadequately* high. Here's some
more data, using your Tolstoi example:
du file.txt
344M file.txt
I was worried data sharing might mean your keys retain entire 64K chunks of the input. However, it seems enumLines depends on the StringLike ByteString instance, which just converts to and from String.
Ouch, that sounds like something worth fixing.
The other thing that occurs to me is that the total size of your keys would also be approximately the size of the input file if you were using plain text without each word split onto a separate line.
Well, I am not. The corpus is a word-per-line corpus, I'm reading a word per line, and adding that to my hash map. This should never result in a data structure even close to the size of the original corpus. It could be, in a very unrealistic worst case scenario. But even a corpus of 30GB of Poe and Heidegger isn't going to make that happen. Furthermore, mine is not such a scenario at all. As I said, if you reduce the corpus to a set of words (i.e. a set of unigrams) you get a 40MB file from a 1.4GB corpus.

Why is it that, in order to create that 40MB file from a 1.4GB corpus, my trivial little program needs somewhere north of 6-8 GB of RAM? In this trivial example for War and Peace, why is it that, in order to create the unigram table for War and Peace, which ends up being a mere 201KB, we're chomping through 5MB on average, and nearly 10MB max? That's at least 25 times more than we actually *should* have (yes, I know that the RTS is somewhere there, too, but I think it's not a problem to ignore that for now.)

Regards,
Aleks

Wait, do ByteStrings show up on a heap profile, if the space is allocated with malloc?

Anyway, I think my tests still show that the memory used by the process doesn't grow simply by adding more data, if you are no longer adding keys to the map.

Aleksandar Dimitrov
Now, here's some observations: on a 75M input file (minuscule, compared to what I actually need) this program will eat 30M of heap space (says profiling) and return in 14 secs.
I have two problems with that: a) that's too much heap space, b) the actual memory residency is *much* worse.
30M isn't a lot these days. How does it scale?
ad b) average memory residency is at 38MB (this is OK, given heap consumption) but max residency is at 130MB, which is unacceptable to me (remember that I need to run this on files *much* bigger than just 75M.)
I think max residency (depending on how you measure) can be twice the heap size due to using copying GC. If you run short of memory, the runtime will switch to compacting GC which will be slower but use less memory.
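
(Concretely, assuming the binary is built with -rtsopts and is called wordcount (a made-up name), the relevant RTS knobs would be -M, which caps the heap and enables compaction as the limit is approached, and -c, which requests the compacting collector for the oldest generation outright:

./wordcount corpus.txt +RTS -s -M1g
./wordcount corpus.txt +RTS -s -c

Flag names are from GHC's RTS documentation; check the users' guide for the exact behaviour of your GHC version.)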
I have tried and tried again to avoid writing programs in Haskell that would leak space like BP likes to leak oil.
I know the feeling. I think making a frequency table ought to be as simple as

... M.fromListWith (+) . map (,1) . words =<< readFile ...

or at worst

... foldl' (\m w -> M.insertWith' (+) w 1 m) M.empty . words =<< readFile ...

(which evaluates things strictly, and at least in my small tests, seems to use quite a bit less heap space).

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
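
Pulling the thread's suggestions together (a strict HashMap for cheap key comparisons, a strict left fold over lines, skipping empty lines as in the original program, and optionally copying keys so they don't retain input buffers), a minimal complete version might look like the sketch below. This is a synthesis, not code posted in the thread, and as Ketil's PS shows, the copy may help or hurt depending on the input.

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Char8 as S
import qualified Data.HashMap.Strict as M
import           System.Environment (getArgs)
import           System.IO

-- Strict left fold over the lines of a handle, as in Johan's mail.
foldLines :: (a -> S.ByteString -> a) -> a -> Handle -> IO a
foldLines f z0 h = go z0
  where
    go !z = do
      eof <- hIsEOF h
      if eof
        then return z
        else do
          line <- S.hGetLine h
          go (f z line)

-- Count unigrams; copy each key so it cannot retain a larger input buffer,
-- and skip empty lines as the original maybeIncrement did.
countUnigrams :: Handle -> IO (M.HashMap S.ByteString Int)
countUnigrams = foldLines step M.empty
  where
    step m s
      | S.null s  = m
      | otherwise = M.insertWith (+) (S.copy s) 1 m

main :: IO ()
main = do
  (path:_) <- getArgs
  withFile path ReadMode $ \h -> countUnigrams h >>= print . M.toList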
participants (9)
- Aleksandar Dimitrov
- Brandon Allbery
- Brandon Moore
- dm-list-haskell-cafe@scs.stanford.edu
- Edward Z. Yang
- Johan Tibell
- Ketil Malde
- malcolm.wallace
- wren ng thornton