How on Earth Do You Reason about Space?

Dear Cafe,

(Excuse the probably very ranty email; I am, unfortunately, at my wits' end, and I hope that as fellow programmers, you will understand that this is among the most dreadful situations for our kind to be in.)

Say, we have an input file that contains a word per line. I want to find all unigrams (unique words) in that file, and associate with them the amount of times they occurred in the file. This would allow me, for example, to make a list of word frequencies in a given text. Simple enough task. Here's an implementation using iteratees (lazy IO is evil) and unordered-containers' Data.HashMap.Strict, which enforces WHNF in values and keys:
import qualified Data.ByteString.Char8 as S
import qualified Data.Iteratee as I
import Data.Iteratee.IO
import qualified Data.HashMap.Strict as T

import Data.Iteratee.Char
import Data.List (foldl')
import Data.Char (toLower)
import Data.Ord (comparing)
import Data.List (sortBy)
import System.Environment (getArgs)

type Wordcounts = T.HashMap S.ByteString Int

f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
f' = I.joinI $ enumLinesBS (I.liftI $ step T.empty)
  where step t (I.Chunk str) = I.liftI (step $! foldl' maybeIncrement t str)
        step t stream        = I.idone t stream
        maybeIncrement t s
          | s == S.empty = t
          | otherwise    = {-# SCC "m-I-other" #-} T.insertWith (+) s 1 t

main :: IO ()
main = getArgs >>= fileDriverVBuf 65536 f' . head >>= print . prettyList
  where prettyList = -- sortBy (comparing snd) . T.toList
                     T.keys
Some lines are empty, and I don't want them to be recorded, so that's why
maybeIncrement is necessary.
hpaste of this code: http://hpaste.org/47300/spaceleak (ignore convert, that's
yet another issue.)
Now, here's some observations: on a 75M input file (minuscule, compared to what
I actually need) this program will eat 30M of heap space (says profiling) and
return in 14 secs.
I have two problems with that: a) that's too much heap space, b) the actual memory
residency is *much* worse.
ad b) average memory residency is at 38MB (this is OK, given heap consumption)
but max residency is at 130MB, which is unacceptable to me (remember that I need
to run this on files *much* bigger than just 75M.)

On Tue, May 31, 2011 at 06:10:00PM +0200, Aleksandar Dimitrov wrote:
ad a) heap consumption is too high for two reasons: firstly, the actual data I care about is much smaller than the data on the heap. Secondly, about half the heap space is in LAG state. Here are profiles that will illustrate this: http://imgur.com/wBWmJ&XN1mW<YNR.
Pardon me, I forgot to actually add the retainer profile: - http://i.imgur.com/xLO13 - http://imgur.com/wBWmJ&XN1mW<YNR&xLO13l?full is the full gallery.

ad a) heap consumption is too high for two reasons: firstly, the actual data I care about is much smaller than the data on the heap. Secondly, about half the heap space is in LAG state. Here are profiles that will illustrate this: http://imgur.com/wBWmJ&XN1mW<YNR.
- The first image shows 50% of the heap space being gobbled up with data that shouldn't be there anymore (LAG)
- The second image shows the types that are in LAG state: ByteString and HashMap. So, it seems I'm keeping around hash maps?

In Lag/Drag/Void/Use profiling, Lag is actually heap cells that are created too _early_. (Drag are those that are kept for longer than necessary.) Lots of Lag generally means your program is too strict - it is forcing structure long before it needs to. To fix it, you need to make things lazier. My first suspicion would fall on ByteString.

Regards,
Malcolm
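
(For readers who want to reproduce biographical and retainer profiles like the ones linked above: they come from GHC's profiling RTS. Assuming the program from the first mail is saved as wordcount.hs (the file name is made up here), a typical session of that era looks roughly like the following; -auto-all and -caf-all add cost centres, and hp2ps renders the .hp output:

ghc -O2 -prof -auto-all -caf-all -rtsopts -o wordcount wordcount.hs
./wordcount corpus.txt +RTS -hb    -- biographical (Lag/Drag/Void/Use) heap profile
./wordcount corpus.txt +RTS -hr    -- retainer profile
hp2ps -c wordcount.hp              -- turn the .hp file into a PostScript graph

Flag names are from GHC's user's guide of the time; newer GHCs spell -auto-all as -fprof-auto.)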

In Lag/Drag/Void/Use profiling, Lag is actually heap cells that are created too _early_. (Drag are those that are kept for longer than necessary.) Lots of Lag generally means your program is too strict - it is forcing structure long before it needs to. To fix it, you need to make things lazier. My first suspicion would fall on ByteString.
Indeed, thank you, I mixed those up.

I cannot use lazy byte strings here, because of the way Data.Iteratee.Char's enumLinesBS works (it takes strict byte strings.) The only other strictness in there is ($!) and foldl'. The latter is necessary for the program to even run (i.e. not run out of stack space.) The strict application in step's argument seems necessary, since without it, the program consumes 1200 MB of RAM (on my 75MB test data,) and takes very very long.

The hb profile indicates that a lot of data is allocated up front, and then gradually eliminated. Interestingly, removing ($!) here seemed to *introduce* unnecessary strictness. Here's the hb profile without ($!): http://imgur.com/Ex7Pd

I don't understand what is happening here :-\ I only just started using iteratees.

Regards,
Aleks

Hi Aleksandar,
On Tue, May 31, 2011 at 6:10 PM, Aleksandar Dimitrov
Say, we have an input file that contains a word per line. I want to find all unigrams (unique words) in that file, and associate with them the amount of times they occurred in the file. This would allow me, for example, to make a list of word frequencies in a given text.
Here's how I would do it:

{-# LANGUAGE BangPatterns #-}
module Ngram (countUnigrams) where

import qualified Data.ByteString as S
import qualified Data.HashMap.Strict as M
import System.IO

foldLines :: (a -> S.ByteString -> a) -> a -> Handle -> IO a
foldLines f z0 h = go z0
  where
    go !z = do
      eof <- hIsEOF h
      if eof
        then return z
        else do
          line <- S.hGetLine h
          go $ f z line
{-# INLINE foldLines #-}

-- Example use
countUnigrams :: IO (M.HashMap S.ByteString Int)
countUnigrams = foldLines (\ m s -> M.insertWith (+) s 1 m) M.empty stdin
RANT
I have tried and tried again to avoid writing programs in Haskell that would leak space like BP likes to leak oil. However, I have yet to produce a single instance of a program that would do anything at all and at the same time consume less memory than there is actual data in the input file.
It is very disconcerting to me that I seem to be unable, even after quite some practice, to identify space leaks in trivial programs like the above. I know of no good resource to educate myself in that regard. I have read the GHC manual, RWH's chapter on profiling, also "Inside T5"'s recent series on the Haskell heap, but no dice. Even if I can clearly see the exact line where at least some of the leaking happens (as I can in this case,) it seems impossible for me to prevent it.
*thank you very much* for reading this far. This is probably a mostly useless email anyhow, I just had to get it off my chest. Maybe, just maybe, someone among you will have a crucial insight that will save Haskell for me :-) But currently, I see no justification to not start my next project in Lua, Python or Java. Sure, Haskell's code is pretty, and it's fun, but if I can't actually *run* it, why bother? (Yes, this isn't the first time I've run into this problem …)
We definitely need more accessible material on how to reliably write fast Haskell code. There are those among us who can, but it shouldn't be necessary to learn it the way they did (i.e. by lots of tinkering, learning from the elders, etc). I'd like to write a 60 (or so) page tutorial on the subject, but haven't found the time. In addition to RWH, perhaps the slides from the talk on high-performance Haskell I gave could be useful: http://blog.johantibell.com/2010/09/slides-from-my-high-performance-haskell....

Cheers,
Johan

Hi Johan,
Here's how I would do it:
I implemented your method, with these minimal changes (i.e. just using a main driver in the same file.)
countUnigrams :: Handle -> IO (M.Map S.ByteString Int)
countUnigrams = foldLines (\ m s -> M.insertWith (+) s 1 m) M.empty

main :: IO ()
main = do
  (f:_) <- getArgs
  openFile f ReadMode >>= countUnigrams >>= print . M.toList
It seems to perform about 3x worse than the iteratee method in terms of time, and worse in terms of space :-( On Brandon's War & Peace example, hGetLine uses 1.565 seconds for the small file, whereas my iteratee method uses 1.085s for the small file, and around 2 minutes for the large file. For the large file, the code above starts consuming around 2.5GB of RAM, so it clearly has a space leak somewhere. Where, I don't know.

If you want to try it out, here's a short command line to make a test corpus the way Brandon made one:

+++
wget 'http://www.gutenberg.org/files/2600/2600.zip'; unzip 2600.zip; touch wnp100.txt; for i in {1..100}; do echo -n "$i "; cat 2600.txt >> wnp100.txt; done; echo "Done."
+++

Note that, as I detailed in my prior email to Brandon, even if you do end up with a (supposedly) non-leaking program for this example corpus, that doesn't mean it'll scale well to real world data.

I also tried sprinkling strictness annotations throughout your above code, but I failed to produce good results :-(
We definitely need more accessible material on how to reliably write fast Haskell code. There are those among us who can, but it shouldn't be necessary to learn it the way they did (i.e. by lots of tinkering, learning from the elders, etc). I'd like to write a 60 (or so) page tutorial on the subject, but haven't found the time.
I'd be an eager reader :-) Please do announce it on -cafe or the "usual places" should you ever come around to writing it! I, unfortunately, don't really have any contact to "the elders," apart from what I read on their respective blogs…
In addition to RWH, perhaps the slides from the talk on high-performance Haskell I gave could be useful:
http://blog.johantibell.com/2010/09/slides-from-my-high-performance-haskell....
Thanks, I'll give it a look later tomorrow! Regards, Aleks PS: Sorry I didn't answer you in #haskell, I ended up having to go afk for a short while. Thanks for all your help!

Hi Aleks,
On Wed, Jun 1, 2011 at 12:14 AM, Aleksandar Dimitrov
I implemented your method, with these minimal changes (i.e. just using a main driver in the same file.)
countUnigrams :: Handle -> IO (M.Map S.ByteString Int)
countUnigrams = foldLines (\ m s -> M.insertWith (+) s 1 m) M.empty

main :: IO ()
main = do
  (f:_) <- getArgs
  openFile f ReadMode >>= countUnigrams >>= print . M.toList
It seems to perform about 3x worse than the iteratee method in terms of time, and worse in terms of space :-( On Brandon's War & Peace example, hGetLine uses 1.565 seconds for the small file, whereas my iteratee method uses 1.085s for the small file, and around 2 minutes for the large file.
That's curious. I chatted with Duncan Coutts today and he mentioned that hGetLine can be a bit slow, as it needs to take a lock on every read and causes some copying, which could explain why it's slower than iteratee, which works in blocks. However, I don't understand why it uses more memory. The ByteStrings returned by hGetLine should have an underlying storage of the same size as the ByteString itself (as reported by length). You can try to verify this by calling 'copy' on the ByteString before inserting it. It looks like hGetLine needs some love.
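
For concreteness, a minimal sketch of that experiment, reusing foldLines from the Ngram module in the earlier mail (S.copy is Data.ByteString's copy; it allocates a fresh buffer holding only the line's bytes):

-- To be added to the Ngram module above, so foldLines, S = Data.ByteString
-- and M = Data.HashMap.Strict are in scope. The only change from
-- countUnigrams is copying each line before using it as a key, so the key
-- cannot keep a larger shared buffer from the underlying read alive.
countUnigramsCopy :: Handle -> IO (M.HashMap S.ByteString Int)
countUnigramsCopy = foldLines (\m s -> M.insertWith (+) (S.copy s) 1 m) M.empty

If this version shows noticeably lower residency than the original, the extra memory was ByteString keys retaining larger buffers.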
I also tried sprinkling strictness annotations throughout your above code, but I failed to produce good results :-(
The strictness of the code I gave should be correct. The problem should be elsewhere.
I, unfortunately, don't really have any contact to "the elders," apart from what I read on their respective blogs…
You and everyone else. :) I just spent enough time talking to people on IRC, reading good code, blogs and mailing list posts. I think Bryan described the process pretty well in his CUFP keynote: http://www.serpentine.com/blog/2009/09/23/video-of-my-cufp-keynote/ Cheers, Johan

Since frequency counts are an important use of map-like data structures, I did a brief test of the available options. First using regular strings for input, and Data.Map.fromListWith - i.e. the operational bit being:

freqs :: [String] -> M.Map String Int
freqs = M.fromListWith (+) . map (,1)

This runs on a 14M corpus consisting of the King James Bible, the collected works of Shakespeare, and War and Peace.

./freqstr1 +RTS -s
  5,093,386,368 bytes allocated in the heap
  2,682,667,904 bytes copied during GC
    261,110,368 bytes maximum residency (20 sample(s))
      9,018,000 bytes maximum slop
            623 MB total memory in use (10 MB lost due to fragmentation)
./freqstr1 +RTS -s  21.43s user 0.78s system 99% cpu 22.285 total

Kinda expensive, 250MB to store word frequencies of 14MB text. Now, changing to

freqs :: [String] -> M.Map String Int
freqs = foldl' (\m w -> M.insertWith' (+) w 1 m) M.empty

i.e. using strict insertion, avoiding the buildup of lazy thunks for the counts.

./freqstr2 +RTS -s   -- strings, using strict insertion
  4,754,110,096 bytes allocated in the heap
  2,089,527,240 bytes copied during GC
     27,039,112 bytes maximum residency (66 sample(s))
        613,192 bytes maximum slop
             80 MB total memory in use (2 MB lost due to fragmentation)
./freqstr2 +RTS -s  17.48s user 0.13s system 99% cpu 17.665 total

This reduced maximum memory consumption to one tenth, still bigger than the input corpus, but clearly not too bad. A bit faster, too, in spite of probably doing more work.

Using ByteStrings instead, first fromListWith:

./freq +RTS -s
(Just 77432,113931)
  3,880,059,568 bytes allocated in the heap
  1,467,507,808 bytes copied during GC
    174,573,752 bytes maximum residency (14 sample(s))
      8,222,600 bytes maximum slop
            385 MB total memory in use (6 MB lost due to fragmentation)
./freq +RTS -s  14.26s user 0.49s system 99% cpu 14.798 total

About half the memory of Strings, and 25% faster.

With strict insertion:

./freq2 +RTS -s   -- map using strict insertion
  3,761,614,312 bytes allocated in the heap
    849,806,000 bytes copied during GC
     23,950,328 bytes maximum residency (35 sample(s))
      2,376,904 bytes maximum slop
             58 MB total memory in use (1 MB lost due to fragmentation)
./freq2 +RTS -s  11.14s user 0.13s system 99% cpu 11.295 total

Parallel to the String case, this is a lot more frugal with memory, and 30% faster.

Now, I tried Data.HashMap from the hashmap library:

./freqH1 +RTS -s   -- hashmap using fromListWith
  4,552,922,784 bytes allocated in the heap
  2,990,287,536 bytes copied during GC
    401,247,912 bytes maximum residency (14 sample(s))
     42,098,016 bytes maximum slop
            957 MB total memory in use (15 MB lost due to fragmentation)
./freqH1 +RTS -s  15.68s user 1.53s system 99% cpu 17.277 total

./freqH2 +RTS -s   -- hashmap using foldl' insertWith
  4,518,146,968 bytes allocated in the heap
  2,986,973,352 bytes copied during GC
    394,502,832 bytes maximum residency (14 sample(s))
     41,020,040 bytes maximum slop
            957 MB total memory in use (15 MB lost due to fragmentation)
./freqH2 +RTS -s  15.86s user 1.62s system 99% cpu 17.537 total

HashMap doesn't provide a strict insertWith, so this is similar to the lazy insertions above. A bit worse, actually, probably due to the overhead of hashing.

Then, I discovered that Johan's hashmap is a different library, and thought I'd try that too for completeness.

./freqHS +RTS -s   -- hashmap strict (unordered-containers)
  2,628,628,752 bytes allocated in the heap
    945,571,872 bytes copied during GC
     26,635,744 bytes maximum residency (32 sample(s))
      2,433,504 bytes maximum slop
             66 MB total memory in use (1 MB lost due to fragmentation)
./freqHS +RTS -s  6.90s user 0.16s system 99% cpu 7.082 total

Memory residency like the other strict versions, but really fast, probably due to faster comparisons of hash values vs comparisons of strings.

Conclusion: make sure you are using a strict map, and if your keys are strings or otherwise have expensive comparisons, unordered-containers is the library for you.

-k

PS: I also tried mapping 'copy' on the input words to avoid storing large slices of the input, but it only worsened things:

./freqHS3 +RTS -s
(Just 77432,113931)
  3,109,585,024 bytes allocated in the heap
    936,724,184 bytes copied during GC
     87,831,888 bytes maximum residency (19 sample(s))
      8,835,440 bytes maximum slop
            164 MB total memory in use (3 MB lost due to fragmentation)
./freqHS3 +RTS -s  12.71s user 0.31s system 99% cpu 13.060 total

Perhaps if you managed to only copy new words it would look better?

PPS: I tried to be careful juggling the results around, but there's always the possibility of a mistake. Caveat lector! (Or should that be 'cave scriptor'?)

PPPS: There are some small interface annoyances around, it'd be nice if I could experiment with this by only changing the imports. Would it be possible to agree on an interface to map-like structures?

-- 
If I haven't seen further, it is by standing in the footprints of giants
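
Regarding the PPPS: as a sketch of what such a shared interface could look like (a hypothetical type class and names, not an existing library API), here is just enough to write the frequency-count fold once and instantiate it with either containers or unordered-containers. It uses today's Data.Map.Strict; with 2011-era containers you would use Data.Map and insertWith' instead.

{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances #-}
module FreqInterface where

import           Data.List (foldl')
import qualified Data.Map.Strict as M       -- containers
import qualified Data.HashMap.Strict as H   -- unordered-containers
import           Data.Hashable (Hashable)

-- Hypothetical interface: just enough operations for a frequency count.
class MapLike m k where
  emptyM      :: m k Int
  insertWithM :: (Int -> Int -> Int) -> k -> Int -> m k Int -> m k Int
  toListM     :: m k Int -> [(k, Int)]

instance Ord k => MapLike M.Map k where
  emptyM      = M.empty
  insertWithM = M.insertWith
  toListM     = M.toList

instance (Eq k, Hashable k) => MapLike H.HashMap k where
  emptyM      = H.empty
  insertWithM = H.insertWith
  toListM     = H.toList

-- The benchmarked fold, written once; the map implementation is chosen
-- by the caller's type annotation.
freqsWith :: MapLike m k => [k] -> m k Int
freqsWith = foldl' (\m w -> insertWithM (+) w 1 m) emptyM

A call site picks the implementation with an annotation, e.g. freqsWith ws :: M.Map String Int, or the HashMap instance for ByteString keys (hashable provides that instance).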

By the way, what is the advantage of using iteratees here? For my testing, I just used:

main = printit . freqs . B.words =<< B.readFile "words"

(where 'printit' writes some data to stdout just to make sure stuff is evaluated, and you've already seen some 'freqs' examples)

I have a bunch of old code, parsers etc, which are based on the 'readFile' paradigm:

type Str = Data.ByteString.Lazy.Char8.ByteString -- usually

decodeFoo :: Str -> Foo
encodeFoo :: Foo -> Str

readFoo = decodeFoo . readFile
writeFoo f = writeFile f . encodeFoo
hReadFoo = decodeFoo . hRead
  : (etc)

This works pretty well, as long as Foo is strict enough that you don't retain all or huge parts of input, and as long as you can process input in a forward, linear fashion. And, like my frequency count above, I can't really see how this can be made much simpler.

I haven't used iteratees or enumerators in anger, but it appears to me that they are most useful when the input is unpredictable or needs to be controlled in some way - for instance, when recv() can return blocks of data that may be too little or too much.

Would there be any great advantage to rewriting my stuff to use iterators? Or at least, use iterators for new stuff? As I see it, iterators are complex and the dust is only starting to settle on implementations and interfaces, and will introduce more dependencies. So my instinct is to stick with the worse-is-better approach, but I'm willing to be educated.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants

Hi Ketil,
By the way, what is the advantage of using iteratees here? For my testing, I just used:
My initial move to iteratees was more a clutch call I made when I was still using bytestring-trie, and was having immense memory consumption problems. bytestring-trie uses strict byte strings as an index, and since I was getting only lazy byte strings, the only way to make them strict would be to use (S.concat . L.toChunks) (L and S being the lazy/strict byte string imports,) which felt *wrong*.

In short, I thought iteratee would give me enough magic fairy dust to actually have decent control over how much data I'm holding in RAM at any given point — that was not the case, since I didn't know about the pointer mechanics of strict ByteStrings and hence was oblivious to the bad impact that would have on garbage collection performance.

Even so, I think I can still justify using iteratees in the current design:

a) I don't like lazy IO (conceptually,)
b) I'm gonna write a left-fold somewhere anyway, might as well use a decent infrastructure for it
c) I can strictly control the chunk size, and I'm not going to have any bad effects with accidental eager evaluation somewhere down the pipe.

c) being the only "legitimate" reason (though the reason for a) is c) ) — adjusting the chunk size might actually yield noticeable performance differences when reading through files that are well into the realm of gigabytes. And the chunk size "limit" will protect me from an accidental strict fold or so that would leave me with a 4GB file in memory.

About a): Lazy IO just doesn't "feel" right for me. I want my pure computations to actually be pure. If I put a ' on one of my functions *within* my pure code, this might have *side effects* — now, instead of reading in only part of the file, this will demand the *whole* file, and that is *quite* a side effect! So, suddenly I have to worry about side effects in my pure code. Ugh.

That's why I'm going to continue using iteratees. I don't know if that's the right justification, but it's a "hey, it works for me!" justification I can comfortably live with. Besides, I don't think the iteratee interface is all that opaque. I found arrows in HXT, for example, much more difficult to deal with conceptually. (That said, I'm still using HDBC over Takusen, because the latter's API just didn't make sense to me.)

Regards,
Aleks
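
(To illustrate the chunk-size point with the driver from the first mail: bumping the buffer from 64K to, say, an arbitrary 4MB is just

main = getArgs >>= fileDriverVBuf (4*1024*1024) f' . head >>= print . T.keys

with everything else unchanged.)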

On 6/2/11 8:59 AM, Aleksandar Dimitrov wrote:
Hi Ketil,
By the way, what is the advantage of using iteratees here? For my testing, I just used:
My initial move to iteratees was more a clutch call I made when I was still using bytestring-trie, and was having immense memory consumption problems.
bytestring-trie also (intentionally) uses ByteString slicing in order to minimize copying. It does so semi-intelligently--- ensuring that of the two sharing options it chooses the one with a shorter "spur". About half the time that spur will be used by continuing down the trie, but you can still end up with unwanted overhead especially if you're intermittently removing keys from the trie. I've been meaning to add functions to remove spurs (because sometimes the memory is more important than the running time) and meaning to add various other upgrades. I've just been too busy with other code for the last year or so. -- Live well, ~wren

At Thu, 02 Jun 2011 13:52:52 +0200, Ketil Malde wrote:
I have a bunch of old code, parsers etc, which are based on the 'readFile' paradigm:
type Str = Data.ByteString.Lazy.Char8.ByteString -- usually
decodeFoo :: Str -> Foo
encodeFoo :: Foo -> Str

readFoo = decodeFoo . readFile
writeFoo f = writeFile f . encodeFoo
hReadFoo = decodeFoo . hRead
  : (etc)
This works pretty well, as long as Foo is strict enough that you don't retain all or huge parts of input, and as long as you can process input in a forward, linear fashion. And, like my frequency count above, I can't really see how this can be made much simpler.
This is fine if you never have parse errors and always read to the end of the file. Otherwise, the code above is incorrect and ends up leaking file descriptors. In general, it is very hard to write parsers that parse every possible input and never fail. Thus, for anything other than a toy program, your code actually has to be:

readFoo path = bracket (openFile path ReadMode) hClose $
                 hGetContents >=> (\s -> return $! decodeFoo s)

Which is still not guaranteed to work if Foo contains thunks, so then you end up having to write:

readFoo path = bracket (openFile path ReadMode) hClose $ \h -> do
  s <- hGetContents h
  let foo = decodeFoo s
  deepseq foo $ return foo

Or, finally, what a lot of code falls back to, inserting gratuitous calls to length:

readFoo path = bracket (openFile path ReadMode) hClose $ \h -> do
  s <- hGetContents h
  length s `seq` return (decodeFoo s)

The equivalent code with the iterIO package would be:

readFoo path = enumFile path |$ fooI

which seems a lot simpler to me...
Would there be any great advantage to rewriting my stuff to use iterators? Or at least, use iterators for new stuff?
In addition to avoiding edge cases like leaked file descriptors and memory, one of the things I discovered in implementing iterIO is that it's really handy to have your I/O functions be the same as your parsing combinators. So iteratees might actually admit a far simpler implementation of decodeFoo/fooI.

More specifically, imagine that you have decodeFoo, and now want to implement decodeBar where a Bar includes some Foos. Unfortunately, having an implementation of decodeFoo in-hand doesn't help you implement decodeBar. You'd have to re-write your function to return residual input, maybe something like:

decodeFooReal :: String -> (Foo, String)

decodeFoo :: String -> Foo
decodeFoo = fst . decodeFooReal

and now you implement decodeBar in terms of decodeFooReal, but you have to pass around residual input explicitly, handle parsing failures explicitly, etc.
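
To make the residual-input point concrete, here is roughly what the explicit threading looks like (Foo, Bar and the decoder bodies below are placeholders following the sketch above, not code from any real library):

data Foo = Foo
data Bar = Bar Foo Foo

-- Placeholder "real" decoder that returns the residual input.
decodeFooReal :: String -> (Foo, String)
decodeFooReal s = (Foo, s)

-- Without a monad, every caller threads the leftovers by hand
-- (and would also have to propagate failures by hand):
decodeBarReal :: String -> (Bar, String)
decodeBarReal s0 =
  let (foo1, s1) = decodeFooReal s0
      (foo2, s2) = decodeFooReal s1
  in  (Bar foo1 foo2, s2)

-- In an iteratee/parser monad the same composition is just
--   barI = Bar <$> fooI <*> fooI
-- with residual input and failure handled by the monad.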
As I see it, iterators are complex and the dust is only starting to settle on implementations and interfaces, and will introduce more dependencies. So my instinct is to stick with the worse-is-better approach, but I'm willing to be educated.
I fully agree with the point about dependencies and waiting for the dust to settle, though I hope a lot of that changes in a year or so. However, iterIO should already significantly reduce the complexity. David

dm-list-haskell-cafe@scs.stanford.edu writes:
leaking file descriptors
...until they are garbage collected. I tend to consider the OS fd limitation an OS design error - I've no idea why there should be some arbitrary limit on open files, as long as there is plenty of memory around to store them. But, well, yes, it is a real concern.
parsers that parse every possible input and never fail.
I guess I need to look into how iteratees handle parse failure. Generally, for me a parse failure means program failure - either the data is corrupt, or the program is incorrect.
Thus, for anything other than a toy program, your code actually has to be:
readFoo path = bracket (openFile path ReadMode) hClose $
                 hGetContents >=> (\s -> return $! decodeFoo s)
No, I can't do that in general, because I want to process a Foo (which typically is or contains a list of records) incrementally. I can't assume the file or its data are small enough to fit in memory. It is important that readFoo returns a structure that can be consumed lazily - or perhaps it can be iteratee all the way up.
Which is still not guaranteed to work if Foo contains thunks, so then you end up having to write:
readFoo path = bracket (openFile path ReadMode) hClose $ \h -> do
  s <- hGetContents h
  let foo = decodeFoo s
  deepseq foo $ return foo
I think this - or rather, having Foo's records be strict - is a good idea anyway. The previous discussion about frequency counts seems to indicate that this goes equally well for iteratees. Thanks for the elaborate answer. -k -- If I haven't seen further, it is by standing in the footprints of giants

On Fri, Jun 3, 2011 at 03:53, Ketil Malde
dm-list-haskell-cafe@scs.stanford.edu writes:
leaking file descriptors
...until they are garbage collected. I tend to consider the OS fd limitation an OS design error - I've no idea why there should be some arbitrary limit on open files, as long as there is plenty of memory around to store them. But, well, yes, it is a real concern.
In the case of Unix, available memory was indeed the motivating factor. The DEC minicomputers it was developed on didn't have a whole lot of memory, plus older Unix preallocated the per-process file structures as part of the (struct proc) for speed (again, old slow systems). The modern reason for limits is mostly to avoid runaway processes. Usually the hard limit is set pretty high but the soft limit is lower.

Hello Aleksandar, It is possible that the iteratees library is space leaking; I recall some recent discussion to this effect. Your example seems simple enough that you might recompile with a version of iteratees that has -auto-all enabled. Unfortunately, it's not really a safe bet to assume your libraries are leak free, and if you've pinpointed it down to a single line, and there doesn't seem a way to squash the leak, I'd bet it's the library's fault. Edward

On Tue, May 31, 2011 at 02:13:14PM -0400, Edward Z. Yang wrote:
It is possible that the iteratees library is space leaking; I recall some recent discussion to this effect. Your example seems simple enough that you might recompile with a version of iteratees that has -auto-all enabled.
If I understand you correctly, you imply that I should try compiling iteratee with profiling, no? I did install the iteratee library with profiling support (I have the cabal profiling flag globally set in my cabal config,) but my profiles so far seem to be blaming LAGging ByteStrings and HashMaps. I, unfortunately, do not know how I would test iteratee itself for a space leak here.
Unfortunately, it's not really a safe bet to assume your libraries are leak free, and if you've pinpointed it down to a single line, and there doesn't seem a way to squash the leak, I'd bet it's the library's fault.
Since my knowledge of Haskell, and, in particular, high-performance Haskell, is very lacking, my current m.o. is to blame myself :-) It might be iteratee, but unfortunately, I have not found something that gives me better performance than iteratee yet. Regards, Aleks

I can't reproduce heap usage growing with the size of the input file.

I made a word list from Project Gutenberg's copy of "War and Peace" by

tr -sc '[[:alpha:]]' '\n' < pg2600.txt > words.txt

Using 1, 25, or 1000 repetitions of this ~3MB wordlist shows about 100MB of address space used according to top, and no more than 5MB or so of haskell heap used according to the memory profile, with a flat memory profile.

Is your memory usage growing with the size of the input file, or the size of the histogram?

I was worried data sharing might mean your keys retain entire 64K chunks of the input. However, it seems enumLines depends on the StringLike ByteString instance, which just converts to and from String. That can't be efficient, but I suppose it avoids excessive sharing.

The other thing that occurs to me is that the total size of your keys would also be approximately the size of the input file if you were using plain text without each word split onto a separate line.

Brandon

On Tue, May 31, 2011 at 11:43:27AM -0700, Brandon Moore wrote:
I can't reproduce heap usage growing with the size of the input file.
I made a word list from Project Gutenberg's copy of "War and Peace" by
tr -sc '[[:alpha:]]' '\n' < pg2600.txt > words.txt
Using 1, 25, or 1000 repetitions of this ~3MB wordlist shows about 100MB of address space used according to top, and no more than 5MB or so of haskell heap used according to the memory profile, with a flat memory profile.
This will lead to very small variance in your data. The effect I'm experiencing is so small in this case, that it's barely observable (but it is, see below.)
Is your memory usage growing with the size of the input file, or the size of the histogram?
While the histogram is naturally growing with the size of the input file, memory
usage seems to be proportional mainly to the histogram. It is clear that, due to
the effect of the Long Tail, the histogram is going to constantly grow in a real
setting, as opposed to just replicating the same data. In your test case, the
histogram is *not* growing with the size of the input file.
The memory usage is proportional to the histogram, which is proportional to the
file size. That is not the problem. The problem is that, compared to the size
of the histogram, the memory consumption is *inadequately* high. Here's some
more data, using your Tolstoi example:
du file.txt
344M file.txt
I was worried data sharing might mean your keys retain entire 64K chunks of the input. However, it seems enumLines depends on the StringLike ByteString instance, which just converts to and from String.
Ouch, that sounds like something worth fixing.
The other thing that occurs to me is that the total size of your keys would also be approximately the size of the input file if you were using plain text without each word split onto a separate line.
Well, I am not. The corpus is a word-per-line corpus, I'm reading a word per line, and adding that to my hash map. This should never result in a data structure even close to the size of the original corpus. It could be, in a very unrealistic worst case scenario. But even a corpus of 30GB of Poe and Heidegger isn't going to make that happen. Furthermore, mine is not such a scenario at all. As I said, if you reduce the corpus to a set of words (i.e. a set of unigrams) you get a 40MB file from a 1.4GB corpus.

Why is it that, in order to create that 40MB file from a 1.4GB corpus, my trivial little program needs somewhere north of 6-8 GB of RAM? In this trivial example for War and Peace, why is it that, in order to create the unigram table for War and Peace, which ends up being a mere 201KB, we're chomping through 5MB on average, and nearly 10MB max? That's at least 25 times more than we actually *should* have (yes, I know that the RTS is somewhere there, too, but I think it's not a problem to ignore that for now.)

Regards,
Aleks

Wait, do ByteStrings show up on a heap profile, if the space is allocated with malloc?

Anyway, I think my tests still show that the memory used by the process doesn't grow simply by adding more data, if you are no longer adding keys to the map.

Aleksandar Dimitrov
Now, here's some observations: on a 75M input file (minuscule, compared to what I actually need) this program will eat 30M of heap space (says profiling) and return in 14 secs.
I have two problems with that: a) that's too much heap space, b) the actual memory residency is *much* worse.
30M isn't a lot these days. How does it scale?
ad b) average memory residency is at 38MB (this is OK, given heap consumption) but max residency is at 130MB, which is unacceptable to me (remember that I need to run this on files *much* bigger than just 75M.)
I think max residency (depending on how you measure) can be twice the heap size due to using copying GC. If you run short of memory, the runtime will switch to compacting GC which will be slower but use less memory.
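
(Concretely, assuming the binary is built with -rtsopts and is called wordcount (a made-up name), the relevant RTS knobs would be -M, which caps the heap and enables compaction as the limit is approached, and -c, which requests the compacting collector for the oldest generation outright:

./wordcount corpus.txt +RTS -s -M1g
./wordcount corpus.txt +RTS -s -c

Flag names are from GHC's RTS documentation; check the users' guide for the exact behaviour of your GHC version.)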
I have tried and tried again to avoid writing programs in Haskell that would leak space like BP likes to leak oil.
I know the feeling. I think making a frequency table ought to be as simple as

... M.fromListWith (+) . map (,1) . words =<< readFile ...

or at worst

... foldl' (\m w -> M.insertWith' (+) w 1 m) M.empty . words =<< readFile ...

(which evaluates things strictly, and at least in my small tests, seems to use quite a bit less heap space).

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
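
Pulling the thread's suggestions together (a strict HashMap for cheap key comparisons, a strict left fold over lines, skipping empty lines as in the original program, and optionally copying keys so they don't retain input buffers), a minimal complete version might look like the sketch below. This is a synthesis, not code posted in the thread, and as Ketil's PS shows, the copy may help or hurt depending on the input.

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Char8 as S
import qualified Data.HashMap.Strict as M
import           System.Environment (getArgs)
import           System.IO

-- Strict left fold over the lines of a handle, as in Johan's mail.
foldLines :: (a -> S.ByteString -> a) -> a -> Handle -> IO a
foldLines f z0 h = go z0
  where
    go !z = do
      eof <- hIsEOF h
      if eof
        then return z
        else do
          line <- S.hGetLine h
          go (f z line)

-- Count unigrams; copy each key so it cannot retain a larger input buffer,
-- and skip empty lines as the original maybeIncrement did.
countUnigrams :: Handle -> IO (M.HashMap S.ByteString Int)
countUnigrams = foldLines step M.empty
  where
    step m s
      | S.null s  = m
      | otherwise = M.insertWith (+) (S.copy s) 1 m

main :: IO ()
main = do
  (path:_) <- getArgs
  withFile path ReadMode $ \h -> countUnigrams h >>= print . M.toList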
participants (9)
- Aleksandar Dimitrov
- Brandon Allbery
- Brandon Moore
- dm-list-haskell-cafe@scs.stanford.edu
- Edward Z. Yang
- Johan Tibell
- Ketil Malde
- malcolm.wallace
- wren ng thornton