Re: [Haskell-cafe] How on Earth Do You Reason about Space?

1 Jun 2011

      Just out of interest, did you try reading the input as plain old Strings?  They may be unfashionable these days, and have their own type of badness in space and time performance, but might perhaps be a win over ByteStrings for extremely large datasets.
Regards,
    Malcolm

On 01 Jun, 2011,at 02:49 PM, Aleksandar Dimitrov  wrote:

Hi John,
...
I think the issue is data sharing, as Brandon mentioned. A bytestring
consists of an offset, length, and a pointer. You're using a chunk size of
64k, which means the generated bytestrings all have a pointer to that 64k of
data. Suppose there's one new word in that 64k, and it's near the beginning
of the chunk. Even though the word may only be a few characters long, it'll
reference the entire chunk and keep it from being GC'd.
This seems to be the issue indeed! If the bytestrings on the hash map are
holding references to the chunks, it is clear that we're going to consume memory
scaling with the size of the input file, in case there are *any* new chunks
generated.

As I understand it, this what not a problem with the multiple copies of War and
Peace, because all byte strings were already found on the hash table! On reading
new input, old entries were found on the hash table, so only old chunks were
kept in memory, the new ones could be gc'ed.

In *realistic* data, however, the Long Tail is the reason that, after a while,
some chunks of input would only be kept because there were a few byte strings
referencing them. New words are rare, but they need not occur more frequently
than once per chunk, in order to keep the whole chunk in memory, even though the
chunk was mostly useless (most other data in this chunk would already be on the
hash map.)
...
There are a few solutions to this. The first is to make a copy of the
bytestring so only the required data is retained. In my experiments this
wasn't helpful, but it would depend on your corpus. The second is to start
with smaller chunks. Using a chunk size of 1024 worked fairly well for me.
If your corpus is similar to natural language, I think it'll probably work
better for you as well.
I think I solved this problem elegantly: I used Data.Text as hash map keys,
instead of Data.ByteString. See the modified program below:
...
import qualified Data.Iteratee as I
import Data.IterateeChar
import Data.Iteratee.IO
import qualified Data.HashMap.Strict as T
import Data.Ord (comparing)
import Data.List (sortBy)
import System.Environment (getArgs)
import qualified DataByteString as S
import qualified Data.Text as X
import Data.Text.Encoding (decodeUtf8)
type Wordcounts = T.HashMap X.Text Int
f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
f' = I.joinI $ (enumLinesBS I.><> I.filter (not.S.null)) $ I.foldl'
(\t s -> T.insertWith (+) (convert s) 1 t) T.empty
where convert = decodeUtf8
main :: IO ()
main = getArgs >>= fileDriverVBuf 65536 f'.head
...
...
= mapM_ (\(w,c)-> putStrLn $ X.unpack w ++ "\t" ++ show c)sortM
where sortM = sortBy (comparing snd) . T.toList
Initial benchmarks on the realistic 100MB Gutenberg corpus I downloaded with my
script yesterday report the following: htop says 120M memory residency towards
the end of the life-cycle.

<>

19MB avg, 44MB max residency, 133M in use (which is similar to what htop told
me) and the heap profile clearly shows allocation and deallocation of the
bytestrings: http://imgur.com/U1nyw (I'm attributing the ruggedness of the
profile with the frequent little spikes to the iteratee chunk allocation and
deallocation.) It seems I can't get rid of 50% of the heap being lag state. I
don't quite understand yet what that is. I also don't know what INHERENT_USE is.

But in any case, I now have a program that I can reasonably expect to run if I
could fit the input file into memory. I might try to implement an analogous
program in C++ or Java, just to see whether that would do better or similarly in
terms of memory consumption.
...
Note that Johan's Ngram code also only keeps the minimum required data,
giving it a good memory profile. I didn't notice this last night because I
was testing with different data, and unfortunately the peculiar distribution
of that data masked this problem.
This is kind of the big problem here: whether or not you'll see the particular
behaviour I was so upset about seemed to depend on the corpus' distributional
properties.

In any case, I would like to thank you all for helping me understand and address
the issue. I probably still have a long way to go in terms of understanding
space-behaviour of Haskell programs, but at least I don't see ByteStrings as
this black box of efficiency anymore, but actually understand how they're
structured, and what they're good at, and what they aren't good at.

(take home lesson: Data.Text is really nice. Also: if iteratee has a space leak,
I probably didn't hit it, really. But: if reading byte-strings, one should mind
the pointers that byte-strings actually are.)

Regards,
Aleks
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] How on Earth Do You Reason about Space?

malcolm.wallace