
Just out of interest, did you try reading the input as plain old Strings? They may be unfashionable these days, and have their own type of badness in space and time performance, but might perhaps be a win over ByteStrings for extremely large datasets.
Regards,
Malcolm
On 01 Jun, 2011,at 02:49 PM, Aleksandar Dimitrov
I think the issue is data sharing, as Brandon mentioned. A bytestring consists of an offset, length, and a pointer. You're using a chunk size of 64k, which means the generated bytestrings all have a pointer to that 64k of data. Suppose there's one new word in that 64k, and it's near the beginning of the chunk. Even though the word may only be a few characters long, it'll reference the entire chunk and keep it from being GC'd.
This seems to be the issue indeed! If the bytestrings on the hash map are holding references to the chunks, it is clear that we're going to consume memory scaling with the size of the input file, in case there are *any* new chunks generated. As I understand it, this what not a problem with the multiple copies of War and Peace, because all byte strings were already found on the hash table! On reading new input, old entries were found on the hash table, so only old chunks were kept in memory, the new ones could be gc'ed. In *realistic* data, however, the Long Tail is the reason that, after a while, some chunks of input would only be kept because there were a few byte strings referencing them. New words are rare, but they need not occur more frequently than once per chunk, in order to keep the whole chunk in memory, even though the chunk was mostly useless (most other data in this chunk would already be on the hash map.)
There are a few solutions to this. The first is to make a copy of the bytestring so only the required data is retained. In my experiments this wasn't helpful, but it would depend on your corpus. The second is to start with smaller chunks. Using a chunk size of 1024 worked fairly well for me. If your corpus is similar to natural language, I think it'll probably work better for you as well.
I think I solved this problem elegantly: I used Data.Text as hash map keys, instead of Data.ByteString. See the modified program below:
import qualified Data.Iteratee as I import Data.IterateeChar import Data.Iteratee.IO
import qualified Data.HashMap.Strict as T
import Data.Ord (comparing) import Data.List (sortBy) import System.Environment (getArgs)
import qualified DataByteString as S import qualified Data.Text as X import Data.Text.Encoding (decodeUtf8)
type Wordcounts = T.HashMap X.Text Int
f' :: Monad m => I.Iteratee S.ByteString m Wordcounts f' = I.joinI $ (enumLinesBS I.><> I.filter (not.S.null)) $ I.foldl' (\t s -> T.insertWith (+) (convert s) 1 t) T.empty where convert = decodeUtf8
main :: IO () main = getArgs >>= fileDriverVBuf 65536 f'.head
= mapM_ (\(w,c)-> putStrLn $ X.unpack w ++ "\t" ++ show c)sortM where sortM = sortBy (comparing snd) . T.toList
Initial benchmarks on the realistic 100MB Gutenberg corpus I downloaded with my
script yesterday report the following: htop says 120M memory residency towards
the end of the life-cycle.
<
Note that Johan's Ngram code also only keeps the minimum required data, giving it a good memory profile. I didn't notice this last night because I was testing with different data, and unfortunately the peculiar distribution of that data masked this problem.
This is kind of the big problem here: whether or not you'll see the particular behaviour I was so upset about seemed to depend on the corpus' distributional properties. In any case, I would like to thank you all for helping me understand and address the issue. I probably still have a long way to go in terms of understanding space-behaviour of Haskell programs, but at least I don't see ByteStrings as this black box of efficiency anymore, but actually understand how they're structured, and what they're good at, and what they aren't good at. (take home lesson: Data.Text is really nice. Also: if iteratee has a space leak, I probably didn't hit it, really. But: if reading byte-strings, one should mind the pointers that byte-strings actually are.) Regards, Aleks _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe