
On Wednesday 01 June 2011 12:13:54, John Lato wrote:
From: Brandon Moore
I was worried data sharing might mean your keys retain entire 64K chunks of the input. However, it seems enumLines depends on the StringLike ByteString instance, which just converts to and from String. That can't be efficient, but I suppose it avoids excessive sharing.
That's true for 'enumLines', however the OP is using 'enumLinesBS', which operates on bytestrings directly.
Data sharing certainly could be an issue here. I tried performing Data.ByteString.copy before inserting the key into the map, but that used more memory. I don't have an explanation for this; it's not what I would expect.
If you don't copy, the small ByteStrings (the words) point into the large chunk, which peacefully rests in memory. A few ripples come from the chunk boundaries, where more often than not a word will have parts in both chunks, a new ByteString is allocated for those. A bit of space is wasted for the Constructor of the large chunk and for the word-separators ('\n' here), and at the boundaries. Not a big deal, the newlines cost one byte per text-word, but the constructors cost five or so machine words per text-word. If you copy each small ByteString from the chunk, a) the chunk stays in memory until it's completely copied [doesn't matter much with chunk sizes of a few k] and b) you get memory fragmentation because of alignment requirements.
The other parameter which affects sharing is the chunk size. I got a much better memory profile when using a chunksize of 1024 instead of 65536.
Oddly enough, when using the large chunksize I saw lower memory usage from Data.Map, but with the small chunksize Data.HashMap has a significant advantage.
John Lato