
On Wed, Jul 06, 2011 at 07:27:10PM -0700, wren ng thornton wrote:
> I definitely agree with the iteratees comment, but I'm curious about the
> leaks you mention. I haven't run into leakiness issues (that I'm aware of)
> in my use of ByteStrings for NLP.
The issue is this: a strict ByteString sliced out of a larger one retains a pointer to the original chunk's buffer, and that chunk is probably much bigger than what you actually want to keep in memory, say one or two words. In my case the chunk was some 65K (that was my Iteratee chunk size). (A small sketch of what goes wrong is at the very end of this mail.) There's a thread about it here, where I was fairly desperate trying to find a solution to space behaviour I couldn't understand at all: http://bit.ly/rharIV

The thread is fairly big, and in the aftermath Johan Tibell posted two very nice posts about the memory consumption of his unordered-containers (which I found invaluable) and of common data types, on his blog: http://blog.johantibell.com/

But I think, with today's RAM, this only shows up if you try to train models on huge corpora, like Baroni et al.'s *WAC corpora (which I was using).

Regards,
Aleks

PS: Another nice thing about iteratees is that writing attoparsec parsers is often easy, bordering on trivial, and that one can transform them into iteratees. No need to write your own parsing iteratee, which can be a bit of a pain in the butt because of all the continuations and the… sometimes idiosyncratic documentation that I just couldn't wrap my head around. (Might also just be me, though.)
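
The kind of parser I mean is no more than the sketch below. The names here are mine, not from the thread, and how you lift the Parser into an iteratee depends on which iteratee library you use; if I remember right, attoparsec-enumerator has iterParser and attoparsec-iteratee has parserToIteratee, but treat those as pointers to check rather than gospel.

    import Data.Attoparsec.ByteString.Char8
    import qualified Data.ByteString.Char8 as B

    -- One whitespace-delimited token.
    token :: Parser B.ByteString
    token = skipSpace >> takeWhile1 (not . isSpace)

    -- A token followed by an integer count, e.g. "the 42".
    tokenCount :: Parser (B.ByteString, Int)
    tokenCount = do
      w <- token
      skipSpace
      n <- decimal
      return (w, n)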
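
PPS: And to make the retention issue above concrete, here is a rough sketch (the function names are mine, nothing beyond Data.ByteString itself): a slice of a strict ByteString is just an offset and length into the parent buffer, so keeping the slice keeps the whole chunk alive; Data.ByteString.copy gives you a fresh buffer holding only the bytes you need.

    import qualified Data.ByteString.Char8 as B

    -- The slice returned by takeWhile is only an offset/length into
    -- 'chunk', so holding on to it keeps the whole 65K chunk alive.
    firstWordSharing :: B.ByteString -> B.ByteString
    firstWordSharing chunk = B.takeWhile (/= ' ') chunk

    -- B.copy allocates a fresh, word-sized buffer, so the big chunk can
    -- be garbage collected once nothing else refers to it.
    firstWordCopied :: B.ByteString -> B.ByteString
    firstWordCopied chunk = B.copy (B.takeWhile (/= ' ') chunk)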