Re: [Haskell-cafe] NLP libraries and tools?

On 7/6/11 6:45 PM, Aleksandar Dimitrov wrote:
One hint, if you ever find yourself reading in quantitative linguistic data with Haskell: forget lazy IO. Forget strict IO too, unless your documents are never bigger than a few hundred megabytes. If you're not keeping the whole document in memory but are keeping some pieces of it, never keep them around as ByteStrings; use Text or SmallString instead (ByteStrings will invariably leak space in this scenario). Learn how to use Iteratees and use them judiciously.
I definitely agree with the iteratees comment, but I'm curious about the leaks you mention. I haven't run into leakiness issues (that I'm aware of) in my use of ByteStrings for NLP.

--
Live well,
~wren

On Wed, Jul 06, 2011 at 07:27:10PM -0700, wren ng thornton wrote:
I definitely agree with the iteratees comment, but I'm curious about the leaks you mention. I haven't run into leakiness issues (that I'm aware of) in my use of ByteStrings for NLP.
The issue is this: a strict ByteString that was sliced out of a larger chunk retains a pointer to the original chunk. The chunk is probably bigger than you'd want to keep in memory if you, say, wanted to keep just one or two words. In my case, the chunk was some 65K (that was my Iteratee chunk size).

There's a thread about it here, where I was fairly desperate to find a solution to space behaviour I couldn't understand at all: http://bit.ly/rharIV

The thread is fairly long, and in the aftermath Johan Tibell wrote two very nice posts about the memory consumption of his unordered-containers (which I found invaluable) and of common data types: blog: http://blog.johantibell.com/

But I think, with today's RAM, this only shows if you try to train models on huge corpora, like Baroni et al.'s *WAC corpora (which I was using).

Regards,
Aleks

PS: Another nice thing about iteratees is that writing attoparsec parsers is often easy, bordering on trivial, and that one can transform them into iteratees. No need to write your own parsing iteratee (which can be a bit of a pain in the butt because of all the continuations and the… sometimes idiosyncratic documentation that I just couldn't wrap my head around. Might also just be me, though.)
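
For readers who want to see the retention issue concretely, here is a minimal sketch, not the code from the thread. It assumes a 64K chunk whose first space-delimited token is the only part worth keeping; the function names (firstWordShared, firstWordCopied, firstWordText) are made up for illustration. It relies only on the bytestring and text packages: a slice such as the result of takeWhile shares the chunk's buffer, Data.ByteString's copy allocates a fresh buffer of just the slice's size, and decodeUtf8 likewise builds a new Text value (assuming the bytes are valid UTF-8).

```haskell
module Main where

import qualified Data.ByteString.Char8 as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | First space-delimited token of a chunk. The result is a slice:
-- it points into the chunk's buffer, so holding on to it keeps the
-- entire chunk reachable.
firstWordShared :: B.ByteString -> B.ByteString
firstWordShared = B.takeWhile (/= ' ')

-- | The same token copied into a fresh buffer of its own size, so the
-- original chunk can be garbage collected once nothing else refers to it.
firstWordCopied :: B.ByteString -> B.ByteString
firstWordCopied = B.copy . firstWordShared

-- | The same token as Text. Decoding allocates a new array holding only
-- the decoded token. Assumes the input is valid UTF-8.
firstWordText :: B.ByteString -> T.Text
firstWordText = TE.decodeUtf8 . firstWordShared

main :: IO ()
main = do
  -- Hypothetical 64K chunk: one short word followed by padding.
  let chunk = B.append (B.pack "word ") (B.replicate 65531 'x')
  print (B.length chunk)
  print (firstWordShared chunk)
  print (firstWordCopied chunk)
  print (firstWordText chunk)
```

All three functions return the same four characters; the difference is only in what the garbage collector has to keep alive. In a long-running pass over a corpus, a table keyed by firstWordShared results could pin one whole chunk per entry, whereas the copied or Text versions retain only the tokens themselves.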