
Hi Johan,
Here's how I would do it:
I implemented your method, with these minimal changes (i.e. just using a main driver in the same file.)
countUnigrams :: Handle -> IO (M.Map S.ByteString Int) countUnigrams = foldLines (\ m s -> M.insertWith (+) s 1 m) M.empty
main :: IO () main = do (f:_) <- getArgs openFile f ReadMode >>= countUnigrams >>= print . M.toList
It seems to perform about 3x worse than the iteratee method in terms of time, and worse in terms of space :-( On Brandon's War & Peace example, hGetLine uses 1.565 seconds for the small file, whereas my iteratee method uses 1.085s for the small file, and around 2 minutes for the large file. For the large file, the code above starts consuming around 2.5GB of RAM, so it clearly has a space leak somewhere. Where, I don't know. If you want to try it out, here's a short command line to make a test corpus the way Brandon made one: +++ wget 'http://www.gutenberg.org/files/2600/2600.zip'; unzip 2600.zip; touch wnp100.txt; for i in {1..100}; do echo -n "$i "; cat 2600.txt >> wnp100.txt; done; echo "Done. +++ Note, that, as I detailed in my prior email to Brandon, even if you do end up with a (supposedly) non-leaking program for this example corpus, that doesn't mean it'll scale well to real world data. I also tried sprinkling strictness annotations throughout your above code, but I failed to produce good results :-(
We definitely need more accessible material on how to reliably write fast Haskell code. There are those among us who can, but it shouldn't be necessary to learn it in the way they did (i.e. by lots of tinkering, learning from the elders, etc). I'd like to write a 60 (or so) pages tutorial on the subject, but haven't found the time.
I'd be an eager reader :-) Please do announce it on -cafe or the "usual places" should you ever come around to writing it! I, unfortunately, don't really have any contact to "the elders," apart from what I read on their respective blogs…
In addition to RWH, perhaps the slides from the talk on high-performance Haskell I gave could be useful:
http://blog.johantibell.com/2010/09/slides-from-my-high-performance-haskell....
Thanks, I'll give it a look later tomorrow! Regards, Aleks PS: Sorry I didn't answer you in #haskell, I ended up having to go afk for a short while. Thanks for all your help!