
Hi Aleks,
On Wed, Jun 1, 2011 at 12:14 AM, Aleksandar Dimitrov
I implemented your method, with these minimal changes (i.e. just using a main driver in the same file.)
countUnigrams :: Handle -> IO (M.Map S.ByteString Int) countUnigrams = foldLines (\ m s -> M.insertWith (+) s 1 m) M.empty
main :: IO () main = do (f:_) <- getArgs openFile f ReadMode >>= countUnigrams >>= print . M.toList
It seems to perform about 3x worse than the iteratee method in terms of time, and worse in terms of space :-( On Brandon's War & Peace example, hGetLine uses 1.565 seconds for the small file, whereas my iteratee method uses 1.085s for the small file, and around 2 minutes for the large file.
That's curious. I chatted with Duncan Coutts today and he mentioned that hGetLine can be a bit slow as it needs to take a lock in every read and causes some copying, which could explain why it's slower than iteratee which works in blocks. However, I don't understand why it uses more memory. The ByteStrings that are returned by hGetLine should have an underlying storage of the same size as the ByteString (as reported by length). You can try to verify this by calling 'copy' on the ByteString before inserting it. It looks like hGetLine needs some love.
I also tried sprinkling strictness annotations throughout your above code, but I failed to produce good results :-(
The strictness of the code I gave should be correct. The problem should be elsewhere.
I, unfortunately, don't really have any contact to "the elders," apart from what I read on their respective blogs…
You and everyone else. :) I just spent enough time talking to people on IRC, reading good code, blogs and mailing list posts. I think Bryan described the process pretty well in his CUFP keynote: http://www.serpentine.com/blog/2009/09/23/video-of-my-cufp-keynote/ Cheers, Johan