
Hi Aleksandar,
On Tue, May 31, 2011 at 6:10 PM, Aleksandar Dimitrov wrote:
> Say, we have an input file that contains one word per line. I want to find all unigrams (unique words) in that file and associate with them the number of times they occurred in the file. This would allow me, for example, to make a list of word frequencies in a given text.
Here's how I would do it:

{-# LANGUAGE BangPatterns #-}
module Ngram (countUnigrams) where

import qualified Data.ByteString as S
import qualified Data.HashMap.Strict as M
import System.IO

foldLines :: (a -> S.ByteString -> a) -> a -> Handle -> IO a
foldLines f z0 h = go z0
  where
    go !z = do
      eof <- hIsEOF h
      if eof
        then return z
        else do
          line <- S.hGetLine h
          go $ f z line
{-# INLINE foldLines #-}

-- Example use
countUnigrams :: IO (M.HashMap S.ByteString Int)
countUnigrams = foldLines (\ m s -> M.insertWith (+) s 1 m) M.empty stdin
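To actually run it you need a small driver. A minimal sketch (the Main module below and the choice to print only the map size are mine, just for illustration; it assumes the unordered-containers package is installed):

module Main (main) where

import qualified Data.HashMap.Strict as M
import Ngram (countUnigrams)

-- Count the words on stdin and report how many distinct words were seen.
main :: IO ()
main = do
  m <- countUnigrams
  putStrLn $ "distinct words: " ++ show (M.size m)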
> RANT
> I have tried and tried again to avoid writing programs in Haskell that leak space like BP likes to leak oil. However, I have yet to produce a single program that does anything at all and at the same time consumes less memory than there is actual data in the input file.
> It is very disconcerting to me that I seem to be unable, even after quite some practice, to identify space leaks in trivial programs like the one above. I know of no good resource to educate myself in that regard. I have read the GHC manual, RWH's chapter on profiling, and "Inside T5"'s recent series on the Haskell heap, but no dice. Even if I can clearly see the exact line where at least some of the leaking happens (as I can in this case), it seems impossible for me to prevent it.
> *thank you very much* for reading this far. This is probably a mostly useless email anyhow; I just had to get it off my chest. Maybe, just maybe, someone among you will have a crucial insight that will save Haskell for me :-) But currently I see no justification not to start my next project in Lua, Python or Java. Sure, Haskell code is pretty, and it's fun, but if I can't actually *run* it, why bother? (Yes, this isn't the first time I've run into this problem …)
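To make the thunk build-up a bit more concrete, here's a sketch of my own (the module name and the list-based framing are just for illustration; it assumes the unordered-containers package): the leak in code like this typically comes from accumulating unevaluated (+) applications inside the map, which is what a lazy insertWith does and the strict one avoids.

module LeakDemo where

import Data.List (foldl')
import qualified Data.ByteString as S
import qualified Data.HashMap.Lazy as HL
import qualified Data.HashMap.Strict as HS

-- Leaky: a lazy fold over a lazy map defers all the (+) applications,
-- so memory grows with the size of the input.
leakyCount :: [S.ByteString] -> HL.HashMap S.ByteString Int
leakyCount = foldl (\m s -> HL.insertWith (+) s 1 m) HL.empty

-- Strict: foldl' forces the map at every step and the strict insertWith
-- forces each count, so memory stays proportional to the number of
-- distinct words rather than the number of lines.
strictCount :: [S.ByteString] -> HS.HashMap S.ByteString Int
strictCount = foldl' (\m s -> HS.insertWith (+) s 1 m) HS.empty

The bang pattern in foldLines plays the same role as foldl' here: it forces the accumulator on every line instead of deferring all the work until the end.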
We definitely need more accessible material on how to reliably write fast Haskell code. There are those among us who can, but it shouldn't be necessary to learn it the way they did (i.e. by lots of tinkering, learning from the elders, etc.). I'd like to write a 60-page (or so) tutorial on the subject, but haven't found the time. In addition to RWH, perhaps the slides from the talk I gave on high-performance Haskell could be useful: http://blog.johantibell.com/2010/09/slides-from-my-high-performance-haskell....

Cheers,
Johan