
Matthew Bromberg wrote:
I used what I thought, initially was an elegant contruction technique in Haskell. Something like this do ... sequence $ [ reffill b s | s <- [0..(fi temits)-1], b <- [0..(fi nc)-1]] ...(push list on to matrix stack)
Try the sequence_ (note the underscore) function, it should be a big win here. Cheers, Spencer Janssen
Now thats interesting. I can see that this function is more appropriate since I do not need to retrieve data from the IO monad, but what I don't understand is why it's actually faster. I will give it a try and test it on a large set to see if things change. Thanks for the tip.
The best way I have to explain is to pedantically go through how I would try to understand why it is faster. I hope this is something useful in this message, and nothing that is taken as condescending. Thinking about memory usage and garbage collection in strict language like Java is tricky, and thinking about them in non-strict Haskell is another layer of consideration. ( But in this case it will be quite easy to understand from the code.) I will look at the code: 1. See that sequence and sequence_ are exposed by Prelude 2. Since that is part of the Haskell 98 definition, google a copy of the "Haskell 98 report" at http://www.haskell.org/onlinereport/ 3. Look at "8. Standard Prelude" at http://www.haskell.org/onlinereport/standard-prelude.html 4. Scroll down to sequence and sequence_ to see:
sequence :: Monad m => [m a] -> m [a] sequence = foldr mcons (return []) where mcons p q = p >>= \x -> q >>= \y -> return (x:y)
sequence_ :: Monad m => [m a] -> m () sequence_ = foldr (>>) (return ())
A more generally useful way to look up the actual ghc source code is: 1. In ghci do ":i sequence" to see it comes from Control.Monad 2. look at http://www.haskell.org/ghc/docs/latest/html/libraries/index.html 3. See "Control.Monad" is in the "base" packagage 4. Browse through http://haskell.org/ghc/ to "Developers(Wiki)" and "Getting the Sources" to http://hackage.haskell.org/trac/ghc/wiki/GhcDarcs 5. Follow the link to package "base" via http://darcs.haskell.org/packages/base/ 6. Browse to "Control" and "Monad.hs" to http://darcs.haskell.org/packages/base/Control/Monad.hs 7. Scroll down to sequence and sequence_ to see:
-- | Evaluate each action in the sequence from left to right, -- and collect the results. sequence :: Monad m => [m a] -> m [a] {-# INLINE sequence #-} sequence ms = foldr k (return []) ms where k m m' = do { x <- m; xs <- m'; return (x:xs) }
-- | Evaluate each action in the sequence from left to right, -- and ignore the results. sequence_ :: Monad m => [m a] -> m () {-# INLINE sequence_ #-} sequence_ ms = foldr (>>) (return ()) ms
The implementation code is only 1 or 2 lines, and reveals it is just really useful shorthand for a right fold. Right folds are notorious for being bad memory consumers when they are strict in the second argument of their accumulation function. And indeed this is the problem in this case. Looking at sequence_ first, since it is simpler: It essentially says to put (>>) between all the elements of the list of IO actions which is equivalent to putting the actions one after another in "do" notation. It never needs to remember the result of any of the actions, so the garbage collector will occasionally run and destroy the intermediate results. Those intermediate results may pile up in memory as dead references, so the gc might clean them only after they (or something else) cause memory pressure. Now look at sequence, and remember that the Monad m here is Monad IO. The IO monad runs in a strict manner. Consider " do { x <- m; xs <- m'; return (x:xs) }" which could have been written sequence ms = foldr k (return []) ms where k m m' = do x <- m xs <- m' return (x:xs) sequence [a,b,c] = foldr k (return []) [a,b,c] can be expanded via the foldr definition and some syntactic sugar as a `k` (b `k` (c `k` return [])) can be expanded via the `k` definition and some syntactic sugar as do x <- a xs <- (b `k` (c `k` return [])) return (x:xs) So you can see the return value of a is x. Then it goes and computes the rest of the sequence for b and c while holding onto the reference for x. The "return (x:xs)" line is later and also refers to x, which means x is live instead of dead, so the garbage collector will not remove it. The same analysis applies to the values returned by b and c. All the intermediate values are live until the first `k` executes the "return" statement with all the values. This is why the memory usage is maximal. The problem was created by the IO Monad evaluating the b and c strictly once "xs <- (b `k` (c `k` return []))" was encountered. The use of sequence with a different Monad which was lazy instead would have different evaluation order and memory usage. In particular "Control.Monad.ST.Strict" and "Control.Monad.ST.Lazy" have opposite behaviors in this regard. Other things I notice from the code: sequence and sequence_ work with any Monad, which should make one curious about what they are good shorthand for in non-deterministic monads like List. They are also just as handy in Maybe / Either / etc... The Haddock formatted comments, which are where the documentation comes from: http://www.haskell.org/ghc/docs/latest/html/libraries/base/Control-Monad.htm... The GHC specific comment "INLINE" is being used. This reliably turns sequence and sequence_ into shorthand for their foldr definitions during compilation, allowing further improvements such as deforestation when possible. -- Chris