
On Sunday, 1 March 2009 at 21:03, Phil wrote:
Hi,
Thanks for the replies - I haven't had a chance to try out everything suggested yet - but your explanations of transformers nailed it for me.
However, in terms of performance when stacking, I've come across something I'm struggling to explain - I was wondering if anyone could offer up an explanation. I've rewritten my code twice: once with 3 stacked monads, and once with 2 stacked monads and a load of maps. Heuristically I would have thought the 3-stack version would perform as well as, or even better than, the 2-stack solution, but the 2-stack solution is MUCH faster and uses MUCH less memory. They share 90% of the same code and both chain together the same number of computations using replicateM.
Not quite, the triple stack uses replicateM_, which should be a little cheaper.
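(For reference, the difference is just that replicateM_ throws the results away instead of building a list of them; their types in Control.Monad are roughly:

replicateM  :: Monad m => Int -> m a -> m [a]   -- run the action n times, collect the results
replicateM_ :: Monad m => Int -> m a -> m ()    -- run the action n times, discard the results
)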
From profiling I can see that the pure function 'reflect' takes up most of the umph in both cases - which I'd expect. But in the triple stacked version the garbage collector is using up >90% of the time.
I've tried using BangPatterns to reduce memory usage in the triple stack version - doing this I can halve the time it takes, but it still runs at over twice the time of the two-stack version. The BangPatterns were also put in the common code, in the reflect function - so I'd expect both solutions to need them?
One thing that helps much is to use

import Control.Monad.State.Strict

Using the default lazy State monad, you build enormous thunks in the states, which harms the triple stack even more than the double stack. With the strict State monad (and a strict left fold instead of foldr in the double stack), I get:

./v6tripleStrict +RTS -sstderr -K16M
10.450674088955589
    444,069,720 bytes allocated in the heap
    234,808,472 bytes copied during GC (scavenged)
     30,504,688 bytes copied during GC (not scavenged)
     41,074,688 bytes maximum residency (9 sample(s))

        786 collections in generation 0 ( 21.03s)
          9 collections in generation 1 (  2.54s)

        106 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    4.05s  (  4.21s elapsed)
  GC    time   23.57s  ( 24.18s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   27.62s  ( 28.40s elapsed)

  %GC time      85.3%  (85.2% elapsed)

  Alloc rate    109,646,844 bytes per MUT second

  Productivity  14.7% of total user, 14.3% of total elapsed

./v6doubleStrict +RTS -sstderr
10.450674088955592
    388,795,972 bytes allocated in the heap
    177,748,228 bytes copied during GC (scavenged)
     23,953,900 bytes copied during GC (not scavenged)
     44,560,384 bytes maximum residency (9 sample(s))

        710 collections in generation 0 ( 11.62s)
          9 collections in generation 1 (  3.03s)

         94 Mb total memory in use

  INIT  time    0.01s  (  0.00s elapsed)
  MUT   time   13.54s  ( 13.91s elapsed)
  GC    time   14.65s  ( 15.02s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   28.20s  ( 28.93s elapsed)

  %GC time      52.0%  (51.9% elapsed)

  Alloc rate    28,693,429 bytes per MUT second

  Productivity  48.0% of total user, 46.8% of total elapsed

So, yes, the triple stack uses more memory, but not terribly much more. However, it spends much more time gc'ing, but as its MUT time is much less, the total times are not much different. Now, if we give them enough heap space to begin with:

./v6tripleStrict +RTS -sstderr -K16M -H192M
10.450674088955589
    444,077,972 bytes allocated in the heap
     95,828,976 bytes copied during GC (scavenged)
     15,441,936 bytes copied during GC (not scavenged)
     36,147,200 bytes maximum residency (2 sample(s))

          5 collections in generation 0 (  2.16s)
          2 collections in generation 1 (  0.43s)

        185 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    4.20s  (  4.55s elapsed)
  GC    time    2.59s  (  2.74s elapsed)
  EXIT  time    0.00s  (  0.95s elapsed)
  Total time    6.79s  (  7.29s elapsed)

  %GC time      38.1%  (37.6% elapsed)

  Alloc rate    105,732,850 bytes per MUT second

  Productivity  61.9% of total user, 57.6% of total elapsed

./v6doubleStrict +RTS -sstderr -K16M -H192M
10.450674088955592
    388,806,408 bytes allocated in the heap
     46,446,680 bytes copied during GC (scavenged)
         77,852 bytes copied during GC (not scavenged)
        159,744 bytes maximum residency (2 sample(s))

          4 collections in generation 0 (  1.36s)
          2 collections in generation 1 (  0.03s)

        182 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    4.53s  (  5.11s elapsed)
  GC    time    1.39s  (  1.44s elapsed)
  EXIT  time    0.00s  (  0.02s elapsed)
  Total time    5.92s  (  6.55s elapsed)

  %GC time      23.5%  (21.9% elapsed)

  Alloc rate    85,829,229 bytes per MUT second

  Productivity  76.5% of total user, 69.2% of total elapsed

MUCH better. I have no idea why the MUT time for the double stack decreases so much, though.
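(For reference, the change is only to the import in the common code - the State API is unchanged, the strict module just matches the (result, state) pair strictly at each bind. Note it is still not strict in the state value itself, which is why the bang pattern on newStockSum further down matters too:

-- was: import Control.Monad.State
import Control.Monad.State.Strict
import Debug.Trace
)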
Even though both pieces of code are a bit untidy, the triple stacked monad 'feels' nicer to me - everything is encapsulated away and one evaluation in main yields the result. From purely a design perspective I prefer it - but obviously not if it runs like a dog!
Any ideas why the triple stack runs so slow?
It suffers horribly from laziness. One thing is the lazy State monad, another is the implementation of mc.
Thanks again!
Phil
***************** Triple Stack Specific Impl:
type MonteCarloStateT = StateT Double
mc :: MonteCarloStateT BoxMullerQuasiState ()
mc = StateT $ \s -> do
    nextNormal <- generateNormal
    let stochastic = 0.2*1*nextNormal
    let drift = 0.05 - (0.5*(0.2*0.2))*1
    let newStockSum = payOff 100 ( 100 * exp ( drift + stochastic ) ) + s
    return ((),newStockSum)
Don't use a new let on each line, have it all in one let-block. And, please, force the evaluation of newStockSum:

mc :: MonteCarloStateT BoxMullerQuasiState ()
mc = StateT $ \s -> do
    nextNormal <- generateNormal
    let stochastic = 0.2*1*nextNormal
        drift = 0.05 - (0.5*(0.2*0.2))*1
        !newStockSum = payOff 100 ( 100 * exp ( drift + stochastic ) ) + s
    return ((),newStockSum)

Now:

./v8tripleStrict +RTS -sstderr
10.450674088955589
    396,391,172 bytes allocated in the heap
         65,252 bytes copied during GC (scavenged)
         22,272 bytes copied during GC (not scavenged)
         40,960 bytes maximum residency (1 sample(s))

        757 collections in generation 0 (  0.02s)
          1 collections in generation 1 (  0.00s)

          1 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    3.36s  (  3.49s elapsed)
  GC    time    0.02s  (  0.05s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    3.38s  (  3.54s elapsed)

  %GC time       0.6%  (1.3% elapsed)

  Alloc rate    117,973,563 bytes per MUT second

  Productivity  99.4% of total user, 94.9% of total elapsed

w00t!
iterations = 1000000

main :: IO()
main = do
    let sumOfPayOffs = evalState ( evalStateT ( execStateT (do replicateM_ iterations mc) $ 0 ) $ (Nothing,nextHalton) ) $ (1,[3,5])
    let averagePO = sumOfPayOffs / fromIntegral iterations
    let discountPO = averagePO * exp (-0.05)
    print discountPO
Again, don't needlessly multiply the lets.
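A single let-block version of that main might look like this (untested, just the same expressions grouped as suggested):

main :: IO()
main = do
    let sumOfPayOffs = evalState ( evalStateT ( execStateT (replicateM_ iterations mc) 0 ) (Nothing,nextHalton) ) (1,[3,5])
        averagePO  = sumOfPayOffs / fromIntegral iterations
        discountPO = averagePO * exp (-0.05)
    print discountPO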
***************** Double Stack and Map Specific Impl:
iterations = 1000000

main :: IO()
main = do
    let normals = evalState ( evalStateT (do replicateM iterations generateNormal) $ (Nothing,nextHalton) ) $ (1,[3,5])
    let stochastic = map (0.2*1*) normals
    let sde = map ((( 0.05 - (0.5*(0.2*0.2)) )*1)+) stochastic
    let expiryMult = map exp sde
    let expiry = map (100*) expiryMult
    let payoff = map (payOff 100) expiry
    let averagePO = (foldr (+) 0 payoff) / fromIntegral iterations
    let discountPO = averagePO * exp (-0.05)
    print discountPO
Same here, but what is important for performance is to replace the foldr with foldl'.
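A sketch of that change - foldl' comes from Data.List and forces its accumulator at each step, so the sum is not built up as a million-deep chain of (+) thunks the way foldr (+) 0 builds it here:

import Data.List (foldl')   -- at the top of the module

-- in main, replacing the foldr line:
    let averagePO = foldl' (+) 0 payoff / fromIntegral iterations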
***************** Common Code Used By Both Methods:
import Control.Monad.State
import Debug.Trace
-- State Monad for QRNGs - stores current iteration and list of
-- bases to compute
type QuasiRandomState = State (Int,[Int])
nextHalton :: QuasiRandomState [Double]
nextHalton = do
    (n,bases) <- get
    let !nextN = n+1
    put (nextN,bases)
    return $ map (reflect (n,1,0)) bases
type ReflectionThreadState = (Int,Double,Double)
reflect :: ReflectionThreadState -> Int -> Double
reflect (k,f,h) base
    | k <= 0 = h
    | otherwise = reflect (newK,newF,newH) base
    where newK = k `div` base
          newF = f / fromIntegral base
          newH = h + fromIntegral(k `mod` base) * newF
-- So we are defining a state transformer whose state is a 'Maybe Double' and an
-- operating function for the inner monad, of type QuasiRandomState, returning a [Double].
-- We then say that it wraps a QuasiRandomState (State monad) - it must, of course:
-- if we pass it a function that operates on these monads, we must wrap the same
-- type of monad. And finally it returns a Double.
type BoxMullerStateT = StateT (Maybe Double, QuasiRandomState [Double])
type BoxMullerQuasiState = BoxMullerStateT QuasiRandomState
generateNormal :: BoxMullerQuasiState Double
generateNormal = StateT $ \s -> case s of
    (Just d,qrnFunc)  -> return (d,(Nothing,qrnFunc))
    (Nothing,qrnFunc) -> do
        qrnBaseList <- qrnFunc
        let (norm1,norm2) = boxMuller (head qrnBaseList) (head $ tail qrnBaseList)
        return (norm1,(Just norm2,qrnFunc))
boxMuller :: Double -> Double -> (Double,Double)
-- boxMuller rn1 rn2 | trace ( "rn1 " ++ show rn1 ++ " rn2 " ++ show rn2 ) False = undefined
boxMuller rn1 rn2 = (normal1,normal2)
    where r = sqrt ( (-2)*log rn1)
          twoPiRn2 = 2 * pi * rn2
          normal1 = r * cos ( twoPiRn2 )
          normal2 = r * sin ( twoPiRn2 )
payOff :: Double -> Double -> Double
payOff strike stock
    | (stock - strike) > 0 = stock - strike
    | otherwise = 0
Cheers, Daniel