
In part to help solidify my own understanding and usage, I wrote up the following comparison of processing an input file. Andrew Coppin originally posed the issue concerning strictness imposed by using the ST monad for processing an input file. This literate example compares processing a file using:

1. the ST monad
2. the ST monad with Luke Palmer's suggested laziness
3. the State monad
4. a direct Iteratee (from John Millikin's Enumerator package)
5. the same Iteratee in Monad form
6. another slight variation of the Iteratee in Monad form

First, let's get the basics taken care of:

> import System.IO
> import System.Environment
> import Data.Word
> import Data.Bits
> import qualified Data.ByteString as B
> import Control.Applicative ( (<$>) )
> import Control.Monad.Trans.Class (lift)
> import Control.Monad.IO.Class
> import Control.Monad.ST.Lazy
> import Data.STRef.Lazy
> import Control.Monad.Trans.State.Lazy
> import qualified Data.Enumerator as E
> import Data.Enumerator ( ($$) )
> import qualified Data.Enumerator.Binary as EB
> import qualified Data.Enumerator.List as EL

This example is intended to show the effects of lazy or strict processing of a file, so an input file is needed.

> inp = "input.example"

This input file can contain whatever you'd like, but for my testing I simply created a 5MB file of zeros via:

  $ dd if=/dev/zero of=input.example count=10000
  $ ls -sh input.example
  4.9M input.example

The output file will use the following base name, with the number of the processing mode appended.

> oup = "output.example"

The stats output of ghc will be used to compare the different processing modes, so only one processing mode is performed each time the application is run. The desired mode is given as a command-line parameter, defaulting to the first mode.

> main = do
>   tna <- getArgs
>   let tn = read $ head $ tna ++ ["1"]
>   case tn of
>     6 -> testE 6 transform6
>     5 -> testE 5 transform5
>     4 -> testE 4 transform4
>     3 -> testT 3 transform3
>     2 -> testT 2 transform2
>     _ -> testT 1 transform1

To build and run this example (assuming this literate source is saved as fproc.lhs):

  $ ghc -o fproc --make fproc.lhs && for N in $(seq 1 6) ; do time ./fproc $N +RTS -t -RTS ; done

That's all the basic setup out of the way. The actual processing of the file is irrelevant other than needing to remember previous input to process the current input. In my example each byte is usually combined with the previous byte to determine the output byte. In the ST and State monad forms, the previous byte value is stored in the state portion of the monad. The ST form is my interpretation of Andrew's original intent.

> transform1 xs = runST (newSTRef 0 >>= work xs)
>   where work [] _ = return []
>         work (e:es) s = do
>           n <- readSTRef s
>           writeSTRef s $ shiftR e 4
>           let r = if e < 32 then e else n+e
>           (r :) <$> work es s

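For reference, the byte rule that all six variants implement can also be written as a plain, monad-free recursion. This is only an illustrative sketch of what the threaded state carries; the name transformPure is mine and isn't used by any of the benchmarked modes:

> -- Illustrative sketch only; not one of the benchmarked modes.
> transformPure :: [Word8] -> [Word8]
> transformPure = go 0
>   where go _    []     = []
>         go prev (e:es) = (if e < 32 then e else prev + e) : go (shiftR e 4) es
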
To run this with standardized file processing, ByteString -> Word8 conversion, and output, main uses the testT wrapper. Hopefully all the pack and unpack operations are fusing and I haven't skewed the results by introducing strictness at this level.

> testT n t = let oun = oup ++ show n
>                 op  = B.pack . t . B.unpack
>             in print n >> op <$> B.readFile inp >>= B.writeFile oun

My output from this is:
./fproc 1 +RTS -t
1
<

> transform2 xs = runST (newSTRef 0 >>= work xs)
>   where work [] _ = return []
>         work (e:es) s = do
>           n <- readSTRef s
>           -- note the strict ($!) application when updating the STRef
>           writeSTRef s $! shiftR e 4
>           let r = if e < 32 then e else n+e
>           fmap (r :) $ work es s

This yields nearly identical results (actually slightly worse, but that may be within the measuring variance):
./fproc 2 +RTS -t
2
<
The State monad version keeps the previous byte in the monad's state rather than in an STRef:

> transform3 xs = evalState (work xs) 0
>   where work [] = return []
>         work (e:es) = do
>           n <- get
>           put $ shiftR e 4
>           let r = if e < 32 then e else n+e
>           (r :) <$> work es

./fproc 3 +RTS -t
3
<

> transform4 :: MonadIO m => Handle -> E.Iteratee B.ByteString m ()
> transform4 h = E.continue $ work 0
>   where work n E.EOF = E.yield () E.EOF
>         work n (E.Chunks []) = E.continue $ work n
>         work n (E.Chunks (e:es)) =
>           let op a b = (shiftR b 4, if b < 32 then b else a+b)
>               (m, r) = B.mapAccumL op n e
>           in do
>             liftIO $ B.hPut h r
>             work m $ E.Chunks es

The state element is now simply the first argument to the recursive inner work function. Each chunk arrives as a ByteString, so I use mapAccumL to process each ByteString as it's provided (a small worked example of this follows testE below). The testE wrapper is the enumerator equivalent of the testT wrapper used with the ST and State monads.

> testE n t = do
>   print n
>   h <- openFile (oup ++ show n) WriteMode
>   E.run_ (EB.enumFile inp $$ t h)

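To make the accumulator threading concrete, here is a tiny worked example of mapAccumL using the same op; the name mapAccumLExample and the sample bytes are only for this write-up and are not part of the benchmark:

> -- Illustrative only; not part of the benchmark.
> mapAccumLExample :: (Word8, B.ByteString)
> mapAccumLExample = B.mapAccumL op 0 (B.pack [40, 50, 60])
>   where op a b = (shiftR b 4, if b < 32 then b else a + b)

This evaluates to (3, B.pack [40, 52, 63]): 40 passes through with the initial 0 added, 50 picks up 2 (shiftR 40 4), 60 picks up 3 (shiftR 50 4), and the final accumulator (3) is what would carry over to the next chunk.
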
The results of this Iteratee approach:
./fproc 4 +RTS -t
4
<

> transform5 :: MonadIO m => Handle -> E.Iteratee B.ByteString m ()
> transform5 h = work 0
>   where work n = do
>           e <- EL.head
>           case e of
>             Nothing -> return ()
>             Just e' -> next n e'
>         next n e = do
>           (m,r) <- return $ B.mapAccumL op n e
>           liftIO $ B.hPut h r
>           work m
>         op a b = (shiftR b 4, if b < 32 then b else a+b)

As would be expected, this has nearly identical performance to the non-monadic version:
./fproc 5 +RTS -t
5
<

> transform6 :: MonadIO m => Handle -> E.Iteratee B.ByteString m ()
> transform6 h = work 0
>   where work n = EL.head >>= maybe (return ()) (next n)
>         next n e = do
>           let (m,r) = B.mapAccumL op n e
>           liftIO $ B.hPut h r
>           work m
>         op a b = (shiftR b 4, if b < 32 then b else a+b)

That's not so bad! The learning curve of Iteratees is non-trivial, but the results are pretty readable, IMHO.

And here's verification that the output is reasonable:

  $ ls -1sh *.example*
  4.9M input.example
  4.9M output.example1
  4.9M output.example2
  4.9M output.example3
  4.9M output.example4
  4.9M output.example5
  4.9M output.example6

Hopefully this has been a useful comparison of Iteratee techniques relative to more conventional monads, and the performance results are good support for the usefulness of Iteratees. As always, the greatest benefit was probably to myself in actually implementing and writing this up, but if you read this far I hope you found it readable and useful.

-Kevin Quick

-- 
-KQ