
Auto-replying again :)

On Thursday 25 February 2010 04:26:22, Daniel Fischer wrote:
On Thursday 25 February 2010 03:54:17, Daniel Fischer wrote:
On Thursday 25 February 2010 03:04:02, Chris Pettitt wrote:
Hello Haskell-Beginners,
I'm having a fair amount of difficulty understanding the ST monad. In my basic experimentation I'm finding that I'm getting more heap allocated with STUArray than with a regular Array, which is not what I would expect.
And ordinarily you don't. The problem is that the STUArray suffers very badly from profiling. Compiling without -prof, the STUArray code allocates about half as much as the UArray code and takes ~42% of the time. With profiling, STUArray allocates ~40% more and takes ~50% longer than UArray.
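(The UArray code referred to above isn't quoted in this mail. A minimal sketch of what such a version might look like, assuming it accumulates the sum into a one-element unboxed array with an Int index; the name sumInPlace0 and the use of accumArray are guesses, not Chris's original code:)

-- Hypothetical sketch of a UArray-based counterpart (not the original code):
-- the sum is accumulated into a one-element immutable unboxed array.
import Data.Array.Unboxed (UArray, accumArray, (!))

sumInPlace0 :: [Int] -> Int
sumInPlace0 xs = arr ! 0
  where
    arr :: UArray Int Int
    arr = accumArray (+) 0 (0, 0) [(0, x) | x <- xs]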
Oh, and: in the UArray code you specified the index type as Int; in the STUArray code you didn't, so the index was taken to be Integer, which slows down performance significantly. Changing sumInPlace1 to
sumInPlace1 :: [Int] -> Int
sumInPlace1 xs = (! 0) . runSTUArray $ do
    a <- newArray (0 :: Int, 0) 0
    forM_ xs $ \x -> do
        x' <- readArray a 0
        writeArray a 0 (x' + x)
    return a
makes the STUArray code allocate ~6% of what the UArray code allocates and run in ~8% of the time, because now we get a pretty nice loop involving only unboxed Ints. If we then replace readArray and writeArray with unsafeRead and unsafeWrite, we see that most of the time was being spent on bounds checking: the STUArray loop now becomes really compact and runs a good three times faster still, so it ends up over forty times faster than the UArray version (for which using unsafeAt instead of (!) doesn't make a noticeable difference).
To illustrate: with

sumInPlace1 :: [Int] -> Int
sumInPlace1 xs = (`unsafeAt` 0) . runSTUArray $ do
    a <- newArray (0 :: Int, 0) 0
    forM_ xs $ \x -> do
        x' <- unsafeRead a 0
        unsafeWrite a 0 (x' + x)
    return a

the non-profiling code generates a tight loop with constant allocation (*no* allocation for the enumFromTo; everything runs in registers until the end). Beautiful and fast. The profiling code can't optimise as much; the result is that, for a limit of 100 million, the profiling code takes 85 times as long as the non-profiling code. Ouch! For the UArray code, the profiling version takes only twice as long as the non-profiling version.
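(Both snippets leave their imports implicit; for a self-contained module they need roughly the following. The unchecked variants aren't exported by the public array modules and come from Data.Array.Base; note that unsafeRead, unsafeWrite and unsafeAt take a zero-based Int offset rather than an Ix index, which happens to coincide with the (0, 0) bounds used here.)

-- Imports assumed by the snippets above (left implicit in the mail).
-- The unchecked variants live in Data.Array.Base, which the array
-- package exposes but treats as internal.
import Control.Monad (forM_)
import Data.Array.Base (unsafeAt, unsafeRead, unsafeWrite)
import Data.Array.ST (runSTUArray, newArray, readArray, writeArray)
import Data.Array.Unboxed ((!))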
Two things:

- profiling interacts badly with some optimisations, so what the profiling output says may deviate significantly from what your -O2 compiled production code actually does
- some datatypes suffer more from profiling than others, so what the profiling output says for different choices of datatype may deviate significantly from how your -O2 compiled production code behaves
Normally, -prof doesn't change the behaviour very much, but sometimes it does.
One additional point of confusion for me: when I run either function with +RTS -hc and use hp2ps I get an empty graph. I've seen these tools work quite well before, so I suspect I'm doing something wrong now.
Too little actual heap usage and too short running time. Change the upper limit to 10 million and you should get a graph with actual bands.
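(For instance, a driver along these lines, compiled with -prof and run with +RTS -hc, should give hp2ps something to draw. The original main isn't quoted in the thread, so the assumption that the benchmark sums [1 .. limit] is a guess:)

-- Hypothetical driver: sums [1 .. 10000000], following the suggestion
-- above to raise the upper limit to 10 million so the heap profile
-- shows actual bands.
main :: IO ()
main = print (sumInPlace1 [1 .. 10000000 :: Int])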
Thanks, Chris