
Auto-replying again :)

On Thursday 25 February 2010 04:26:22, Daniel Fischer wrote:
On Thursday 25 February 2010 03:54:17, Daniel Fischer wrote:
On Thursday 25 February 2010 03:04:02, Chris Pettitt wrote:
Hello Haskell-Beginners,
I'm having a fair amount of difficulty understanding the ST monad. In my basic experimentation I'm finding that I'm getting more heap allocated with STUArray than with a regular Array, which is not what I would expect.
And ordinarily you don't. The problem is that the STUArray suffers very badly from profiling. Compiling without -prof, the STUArray code allocates about half as much as the UArray code and takes ~42% of the time. With profiling, STUArray allocates ~40% more and takes ~50% longer than UArray.
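(The UArray code referred to above isn't quoted in this mail. A minimal sketch of what such a version might look like, assuming it accumulates the sum into a one-element unboxed array with an Int index; the name sumInPlace0 and the use of accumArray are guesses, not Chris's original code:)

-- Hypothetical sketch of a UArray-based counterpart (not the original code):
-- the sum is accumulated into a one-element immutable unboxed array.
import Data.Array.Unboxed (UArray, accumArray, (!))

sumInPlace0 :: [Int] -> Int
sumInPlace0 xs = arr ! 0
  where
    arr :: UArray Int Int
    arr = accumArray (+) 0 (0, 0) [(0, x) | x <- xs]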
Oh, and: in the UArray code you specified the index type as Int; in the STUArray code you didn't, so the index was taken to be Integer, which slows down performance significantly. Changing sumInPlace1 to
sumInPlace1 :: [Int] -> Int
sumInPlace1 xs = (! 0) . runSTUArray $ do
    a <- newArray (0 :: Int, 0) 0
    forM_ xs $ \x -> do
        x' <- readArray a 0
        writeArray a 0 (x' + x)
    return a
makes the STUArray code allocate ~6% of what the UArray code allocates and run in ~8% of the time, because now we get a pretty nice loop involving only unboxed Ints. If we then replace readArray and writeArray with unsafeRead and unsafeWrite, we see that most of the time was being spent on bounds checking: the STUArray loop now becomes really compact and runs a good three times faster still, so it ends up over forty times faster than the UArray version (for which using unsafeAt instead of (!) doesn't make a noticeable difference).
To illustrate: with

sumInPlace1 :: [Int] -> Int
sumInPlace1 xs = (`unsafeAt` 0) . runSTUArray $ do
    a <- newArray (0 :: Int, 0) 0
    forM_ xs $ \x -> do
        x' <- unsafeRead a 0
        unsafeWrite a 0 (x' + x)
    return a

the non-profiling code generates a tight loop with constant allocation (*no* allocation for the enumFromTo; everything runs in registers until the end). Beautiful and fast. The profiling code can't optimise as much; the result is that, for a limit of 100 million, the profiling code takes 85 times as long as the non-profiling code. Ouch! For the UArray code, the profiling version takes only twice as long as the non-profiling version.
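(Both snippets leave their imports implicit; for a self-contained module they need roughly the following. The unchecked variants aren't exported by the public array modules and come from Data.Array.Base; note that unsafeRead, unsafeWrite and unsafeAt take a zero-based Int offset rather than an Ix index, which happens to coincide with the (0, 0) bounds used here.)

-- Imports assumed by the snippets above (left implicit in the mail).
-- The unchecked variants live in Data.Array.Base, which the array
-- package exposes but treats as internal.
import Control.Monad (forM_)
import Data.Array.Base (unsafeAt, unsafeRead, unsafeWrite)
import Data.Array.ST (runSTUArray, newArray, readArray, writeArray)
import Data.Array.Unboxed ((!))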
Two things:

- profiling interacts badly with some optimisations, so what the profiling output says may deviate significantly from what your -O2 compiled production code actually does
- some datatypes suffer more from profiling than others, so what the profiling output says for different choices of datatype may deviate significantly from how your -O2 compiled production code behaves
Normally, -prof doesn't change the behaviour very much, but sometimes it does.
One additional point of confusion for me: when I run either function with +RTS -hc and use hp2ps I get an empty graph. I've seen these tools work quite well before, so I suspect I'm doing something wrong now.
Too little actual heap usage and too short running time. Change the upper limit to 10 million and you should get a graph with actual bands.
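(For instance, a driver along these lines, compiled with -prof and run with +RTS -hc, should give hp2ps something to draw. The original main isn't quoted in the thread, so the assumption that the benchmark sums [1 .. limit] is a guess:)

-- Hypothetical driver: sums [1 .. 10000000], following the suggestion
-- above to raise the upper limit to 10 million so the heap profile
-- shows actual bands.
main :: IO ()
main = print (sumInPlace1 [1 .. 10000000 :: Int])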
Thanks, Chris