
Hello Haskell-Beginners,

I'm having a fair amount of difficulty understanding the ST monad. In my basic experimentation I'm finding that I'm getting more heap allocated with STUArray than with a regular Array, which is not what I would expect.

I've attached a sample program that attempts to sum a list of Ints into an array of length 1, using STUArray and UArray. The idea is based loosely on http://www.haskell.org/haskellwiki/Monad/ST, but using an array instead of an STRef.

I compile using these arguments:

    ghc -fforce-recomp -O2 -prof -auto-all -caf-all --make sttest.hs

When I run the ST-based function (sumInPlace1), I get the following output:

    ./sttest +RTS -sstderr
    1784293664
      240,944,212 bytes allocated in the heap
           35,448 bytes copied during GC
            4,052 bytes maximum residency (1 sample(s))
           12,332 bytes maximum slop
                1 MB total memory in use (0 MB lost due to fragmentation)

      Generation 0:   459 collections,     0 parallel,  0.00s,  0.00s elapsed
      Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

      INIT  time    0.00s  (  0.00s elapsed)
      MUT   time    0.30s  (  0.31s elapsed)
      GC    time    0.00s  (  0.00s elapsed)
      RP    time    0.00s  (  0.00s elapsed)
      PROF  time    0.00s  (  0.00s elapsed)
      EXIT  time    0.00s  (  0.00s elapsed)
      Total time    0.30s  (  0.31s elapsed)

      %GC time       0.8%  (1.2% elapsed)

      Alloc rate    804,910,126 bytes per MUT second

      Productivity  99.0% of total user, 96.5% of total elapsed

While with the Array based function (sumInPlace2), I get the following output:

    ./sttest +RTS -sstderr
    1784293664
      173,627,572 bytes allocated in the heap
           22,596 bytes copied during GC
            3,920 bytes maximum residency (1 sample(s))
           12,464 bytes maximum slop
                1 MB total memory in use (0 MB lost due to fragmentation)

      Generation 0:   333 collections,     0 parallel,  0.00s,  0.00s elapsed
      Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

      INIT  time    0.00s  (  0.00s elapsed)
      MUT   time    0.14s  (  0.15s elapsed)
      GC    time    0.00s  (  0.00s elapsed)
      RP    time    0.00s  (  0.00s elapsed)
      PROF  time    0.00s  (  0.00s elapsed)
      EXIT  time    0.00s  (  0.00s elapsed)
      Total time    0.15s  (  0.15s elapsed)

      %GC time       1.3%  (1.9% elapsed)

      Alloc rate    1,200,569,571 bytes per MUT second

      Productivity  98.3% of total user, 94.7% of total elapsed

One additional point of confusion for me: when I run either function with +RTS -hc and use hp2ps I get an empty graph. I've seen these tools work quite well before, so I suspect I'm doing something wrong now.

Thanks,
Chris
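The attachment itself isn't reproduced in the archive. As a point of reference, here is a minimal sketch of what sttest.hs presumably contains: the function names sumInPlace1 and sumInPlace2 come from the message above, while main, the list bound, and the accumArray-based UArray version are assumptions.

    import Control.Monad (forM_)
    import Control.Monad.ST (ST)
    import Data.Array.ST (STUArray, newArray, readArray, writeArray, runSTUArray)
    import Data.Array.Unboxed (UArray, accumArray, (!))

    -- STUArray version: sum the list into a one-element mutable array.
    -- The original reportedly left the index type unannotated, letting it
    -- default to Integer (see the replies below); it is spelled out here.
    sumInPlace1 :: [Int] -> Int
    sumInPlace1 xs = runSTUArray loop ! 0
      where
        loop :: ST s (STUArray s Integer Int)
        loop = do
            a <- newArray (0, 0) 0
            forM_ xs $ \x -> do
                x' <- readArray a 0
                writeArray a 0 (x' + x)
            return a

    -- Immutable UArray version, with the index type fixed to Int.
    sumInPlace2 :: [Int] -> Int
    sumInPlace2 xs = arr ! 0
      where
        arr :: UArray Int Int
        arr = accumArray (+) 0 (0, 0) [(0, x) | x <- xs]

    main :: IO ()
    main = print (sumInPlace1 [1 .. 50000000])  -- exact list and bound not shown in the message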

On Thursday 25 February 2010 03:04:02, Chris Pettitt wrote:
Hello Haskell-Beginners,
I'm having a fair amount of difficulty understanding the ST monad. In my basic experimentation I'm finding that I'm getting more heap allocated with STUArray than with a regular Array, which is not what I would expect.
And ordinarily you don't. The problem is that the STUArray suffers very badly from profiling. Compiling without -prof, the STUArray code allocates about half as much as the UArray code and takes ~42% of the time. With profiling, STUArray allocates ~40% more and takes ~50% longer than UArray.

Two things:
- profiling interacts badly with some optimisations, so what the profiling output says may deviate significantly from what your -O2 compiled production code actually does
- some datatypes suffer more from profiling than others, so what the profiling output says for different choices of datatype may deviate significantly from how your -O2 compiled production code behaves

Normally, -prof doesn't change the behaviour very much, but sometimes it does.
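One way to see the difference yourself (a suggestion, not something from the thread) is to build and measure both with and without profiling; the flags below are the ones from the original message, with only the profiling flags dropped in the first build:

    ghc -fforce-recomp -O2 --make sttest.hs
    ./sttest +RTS -sstderr

    ghc -fforce-recomp -O2 -prof -auto-all -caf-all --make sttest.hs
    ./sttest +RTS -sstderr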
One additional point of confusion for me: when I run either function with +RTS -hc and use hp2ps I get an empty graph. I've seen these tools work quite well before, so I suspect I'm doing something wrong now.
Too little actual heap usage and too short running time. Change the upper limit to 10 million and you should get a graph with actual bands.
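Concretely, that might look like the following. The driver is a guess, since main isn't shown in the message; only the 10-million bound and the +RTS -hc / hp2ps steps come from the thread itself.

    -- Hypothetical driver: an input large enough for the heap profile to show bands.
    main :: IO ()
    main = print (sumInPlace1 [1 .. 10000000])

    -- Rebuild with -prof (as in the original ghc command), then:
    --   ./sttest +RTS -hc -sstderr
    --   hp2ps -c sttest.hp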
Thanks, Chris

On Thursday 25 February 2010 03:54:17, Daniel Fischer wrote:
On Thursday 25 February 2010 03:04:02, Chris Pettitt wrote:
Hello Haskell-Beginners,
I'm having a fair amount of difficulty understanding the ST monad. In my basic experimentation I'm finding that I'm getting more heap allocated with STUArray than with a regular Array, which is not what I would expect.
And ordinarily you don't. The problem is that the STUArray suffers very badly from profiling. Compiling without -prof, the STUArray code allocates about half as much as the UArray code and takes ~42% of the time. With profiling, STUArray allocates ~40% more and takes ~50% longer than UArray.
Oh, and: in the UArray code you specified the index type as Int, but in the STUArray code you didn't, so it was taken to be Integer, which slows down performance significantly. Changing sumInPlace1 to

    sumInPlace1 :: [Int] -> Int
    sumInPlace1 xs = (! 0) . runSTUArray $ do
        a <- newArray (0 :: Int, 0) 0
        forM_ xs $ \x -> do
            x' <- readArray a 0
            writeArray a 0 (x' + x)
        return a

makes the STUArray code allocate ~6% of what the UArray code allocates and run in ~8% of the time, because now we get a pretty nice loop involving only unboxed Ints.

If we then replace readArray and writeArray with unsafeRead and unsafeWrite, we see that most of the time is spent on bounds checking: the STUArray loop becomes really compact and runs a good three times faster, so it is now over forty times faster than the UArray (for which using unsafeAt instead of (!) doesn't make a noticeable difference).
Two things:
- profiling interacts badly with some optimisations, so what the profiling output says may deviate significantly from what your -O2 compiled production code actually does
- some datatypes suffer more from profiling than others, so what the profiling output says for different choices of datatype may deviate significantly from how your -O2 compiled production code behaves
Normally, -prof doesn't change the behaviour very much, but sometimes it does.
One additional point of confusion for me: when I run either function with +RTS -hc and use hp2ps I get an empty graph. I've seen these tools work quite well before, so I suspect I'm doing something wrong now.
Too little actual heap usage and too short running time. Change the upper limit to 10 million and you should get a graph with actual bands.
Thanks, Chris

Hi Daniel,
Thank you for your analysis. I dropped -prof from the compile and
explicitly specified the index type as Int - both yielded huge
improvements. These are great tips for me to keep in mind as I
continue to get deeper into Haskell.
Thanks,
Chris
On Wed, Feb 24, 2010 at 7:26 PM, Daniel Fischer wrote:
On Thursday 25 February 2010 03:54:17, Daniel Fischer wrote:
On Thursday 25 February 2010 03:04:02, Chris Pettitt wrote:
Hello Haskell-Beginners,
I'm having a fair amount of difficulty understanding the ST monad. In my basic experimentation I'm finding that I'm getting more heap allocated with STUArray than with a regular Array, which is not what I would expect.
And ordinarily you don't. The problem is that the STUArray suffers very badly from profiling. Compiling without -prof, the STUArray code allocates about half as much as the UArray code and takes ~42% of the time. With profiling, STUArray allocates ~40% more and takes ~50% longer than UArray.
Oh, and: In the UArray code, you specified the index type as Int, in the STUArray code, you didn't specify it, so it was taken to be Integer, which slows down performance significantly, changing sumInPlace1 to
    sumInPlace1 :: [Int] -> Int
    sumInPlace1 xs = (! 0) . runSTUArray $ do
        a <- newArray (0 :: Int, 0) 0
        forM_ xs $ \x -> do
            x' <- readArray a 0
            writeArray a 0 (x' + x)
        return a
makes the STUArray code allocate ~6% of what the UArray code allocates and run in ~8% of the time, because now we get a pretty nice loop involving only unboxed Ints. If we then replace readArray and writeArray with unsafeRead and unsafeWrite, we see that the most time is spent on bounds checking, because now the STUArray loop becomes really compact and runs a good three times faster, so it's now over forty times faster than the UArray (for which using unsafeAt instead of (!) doesn't make a noticeable difference).
Two things:
- profiling interacts badly with some optimisations, so what the profiling output says may deviate significantly from what your -O2 compiled production code actually does
- some datatypes suffer more from profiling than others, so what the profiling output says for different choices of datatype may deviate significantly from how your -O2 compiled production code behaves
Normally, -prof doesn't change the behaviour very much, but sometimes it does.
One additional point of confusion for me: when I run either function with +RTS -hc and use hp2ps I get an empty graph. I've seen these tools work quite well before, so I suspect I'm doing something wrong now.
Too little actual heap usage and too short running time. Change the upper limit to 10 million and you should get a graph with actual bands.
Thanks, Chris

Auto-replying again :)

On Thursday 25 February 2010 04:26:22, Daniel Fischer wrote:
On Thursday 25 February 2010 03:54:17, Daniel Fischer wrote:
On Thursday 25 February 2010 03:04:02, Chris Pettitt wrote:
Hello Haskell-Beginners,
I'm having a fair amount of difficulty understanding the ST monad. In my basic experimentation I'm finding that I'm getting more heap allocated with STUArray than with a regular Array, which is not what I would expect.
And ordinarily you don't. The problem is that the STUArray suffers very badly from profiling. Compiling without -prof, the STUArray code allocates about half as much as the UArray code and takes ~42% of the time. With profiling, STUArray allocates ~40% more and takes ~50% longer than UArray.
Oh, and: In the UArray code, you specified the index type as Int, in the STUArray code, you didn't specify it, so it was taken to be Integer, which slows down performance significantly, changing sumInPlace1 to
    sumInPlace1 :: [Int] -> Int
    sumInPlace1 xs = (! 0) . runSTUArray $ do
        a <- newArray (0 :: Int, 0) 0
        forM_ xs $ \x -> do
            x' <- readArray a 0
            writeArray a 0 (x' + x)
        return a
makes the STUArray code allocate ~6% of what the UArray code allocates and run in ~8% of the time, because now we get a pretty nice loop involving only unboxed Ints. If we then replace readArray and writeArray with unsafeRead and unsafeWrite, we see that the most time is spent on bounds checking, because now the STUArray loop becomes really compact and runs a good three times faster, so it's now over forty times faster than the UArray (for which using unsafeAt instead of (!) doesn't make a noticeable difference).
To illustrate: with

    sumInPlace1 :: [Int] -> Int
    sumInPlace1 xs = (`unsafeAt` 0) . runSTUArray $ do
        a <- newArray (0 :: Int, 0) 0
        forM_ xs $ \x -> do
            x' <- unsafeRead a 0
            unsafeWrite a 0 (x' + x)
        return a

the non-profiling code generates a tight loop with constant allocation (*no* allocation for the enumFromTo; everything runs in registers until the end). Beautiful and fast. The profiling code can't optimise as much, with the result that (for a limit of 100 million) it takes 85 times as long as the non-profiling code. Ouch! For the UArray code, the profiling version takes only twice as long as the non-profiling version.
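To give a feel for what that tight loop amounts to, here is a hand-written sketch of roughly the shape the optimised code takes once the enumFromTo is fused away. This is only an illustration, not GHC's actual output, and sumLoop is a made-up name.

    {-# LANGUAGE BangPatterns #-}
    import Control.Monad.ST (ST)
    import Data.Array.Base (unsafeRead, unsafeWrite)
    import Data.Array.ST (STUArray)

    -- Illustration only: after fusion, the loop over [1 .. limit] is roughly a
    -- strict counter plus one unboxed read/modify/write of slot 0 per step,
    -- with no list ever being built.
    sumLoop :: STUArray s Int Int -> Int -> ST s ()
    sumLoop a limit = go 1
      where
        go !i
          | i > limit = return ()
          | otherwise = do
              x' <- unsafeRead a 0
              unsafeWrite a 0 (x' + i)
              go (i + 1)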
Two things:
- profiling interacts badly with some optimisations, so what the profiling output says may deviate significantly from what your -O2 compiled production code actually does
- some datatypes suffer more from profiling than others, so what the profiling output says for different choices of datatype may deviate significantly from how your -O2 compiled production code behaves
Normally, -prof doesn't change the behaviour very much, but sometimes it does.
One additional point of confusion for me: when I run either function with +RTS -hc and use hp2ps I get an empty graph. I've seen these tools work quite well before, so I suspect I'm doing something wrong now.
Too little actual heap usage and too short running time. Change the upper limit to 10 million and you should get a graph with actual bands.
Thanks, Chris