
I have *no idea* if this is an appropriate place to put this, but here goes... I have a (presumably) common problem. I have a nice Haskell program. It's compute-bounded, and I'd like to make it go faster. To that end, I've been trying to optimise it. Without going into great detail, the program basically revolves around processing a list and dumping the results into a file. Notable things about this structure: it's *big*, and I only need to *map* over it. Anyway, here's what I did... I wrote a simplified, cut down version of the program so I would have a fixed test case. I compiled it with GHC 6.6 and ran it. It takes 40 seconds to complete. Next, I read the GHC manual. Apparently there's a command-line switch "-O2" which makes GHC try harder to optimise the program. (I had stupidly assumed it always tries to make good code...) The result? The program now does the same job in 30 seconds. That's quite a big speedup just for typing 3 characters! (Gee... I wonder what it does differently? I did notice that the compilation time went from 5 seconds up to 20 seconds. And the *.exe got smaller.) Next I played with profiling. If I'm reading this correctly, the "-s" profile is telling me that my program spends 72% CPU time doing GC. o_O Er... that's really suboptimal. Also - and this is really weird - the "-p" profile tells me that 82% of the program's time is spent in the quant8 function. The source code for this function is precisely quant8 :: Double -> Word8 quant8 x = floor $ x * 0xFF Why that should take 82% CPU when the program is also doing heavy-metal complex number arithmetic and trig functions I have no idea! (Since I took those measurements, I have come up with a theory. When I ask for all these computations to be done... Haskell doesn't actually *do* them. It just stores them up until "needed". It just occurred to me that quant8 is the *last* thing I ask it to do before the code hits a strict ByteString... So is the profiler just misappropriating all the work to quant8 when it's really elsewhere?) OK, so to recap: currently my program takes 30 seconds to run. Next, I changed the 2D lists (i.e., [[x]]) into 1D lists (that is, [x]). The new runtime? 30 seconds. Right, that seems to be no-op. Next I replaced lists with immutable boxed arrays. New runtime: 33 seconds. Ooops. Now, I would have thought an array would make the GC's life *much* easier by replacing tens of thousands of cons cells with one nice big array. But, apparently, no. The profiler output says GC is almost identical. Hmm... weird! Next, I switched to IO-mutable boxed arrays. (With in-place updates.) New runtime: 35 seconds. Da hell...? The more I optimise this program, the slower it goes! :'( Finally, in a fit of desperation, I changed it to *unboxed* arrays. For some reason, you can't have an unboxed array of complex numbers. So I had to make *two* unboxed arrays of doubles, and manually split/join myself. Similar problems happened elsewhere too. Suffice it to say, my program is now vomit-inducingly ugly. It doesn't even *function* correctly any more! (There's a byte alignment issue somewhere... I can't track it down.) New runtime: 15 seconds. I say again: 15 seconds. Changing to an unboxed array has *doubled* the speed. Not only that, the GC report now shows GC as using 7% CPU time, and peak RAM usage has been reduced from 25 MB to 4 MB. The memory graph is now utterly flat, whereas even with IO-mutable arrays with in-place update, it was still fluctuating all over the place. I will mention that the ByteString module was the only thing I would find in all of Haskell for doing binary I/O. The final (fast) program uses an IOUArray Int Word8 to do the binary I/O instead. (What's the difference BTW? Is there one? Other than the name anyway.) Perhaps part of the speed was in avoiding an array type conversion? I don't know. Certainly this is the messiest part, and the part where I suspect my bug is hiding. Finally, in the fast version, quant8 uses way less CPU (more like 50%), and the complex number math uses much more. Really goes look like the profiler being counterintuitive to me... Personally, I find it quite worrying that I had to manually mutilate my program to unbox everything when the compiler is supposed to be able to do such things automatically. Does GHC "know" about the array types? Or are they just a bunch of FFI calls that some 3rd party wrote? When I talk to other people about Haskell, they all say "immutable sucks; that must be SO slow!" and I tell them "yeah, but the compiler can strip almost all of it out automatically". Well, my experience with this program suggests very differently. I do wonder if the program went faster just because unboxed arrays are strict. I couldn't think of any way to make a boxed array be strict. (I tried for ages!) Next task: Attempt to make the program use both cores on my dual-core monster machine for even more speed... A short program synopsys: module Main where init_frame :: [[Complex Double]] next_frame :: [[Complex Double]] -> [[Complex Double]] process :: [[Complex Double]] -> [[Colour]] save_file :: FilePath -> [[Colour]] -> IO () main = do work 0 init_frame work n xss = do save_file ("Frame" ++ show n) (process xss) if n < 500 then work (n+1) (next_frame xss) else return ()

On Fri, Apr 20, 2007 at 07:01:32PM +0100, Andrew Coppin wrote:
I have *no idea* if this is an appropriate place to put this, but here goes...
Sure.
Next, I read the GHC manual. Apparently there's a command-line switch "-O2" which makes GHC try harder to optimise the program. (I had stupidly assumed it always tries to make good code...) The result? The program now does the same job in 30 seconds. That's quite a big speedup just for typing 3 characters! (Gee... I wonder what it does differently? I did notice that the compilation time went from 5 seconds up to 20 seconds. And the *.exe got smaller.)
This is standard behavior for compilers.
Right, that seems to be no-op. Next I replaced lists with immutable boxed arrays. New runtime: 33 seconds.
Ooops. Now, I would have thought an array would make the GC's life *much* easier by replacing tens of thousands of cons cells with one nice big array. But, apparently, no. The profiler output says GC is almost identical. Hmm... weird!
The compiler is pretty smart about lists. As long as you're doing simple things with them (e.g. mapping), odds are pretty good there won't be any intermediate cons cells generated (this is called deforestation). In contrast, no such optimization is available (yet) for arrays. Every time you create a new array, it needs to actually be created.
Finally, in a fit of desperation, I changed it to *unboxed* arrays. For some reason, you can't have an unboxed array of complex numbers. So I had to make *two* unboxed arrays of doubles, and manually split/join myself. Similar problems happened elsewhere too. Suffice it to say, my program is now vomit-inducingly ugly. It doesn't even *function* correctly any more! (There's a byte alignment issue somewhere... I can't track it down.) New runtime: 15 seconds.
Indeed, unboxed arrays are much nicer, but the whole Array batch of the standard library is pretty useless, in my experience. You can roll your own relatively easily using ForeignPtr, similar to Data.ByteString, but that's a bit of a pain.
I will mention that the ByteString module was the only thing I would find in all of Haskell for doing binary I/O. The final (fast) program uses an IOUArray Int Word8 to do the binary I/O instead. (What's the difference BTW? Is there one? Other than the name anyway.) Perhaps part of the speed was in avoiding an array type conversion? I don't know. Certainly this is the messiest part, and the part where I suspect my bug is hiding.
Binary IO is definitely an issue with Haskell.
Finally, in the fast version, quant8 uses way less CPU (more like 50%), and the complex number math uses much more. Really goes look like the profiler being counterintuitive to me...
Profilers are always counterintuitive in any language. (Okay, not quite always, but pretty much whenever you actually need a profiler.)
A short program synopsys:
module Main where
init_frame :: [[Complex Double]]
next_frame :: [[Complex Double]] -> [[Complex Double]]
process :: [[Complex Double]] -> [[Colour]]
save_file :: FilePath -> [[Colour]] -> IO ()
main = do work 0 init_frame
work n xss = do save_file ("Frame" ++ show n) (process xss) if n < 500 then work (n+1) (next_frame xss) else return ()
If you want actual help, you'll have to supply something more than this. You could cut out the actual "work", and just post some code that does the framework. My guess (and you should have tested this) is that the runtime won't decrease much if you cut out the actual work (or replace it by something simple, like taking a exponential of everything), and the result would be that we'd be able to do more than commiserate. What happens if you cut the save_file from the process? Far better than profiling is checking by hand the actual removal of code from the program, as profiling has a tendancy to disable optimizations. You did run your timings with profiling disabled? You've got three potentially expensive functions here (process, save_file and next_frame), and it'd certainly help to have some idea which is the bottleneck. Due to lazy evaluation (less of an issue with unboxed arrays), you do need to be careful when removing bits of code, as this could cause other bits of code to not actually be run (e.g. if you never print the output, it might be that the next_frame is never evaluated, so you might need to print just the last frame, or compute the sum of each frame and print that). -- David Roundy Department of Physics Oregon State University

Hello Andrew, Friday, April 20, 2007, 10:01:32 PM, you wrote:
I have *no idea* if this is an appropriate place to put this, but here goes...
http://haskell.org/haskellwiki/Library/AltBinary http://haskell.org/haskellwiki/Modern_array_libraries last article contains information about slowness of GC with boxed arrays. also, unboxed arrays of Complex are made via ForeignArray (starting from 6.6 it has the same speed as UArray) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
participants (3)
-
Andrew Coppin
-
Bulat Ziganshin
-
David Roundy