What I learned from my first serious attempt at low-level Haskell programming

As a learning exercise, I rewrote and re-optimized Data.Binary.Builder yesterday.

1. Intuition is NOT your friend. Most of the obvious pessimizations I made were actually wins, and vice versa.

2. Parameters are very expensive. Our type of functions that build (ignoring CPS for the time being) was MBA# -> Int# -> [ByteString], where the Int# is the current write pointer. Adding an extra Int# to cache the size of the array (rather than calling sMBA# each time) slowed the code down ~2x. Conversely, moving the write pointer into the byte array (storing it in bytes 0#, 1#, 2#, and 3#) sped the code up by 4x.

3. MBA# is just as fast as Addr#, and garbage collected to boot.

4. You can't keep track of which version of the code is which, what is a regression, and what is an enhancement. Don't even try. Next time I try something like this I will make as much use of darcs as possible.

5. State# threads clog the optimizer quite effectively. Replacing st(n-1)# with realWorld# everywhere I could count on data dependencies to do the same job doubled performance.

6. The inliner is a bit too greedy. Removing the slow-path code from singleton doesn't help because popSingleton is only used once; but if I explicitly {-# NOINLINE popSingleton #-}, the code for singleton itself becomes much smaller, and inlinable (15% perf gain). Plus the new singleton doesn't allocate memory, so I can use even MORE realWorld#s.

And probably a few more I forgot about because of #4. The code is online at http://members.cox.net/stefanor/hackedbuilder if anyone cares (but see #4).

Some parting numbers (Builder7 is my current version, Builder1 is the unmodified rossp/kolmodin builder):

stefan@stefans:~/hackedbuilder$ ghc -v0 --make -O2 -fforce-recomp -DBUILDER=Builder7 Bench.hs ; time ./Bench 2 10000000 330000000

real    0m5.580s
user    0m5.540s
sys     0m0.032s

stefan@stefans:~/hackedbuilder$ ghc -v0 --make -O2 -fforce-recomp -DBUILDER=Builder7 -DUNROLL Bench.hs ; time ./Bench 2 10000000 330000000

real    0m2.948s
user    0m2.908s
sys     0m0.036s

stefan@stefans:~/hackedbuilder$ ghc -v0 --make -O2 -fforce-recomp -DBUILDER=Builder1 Bench.hs ; time ./Bench 2 10000000 330000000

real    0m55.708s
user    0m54.695s
sys     0m0.208s

stefan@stefans:~/hackedbuilder$ ghc -v0 --make -O2 -fforce-recomp -DBUILDER=Builder1 -DUNROLL Bench.hs ; time ./Bench 2 10000000 330000000

real    0m25.888s
user    0m25.546s
sys     0m0.156s

stefan@stefans:~/hackedbuilder$ gcc -O2 -march=pentium4 CBuilder.c -o CBuilder ; time ./CBuilder 10000000

real    0m0.861s
user    0m0.860s
sys     0m0.000s

Stefan
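[To make point 2 concrete, here is a minimal sketch of the pointer-in-the-buffer trick. It is not Stefan's hackedbuilder code: the buffer size, slot layout, and the name appendInt are made up for illustration, and the import locations assume a recent GHC. The write index lives in slot 0 of the MutableByteArray# itself, so the worker's only parameters are the buffer and the state token.]

{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Main where

import GHC.Exts
import GHC.IO (IO(..))   -- IO constructor; this location assumes a recent GHC

-- Append one Int to the buffer. The write index (counted in Int-sized
-- slots) lives in slot 0 of the buffer itself, so no extra Int#
-- parameter is passed to the worker.
appendInt :: MutableByteArray# RealWorld -> Int# -> State# RealWorld -> State# RealWorld
appendInt mba x s0 =
  case readIntArray# mba 0# s0 of
    (# s1, i #) ->
      case writeIntArray# mba i x s1 of
        s2 -> writeIntArray# mba 0# (i +# 1#) s2

main :: IO ()
main = do
  n <- IO $ \s0 ->
    case newByteArray# 256# s0 of            -- 256 bytes is plenty for the demo
      (# s1, mba #) ->
        case writeIntArray# mba 0# 1# s1 of  -- index starts at 1, just past its own slot
          s2 ->
            case appendInt mba 42# s2 of
              s3 ->
                case appendInt mba 7# s3 of
                  s4 ->
                    case readIntArray# mba 0# s4 of
                      (# s5, i #) -> (# s5, I# i #)
  print n   -- 3: the index slot plus two appended Ints

[The payoff Stefan reports is that the worker passes one argument fewer on every call; the index is re-read from memory instead, which his measurements say was the cheaper trade on his Pentium 4.]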

| 5. State# threads clog the optimizer quite effectively. Replacing
| st(n-1)# with realWorld# everywhere I could count on data
| dependencies to do the same job doubled performance.

The idea is that the optimiser should allow you to write at a high level, and do the bookkeeping for you. When it doesn't, I like to know, and preferably fix it. If you had a moment to boil out a small, reproducible example of this kind of optimisation failure (with as few dependencies as possible), then I'll look to see if the optimiser can be cleverer.

| 6. The inliner is a bit too greedy. Removing the slow-path code from
| singleton doesn't help because popSingleton is only used once; but
| if I explicitly {-# NOINLINE popSingleton #-}, the code for
| singleton itself becomes much smaller, and inlinable (15% perf
| gain). Plus the new singleton doesn't allocate memory, so I can
| use even MORE realWorld#s.

That's a hard one! Inlining functions that are called just once is usually a huge win. I don't know how to spot what you did in an automated way.

Thanks,
Simon
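[For readers wondering what the trick in point 5 looks like, here is a hedged sketch with made-up function names, not Stefan's code. The second version feeds realWorld# to primops whose ordering is already forced by an ordinary value dependency; this is unsafe in general, which is exactly why Stefan applied it only where data dependencies did the same job.]

{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Main where

import GHC.Exts
import GHC.IO (IO(..))   -- IO constructor; this location assumes a recent GHC

-- Threaded version: the token chain s0 -> s1 -> s2 orders all three
-- primops, and the simplifier has to preserve it.
bumpThenWrite :: MutableByteArray# RealWorld -> State# RealWorld -> State# RealWorld
bumpThenWrite mba s0 =
  case readIntArray# mba 0# s0 of
    (# s1, i #) ->
      case writeIntArray# mba i 42# s1 of
        s2 -> writeIntArray# mba 0# (i +# 1#) s2

-- The shape of the trick: both writes consume the i produced by the
-- read, so neither can float above it, and the state tokens are
-- replaced by realWorld#. Note that nothing orders the two writes
-- against each other -- tolerable here only because they hit disjoint
-- slots, and unsafe whenever value dependencies do not fully determine
-- the required order.
bumpThenWrite' :: MutableByteArray# RealWorld -> State# RealWorld -> State# RealWorld
bumpThenWrite' mba s0 =
  case readIntArray# mba 0# s0 of
    (# _, i #) ->
      case writeIntArray# mba i 42# realWorld# of
        _ -> writeIntArray# mba 0# (i +# 1#) realWorld#

main :: IO ()
main = do
  r <- IO $ \s0 ->
    case newByteArray# 256# s0 of
      (# s1, mba #) ->
        case writeIntArray# mba 0# 1# s1 of      -- write index starts at slot 1
          s2 ->
            case bumpThenWrite mba s2 of         -- exercise the safe version
              s3 ->
                case readIntArray# mba 1# s3 of
                  (# s4, v #) -> (# s4, I# v #)
  print r   -- 42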

Stefan O'Rear wrote:
2. Parameters are very expensive. Our type of functions that build (ignoring CPS for the time being) was MBA# -> Int# -> [ByteString], where the Int# is the current write pointer. Adding an extra Int# to cache the size of the array (rather than calling sMBA# each time) slowed the code down ~2x. Conversely, moving the write pointer into the byte array (storing it in bytes 0#, 1#, 2#, and 3#) sped the code up by 4x.
If you were measuring on x86 then parameters are passed on the stack, which may be expensive. On x86_64 the first 3 arguments are passed in registers, which is usually a win, but if the function immediately does an eval they need to be saved on the stack anyway. Still, 4x sounds like a lot; perhaps you managed to avoid a stack check in the inner loop or something.
3. MBA# is just as fast as Addr#, and garbage collected to boot.
Not really surprising, that.
4. You can't keep track of which version of the code is which, what is a regression, and what is an enhancement. Don't even try. Next time I try something like this I will make as much use of darcs as possible.
Absolutely - if you'd used darcs, then we could peer in more detail at changes that you thought gave counter-intuitive results.

Simon Peyton-Jones wrote:
| 5. State# threads clog the optimizer quite effectively. Replacing
| st(n-1)# with realWorld# everywhere I could count on data
| dependencies to do the same job doubled performance.
The idea is that the optimiser should allow you to write at a high level, and do the bookkeeping for you. When it doesn't, I like to know, and preferably fix it.
If you had a moment to boil out a small, reproducible example of this kind of optimisation failure (with as few dependencies as possible), then I'll look to see if the optimiser can be cleverer.
Yes, and *please* add some of this folklore to the performance wiki at http://haskell.org/haskellwiki/Performance, if you have the time. Cheers, Simon

Stefan O'Rear wrote:
2. Parameters are very expensive. Our type of functions that build (ignoring CPS for the time being) was MBA# -> Int# -> [ByteString], where the Int# is the current write pointer. Adding an extra Int# to cache the size of the array (rather than calling sMBA# each time) slowed the code down ~2x. Conversely, moving the write pointer into the byte array (storing it in bytes 0#, 1#, 2#, and 3#) sped the code up by 4x.
If you were measuring on x86 then parameters are passed on the stack, which may be expensive. On x86_64 the first 3 arguments are passed in registers, which is usually a win, but if the function immediately does an eval they need to be saved on the stack anyway. Still, 4x sounds like a lot; perhaps you managed to avoid a stack check in the inner loop or something.
Just out of curiosity: what does GHC do with stack parameters in loops/tail calls?

There tends to be a conflict, as the old set of parameters is needed to build the new one for the recursive call/next loop iteration, while one wants to get rid of the old set before doing the call. Unless one keeps the parameter frames away from the stack, or can figure out a safe order in which to overwrite the parameters in the frame, that seems to imply some copying, even though that may be cheap for few/small parameters per frame.

Many years ago, I saw an abstract machine and RISC processor design aiming for good FP support that used two parameter stacks instead of one for just this reason. Instead of moving stack frames around on a single stack, parameters were read from one stack and built up on the other, followed by a cheap stack switch before the call. Perhaps something like this might be of use here, too?

Claus

It's not that hard to figure out an order to permute the arguments on the stack before a tail call that minimizes the number of moves and temporary locations. Lmlc did this 20 years ago. :)

-- Lennart

On Apr 5, 2007, at 19:17, Claus Reinke wrote:
Stefan O'Rear wrote:
2. Parameters are very expensive. Our type of functions that build (ignoring CPS for the time being) was MBA# -> Int# -> [ByteString], where the Int# is the current write pointer. Adding an extra Int# to cache the size of the array (rather than calling sMBA# each time) slowed the code down ~2x. Conversely, moving the write pointer into the byte array (storing it in bytes 0#, 1#, 2#, and 3#) sped the code up by 4x.

If you were measuring on x86 then parameters are passed on the stack, which may be expensive. On x86_64 the first 3 arguments are passed in registers, which is usually a win, but if the function immediately does an eval they need to be saved on the stack anyway. Still, 4x sounds like a lot; perhaps you managed to avoid a stack check in the inner loop or something.
Just out of curiosity: what does GHC do with stack parameters in loops/tail calls?

There tends to be a conflict, as the old set of parameters is needed to build the new one for the recursive call/next loop iteration, while one wants to get rid of the old set before doing the call. Unless one keeps the parameter frames away from the stack, or can figure out a safe order in which to overwrite the parameters in the frame, that seems to imply some copying, even though that may be cheap for few/small parameters per frame.

Many years ago, I saw an abstract machine and RISC processor design aiming for good FP support that used two parameter stacks instead of one for just this reason.

Instead of moving stack frames around on a single stack, parameters were read from one stack and built up on the other, followed by a cheap stack switch before the call. Perhaps something like this might be of use here, too?

Claus

Lennart Augustsson wrote:
It's not that hard to figure out an order to permute the arguments on the stack before a tail call that minimizes the number of moves and temporary locations. Lmlc did this 20 years ago. :)
Right, and that's what GHC does too, with a strongly-connected-component analysis of the dependencies between assignments of the args for the tail call. Cheers, Simon
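[As a toy illustration of that idea -- this is not GHC's implementation -- here is a parallel-move sequentializer built on Data.Graph.stronglyConnComp. The register names and the "tmp" temporary are invented for the example.]

module Main where

import Data.Graph (SCC (..), stronglyConnComp)

-- A parallel move (dst, src): conceptually all sources are read before
-- any destination is written.
type Reg  = String
type Move = (Reg, Reg)

-- Order the moves so no source is clobbered before it is read. A move
-- gets an edge to every move that reads its destination; since
-- stronglyConnComp lists edge targets first, readers run before the
-- move that overwrites what they read. A cyclic component is a
-- register cycle and is broken with one temporary.
sequentialize :: [Move] -> [Move]
sequentialize allMoves = concatMap flatten (stronglyConnComp nodes)
  where
    moves = [m | m@(d, s) <- allMoves, d /= s]   -- drop no-op self moves
    nodes = [ (m, d, [d' | (d', s') <- moves, s' == d]) | m@(d, _) <- moves ]

    flatten (AcyclicSCC m)  = [m]
    flatten (CyclicSCC [])  = []
    flatten (CyclicSCC ms@((d0, _) : _)) = ("tmp", d0) : go d0
      where
        -- Save d0, then walk the cycle: emit the move writing r, then
        -- the move writing r's source, until the cycle closes.
        go r = case filter ((== r) . fst) ms of
                 ((d, s) : _)
                   | s == d0   -> [(d, "tmp")]
                   | otherwise -> (d, s) : go s
                 []            -> []   -- unreachable for a well-formed cycle

main :: IO ()
main = do
  print (sequentialize [("a", "b"), ("b", "c")])  -- shift: [("a","b"),("b","c")]
  print (sequentialize [("a", "b"), ("b", "a")])  -- swap, via the temporary

[An acyclic component is just a move that runs after everything reading its destination; a cyclic component is a register cycle costing one temporary, matching Lennart's "minimizes the number of moves and temporary locations".]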

So then tail calls should be very cheap when most of the arguments don't change.

On Apr 10, 2007, at 10:17, Simon Marlow wrote:
Lennart Augustsson wrote:
It's not that hard to figure out an order to permute the arguments on the stack before a tail call that minimizes the number of moves and temporary locations. Lmlc did this 20 years ago. :)
Right, and that's what GHC does too, with a strongly-connected-component analysis of the dependencies between assignments of the args for the tail call.
Cheers, Simon

On Tue, Apr 10, 2007 at 07:59:04PM +0100, Lennart Augustsson wrote:
So then tail calls should be very cheap when most of the arguments don't change.
On Apr 10, 2007, at 10:17 , Simon Marlow wrote:
Lennart Augustsson wrote:
It's not that hard to figure out an order to permute the arguments on the stack before a tail call that minimizes the number of moves and temporary locations. Lmlc did this 20 years ago. :)
Right, and that's what GHC does too, with a strongly-connected-component analysis of the dependencies between assignments of the args for the tail call.
Cheers, Simon
The tail call in question is NOT statically known (it is a variant of CPS), so simpleminded tail recursion optimizations will not help much.

Stefan
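[A sketch of why the call target is dynamic here -- made-up types, not the hackedbuilder code: in a CPS builder every step ends by tail-calling a continuation it received at run time, so the code generator never sees a static jump target.]

module Main where

-- A builder step takes its continuation as an argument: "what to do
-- once I have written my bytes and advanced the position".
newtype Builder = Builder ((Int -> [String]) -> Int -> [String])

-- The tail call `k ...` jumps to whatever continuation we were handed;
-- its target only exists at run time.
word :: String -> Builder
word w = Builder (\k pos -> k (pos + length w))

append :: Builder -> Builder -> Builder
append (Builder f) (Builder g) = Builder (\k -> f (g k))

run :: Builder -> [String]
run (Builder f) = f (\pos -> ["final position: " ++ show pos]) 0

main :: IO ()
main = print (run (word "foo" `append` word "quux"))
-- ["final position: 7"]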

OK, so there are a few extra costs for unknown tail calls. First, there might be a pipeline stall if the jump address isn't loaded into a register far enough in advance before the jump. Second, you need to pass information about the number of arguments in the caller. And finally, you need to test the number of arguments in the callee. Under the best of circumstances the argument check costs a load immediate, a compare, and a non-taken branch. But looking at the output from ghc I doubt this is the case.

-- Lennart

On Apr 10, 2007, at 23:13, Stefan O'Rear wrote:
On Tue, Apr 10, 2007 at 07:59:04PM +0100, Lennart Augustsson wrote:
So then tail calls should be very cheap when most of the arguments don't change.
On Apr 10, 2007, at 10:17 , Simon Marlow wrote:
Lennart Augustsson wrote:
It's not that hard to figure out an order to permute the arguments on the stack before a tail call that minimizes the number of moves and temporary locations. Lmlc did this 20 years ago. :)
Right, and that's what GHC does too, with a strongly-connected-component analysis of the dependencies between assignments of the args for the tail call.
Cheers, Simon
The tail call in question is NOT statically known (it is a variant of CPS), so simpleminded tail recursion optimizations will not help much.
Stefan

Lennart Augustsson wrote:
So then tail calls should be very cheap when most of the arguments don't change.
Yes, but the problem tends to be the arguments that change, and the fact that they are passed on the stack. A C loop would keep the loop-carried variables in registers. On x86_64 you get better code because some of the args are passed in registers, but it's not really proper loop optimisation. We know what needs to be done, and we plan to make progress on this soon, hopefully over the summer. Cheers, Simon
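[For concreteness, the loop-carried variables in a Haskell tail-recursive loop are the accumulator arguments below -- a generic example, not the builder code. The question raised above is whether they stay in machine registers across the jump, as they would in the equivalent C loop.]

{-# LANGUAGE BangPatterns #-}
module Main where

-- acc and i are the loop-carried variables: each tail call passes
-- updated copies of them back to go. With -O2 GHC typically unboxes
-- them to Int#; whether they then live in machine registers across
-- the jump is exactly the issue described above.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go !acc !i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)

main :: IO ()
main = print (sumTo 1000000)   -- 500000500000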

On 4/5/07, Simon Peyton-Jones wrote:
| 6. The inliner is a bit too greedy. Removing the slow-path code from
| singleton doesn't help because popSingleton is only used once; but
| if I explicitly {-# NOINLINE popSingleton #-}, the code for
| singleton itself becomes much smaller, and inlinable (15% perf
| gain). Plus the new singleton doesn't allocate memory, so I can
| use even MORE realWorld#s.
That's a hard one! Inlining functions that are called just once is usually a huge win. I don't know how to spot what you did in an automated way.
I believe gcc uses __builtin_expect() (http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Other-Builtins.html#index-g_t_00...) to let the programmer say that a branch is unlikely to be taken (i.e. is the slow path). If such a thing were available in GHC, Stefan wouldn't even have needed to pull the slow-path code from singleton; the compiler could have done it for him.

-- Namasté, Jeffrey Yasskin
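[As a toy model of the pattern being discussed -- the Buf type and function names are invented, this is not the real Builder code -- the cold path is outlined behind NOINLINE so the hot path stays small enough for the inliner.]

module Main where

import Data.Word (Word8)

-- A stand-in for the builder's buffer state (invented for the demo).
data Buf = Buf { used :: !Int, cap :: !Int } deriving Show

-- Cold path: pretend to flush and grow. NOINLINE keeps its code out of
-- every call site.
{-# NOINLINE slowPath #-}
slowPath :: Word8 -> Buf -> Buf
slowPath _ (Buf _ c) = Buf 1 (c * 2)

-- Hot path: a bounds check and a bump. With the cold path outlined,
-- the inliner sees a small function and can inline it at every use.
{-# INLINE singleton #-}
singleton :: Word8 -> Buf -> Buf
singleton w b@(Buf u c)
  | u < c     = Buf (u + 1) c   -- common case: room available
  | otherwise = slowPath w b    -- rare case: out of space

main :: IO ()
main = print (foldr singleton (Buf 0 4) [1 .. 6 :: Word8])
-- Buf {used = 2, cap = 8}

[A __builtin_expect-style hint would let the compiler perform this outlining itself, rather than the programmer hand-annotating popSingleton.]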

Hello Stefan,

Thursday, April 5, 2007, 3:11:31 AM, you wrote:
2. Parameters are very expensive.
You should look at the asm code GHC generates. AFAIR, parameters are kept on the stack and copied on each call (to the same place!). Such things are also very dependent on the backend used - was it the ASM or the C one?

BTW, it would be great if you would share your experience on the Haskell wiki.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Thu, Apr 05, 2007 at 02:50:49PM +0400, Bulat Ziganshin wrote:
Hello Stefan,
Thursday, April 5, 2007, 3:11:31 AM, you wrote:
2. Parameters are very expensive.
You should look at the asm code GHC generates. AFAIR, parameters are kept on the stack and copied on each call (to the same place!). Such things are also very dependent on the backend used - was it the ASM or the C one?
Yes, I will update the wiki.

ghc -O2, where:

stefan@stefans:~$ ghc -V
The Glorious Glasgow Haskell Compilation System, version 6.7.20070402
stefan@stefans:~$ uname -a
Linux stefans 2.6.18-3-686 #1 SMP Sun Dec 10 19:37:06 UTC 2006 i686 GNU/Linux
stefan@stefans:~$ cpuid
eax in    eax      ebx      ecx      edx
00000000 00000002 756e6547 6c65746e 49656e69
00000001 00000f24 00010809 00000000 3febfbff
00000002 665b5001 00000000 00000000 007b7040
80000000 80000004 00000000 00000000 00000000
80000001 00000000 00000000 00000000 00000000
80000002 20202020 20202020 20202020 6e492020
80000003 286c6574 50202952 69746e65 52286d75
80000004 20342029 20555043 30302e32 007a4847
Vendor ID: "GenuineIntel"; CPUID level 2

Intel-specific functions:
Version 00000f24:
Type 0 - Original OEM
Family 15 - Pentium 4
Extended family 0
Model 2 - Stepping 4
Reserved 0
Brand index: 9 [not in table]
Extended brand string: " Intel(R) Pentium(R) 4 CPU 2.00GHz"
CLFLUSH instruction cache line size: 8
Hyper threading siblings: 1

Feature flags 3febfbff:
FPU     Floating Point Unit
VME     Virtual 8086 Mode Enhancements
DE      Debugging Extensions
PSE     Page Size Extensions
TSC     Time Stamp Counter
MSR     Model Specific Registers
PAE     Physical Address Extension
MCE     Machine Check Exception
CX8     COMPXCHG8B Instruction
APIC    On-chip Advanced Programmable Interrupt Controller present and enabled
SEP     Fast System Call
MTRR    Memory Type Range Registers
PGE     PTE Global Flag
MCA     Machine Check Architecture
CMOV    Conditional Move and Compare Instructions
FGPAT   Page Attribute Table
PSE-36  36-bit Page Size Extension
CLFSH   CFLUSH instruction
DS      Debug store
ACPI    Thermal Monitor and Clock Ctrl
MMX     MMX instruction set
FXSR    Fast FP/MMX Streaming SIMD Extensions save/restore
SSE     Streaming SIMD Extensions instruction set
SSE2    SSE2 extensions
SS      Self Snoop
HT      Hyper Threading
TM      Thermal monitor

TLB and cache info:
50: Instruction TLB: 4KB and 2MB or 4MB pages, 64 entries
5b: Data TLB: 4KB and 4MB pages, 64 entries
66: 1st-level data cache: 8KB, 4-way set assoc, 64 byte line size
40: No 2nd-level cache, or if 2nd-level cache exists, no 3rd-level cache
70: Trace cache: 12K-micro-op, 4-way set assoc
7b: 2nd-level cache: 512KB, 8-way set assoc, sectored, 64 byte line size

Stefan
participants (7):
- Bulat Ziganshin
- Claus Reinke
- Jeffrey Yasskin
- Lennart Augustsson
- Simon Marlow
- Simon Peyton-Jones
- Stefan O'Rear