Why is GHC so much worse than JHC when computing the Ackermann function?

Hi all, This came up on StackOverflow [1]. When compiled with GHC (7.4.2 & 7.6.2), this simple program: main = print $ ack 4 1 where ack :: Int -> Int -> Int ack 0 n = n+1 ack m 0 = ack (m-1) 1 ack m n = ack (m-1) (ack m (n-1)) consumes all available memory on my machine and slows down to a crawl. However, when compiled with JHC it runs in constant space and is about as fast as the straightforward Ocaml version (see the SO question for benchmark numbers). I was able to fix the space leak by using CPS-conversion, but the CPS-converted version is still about 10 times slower than the naive version compiled with JHC. I looked both at the Core and Cmm, but couldn't find anything obviously wrong with the generated code - 'ack' is compiled to a simple loop of type 'Int# -> Int# -> Int#'. What's more frustrating is that running the program with +RTS -hc makes the space leak mysteriously vanish. Can someone please explain where the space leak comes from and if it's possible to further improve the runtime of this program with GHC? Apparently it's somehow connected to the stack management strategy, since running the program with a larger stack chunk size (+RTS -kc1M) makes the space leak go away. Interestingly, choosing smaller stack chunk sizes (256K, 512K) causes it to die with an OOM exception: $ time ./Test +RTS -kc256K Test: out of memory (requested 2097152 bytes) [1] http://stackoverflow.com/questions/16115815/ackermann-very-inefficient-with-... -- () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments

JHC compiles to C and last time I tried this it looked very much like the
original C version:
http://www.reddit.com/r/haskell/comments/1bcru7/damn_lies_and_haskell_perfor...
This is the same thing. Compile with --tdir=/tmp/ajhc and then cat
/tmp/ajhc/main_code.c. Output should be like this:
static uint32_t A_STD
fW$__fMain_1__ack(gc_t gc,uint32_t v105553376,uint32_t v61835120)
{
if (0 == v105553376) {
return 1 + v61835120;
} else {
uint32_t v16;
uint32_t v22;
struct tup1 x2;
if (0 == v61835120) {
uint32_t v228308038 = (v105553376 - 1);
x2.t0 = v228308038;
x2.t1 = 1;
} else {
uint32_t v110947990;
uint32_t v215884490 = (v61835120 - 1);
uint32_t v62470112 = (v105553376 - 1);
v110947990 = fW$__fMain_1__ack(gc,v105553376,v215884490);
x2.t0 = v62470112;
x2.t1 = v110947990;
}
v22 = x2.t0;
v16 = x2.t1;
return fW$__fMain_1__ack(gc,v22,v16);
}
}
And it's as fast as C on my machine:
chris@midnight:~/Projects/me/ajhc$ time ./ack
65533
real 0m2.134s
user 0m2.124s
sys 0m0.000s
chris@midnight:~/Projects/me/ajhc$ gcc -O3 ack.c -o ack-c
ack.c: In function ‘main’:
ack.c:8:3: warning: incompatible implicit declaration of built-in function
‘printf’ [enabled by default]
chris@midnight:~/Projects/me/ajhc$ time ./ack-c
65533
real 0m2.255s
user 0m2.248s
sys 0m0.000s
chris@midnight:~/Projects/me/ajhc$
On 20 April 2013 10:55, Mikhail Glushenkov
Hi all,
This came up on StackOverflow [1]. When compiled with GHC (7.4.2 & 7.6.2), this simple program:
main = print $ ack 4 1 where ack :: Int -> Int -> Int ack 0 n = n+1 ack m 0 = ack (m-1) 1 ack m n = ack (m-1) (ack m (n-1))
consumes all available memory on my machine and slows down to a crawl. However, when compiled with JHC it runs in constant space and is about as fast as the straightforward Ocaml version (see the SO question for benchmark numbers).
I was able to fix the space leak by using CPS-conversion, but the CPS-converted version is still about 10 times slower than the naive version compiled with JHC.
I looked both at the Core and Cmm, but couldn't find anything obviously wrong with the generated code - 'ack' is compiled to a simple loop of type 'Int# -> Int# -> Int#'. What's more frustrating is that running the program with +RTS -hc makes the space leak mysteriously vanish.
Can someone please explain where the space leak comes from and if it's possible to further improve the runtime of this program with GHC? Apparently it's somehow connected to the stack management strategy, since running the program with a larger stack chunk size (+RTS -kc1M) makes the space leak go away. Interestingly, choosing smaller stack chunk sizes (256K, 512K) causes it to die with an OOM exception:
$ time ./Test +RTS -kc256K Test: out of memory (requested 2097152 bytes)
[1] http://stackoverflow.com/questions/16115815/ackermann-very-inefficient-with-...
-- () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments
_______________________________________________ Glasgow-haskell-users mailing list Glasgow-haskell-users@haskell.org http://www.haskell.org/mailman/listinfo/glasgow-haskell-users

Hi,
On Sat, Apr 20, 2013 at 12:08 PM, Christopher Done
JHC compiles to C and last time I tried this it looked very much like the original C version: http://www.reddit.com/r/haskell/comments/1bcru7/damn_lies_and_haskell_perfor...
This is the same thing.
Yes, but with the Fibonacci example GHC wasn't consuming all available memory and running 10 times slower. Surely it must be possible to make the GHC version at least 2x as slow as Ocaml/GHC? -- () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments

There's something strange going on in this example. For instance, setting (-M) heap limits as low as 40K seems to have no effect, even though the program easily uses more than 8G. Except, interrupting the program in such a case does seem to give a message about heap limits being exceeded (it won't stop on its own, though). Also, the program compiled without optimizations uses very little memory (though it's slow), which is odd, because the optimized version works exclusively on Int#, which shouldn't cause heap allocation. I filed a ticket earlier: http://hackage.haskell.org/trac/ghc/ticket/7850

Hi,
On 20 April 2013 14:20, Dan Doel
There's something strange going on in this example. For instance, setting (-M) heap limits as low as 40K seems to have no effect, even though the program easily uses more than 8G.
Apparently the only things it allocates is stack chunks. I managed to produce a version of this program that runs approximately as fast as the Ocaml one by manually unrolling the main loop [1], but it still has to be run with +RTS -kc1M to avoid the memory leak. [1] https://gist.github.com/23Skidoo/5425891 -- () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments
participants (3)
-
Christopher Done
-
Dan Doel
-
Mikhail Glushenkov