
Hi List, I'm running GHC and GCC head-to-head on the task of adding a bunch of long IOUArray-Vectors really fast. My machine is a Linux-ppc PowerBook and gets a runtime for the GHC-compiled binary that's about 10x as long as for GCC. Simon M. tells me this should be much better. Here are the precise command lines: $ gcc -O3 -o b-addvec-gcc b-addvec-c.c $ ghc -O2 -fasm --make -o b-addvec-ghc b-addvec-hs.hs $ time ./b-addvec-gcc 100000 elements 1000 real 0m5.130s user 0m4.466s sys 0m0.061s $ time ./b-addvec-ghc 100000 elements 1000 real 0m49.701s user 0m43.466s sys 0m0.586s (compiling with -fvia-C -optc-O3 runs only about 1 second longer) Can somebody shed some light on this? Regards, Sven Moritz

Sven Moritz Hallberg
I'm running GHC and GCC head-to-head on the task of adding a bunch of long IOUArray-Vectors really fast. My machine is a Linux-ppc PowerBook and gets a runtime for the GHC-compiled binary that's about 10x as long as for GCC.
Is it possible that gcc is making use of the ppc AltiVec instructions, and ghc is not? Regards, Malcolm

Hello Malcolm, Wednesday, January 18, 2006, 4:22:23 PM, you wrote:
I'm running GHC and GCC head-to-head on the task of adding a bunch of long IOUArray-Vectors really fast. My machine is a Linux-ppc PowerBook and gets a runtime for the GHC-compiled binary that's about 10x as long as for GCC.
MW> Is it possible that gcc is making use of the ppc AltiVec instructions, MW> and ghc is not? :) even C version performs only 20 millions of additions in one second because this program is most limited by memory throughput - it access to 24 memory bytes per each addition. GHC just can't produce simple loops even for "imperative" code. JHC can be much better in that area, i strongly recommend Sven to try it -- Best regards, Bulat mailto:bulatz@HotPOP.com

On Wed, Jan 18, 2006 at 06:18:29PM +0300, Bulat Ziganshin wrote:
:) even C version performs only 20 millions of additions in one second because this program is most limited by memory throughput - it access to 24 memory bytes per each addition. GHC just can't produce simple loops even for "imperative" code. JHC can be much better in that area, i strongly recommend Sven to try it
Jhc doesn't have 'true' arrays yet, partially because I have not decided how points-to analysis should work for them. (I will probably just union all their points-to information since they most likely will be filled by the same routine). GHCs indirect calls are really killing its performance in tight loops. I think there is room for collaboration between the various compilers there, since we are all moving to a c-- back end (in theory) we could work on a common c-- -> C translator that searches out such uneeded indirections and zaps them before they get to gcc which doesn't handle them well at all. John -- John Meacham - ⑆repetae.net⑆john⑈

Hello John, Thursday, January 19, 2006, 4:42:47 AM, you wrote:
sorry, with the "gcc -O3 -ffast-math -fstrict-aliasing -funroll-loops" the C version is 50 times faster than best Haskell one... it's the loop from C version:
JM> I believe something similar to what I noted here is the culprit: JM> http://www.haskell.org//pipermail/glasgow-haskell-users/2005-October/009174.... i mean this when i wrote that jhc must be better. i just forget than jhc is not yet supports arrays JM> GHCs indirect calls are really killing its performance in tight loops. I JM> think there is room for collaboration between the various compilers JM> there, since we are all moving to a c-- back end (in theory) we could JM> work on a common c-- -> C translator that searches out such uneeded JM> indirections and zaps them before they get to gcc which doesn't handle JM> them well at all. using your ideas to dramatical speed up simple loops will be very good. GHC is very good in high-level optimizations, but it simply ignores the problem of compilation simple imperative-style code -- Best regards, Bulat mailto:bulatz@HotPOP.com

John Meacham wrote:
On Wed, Jan 18, 2006 at 06:18:29PM +0300, Bulat Ziganshin wrote:
:) even C version performs only 20 millions of additions in one second because this program is most limited by memory throughput - it access to 24 memory bytes per each addition. GHC just can't produce simple loops even for "imperative" code. JHC can be much better in that area, i strongly recommend Sven to try it
Jhc doesn't have 'true' arrays yet, partially because I have not decided how points-to analysis should work for them. (I will probably just union all their points-to information since they most likely will be filled by the same routine).
GHCs indirect calls are really killing its performance in tight loops. I think there is room for collaboration between the various compilers there, since we are all moving to a c-- back end (in theory) we could work on a common c-- -> C translator that searches out such uneeded indirections and zaps them before they get to gcc which doesn't handle them well at all.
A simple tail-recursive loop shouldn't contain any indirect jumps in GHC, we are careful to compile jumps to known locations into absolute jumps (though, of course we don't do points-to). My impression is that it is the lack of real low-level loop optimisation in GHC that is really hurting with these examples, and we don't get to take advantage of GCC's loop optimiser because the C code we generate doesn't look enough like a loop - that's something we can improve on in some cases, perhaps, but when I looked at it I didn't find any quick hacks to improve things. Cheers, Simon

Hello Sven, Wednesday, January 18, 2006, 3:33:40 PM, you wrote: SMH> and gets a runtime for the GHC-compiled binary that's about 10x as long SMH> as for GCC. Simon M. tells me this should be much better. Here are the attached version is only 5 times slower :) please note that 1) unsafeRead/Write indexes from 0 and don't checks bounds (just C-like :) 2) generating random values takes about 1.5-2 seconds by itself. Haskell's RNG is very different from C's one -- Best regards, Bulat mailto:bulatz@HotPOP.com

Bulat Ziganshin wrote:
Wednesday, January 18, 2006, 3:33:40 PM, you wrote:
SMH> and gets a runtime for the GHC-compiled binary that's about 10x as long SMH> as for GCC. Simon M. tells me this should be much better. Here are the
attached version is only 5 times slower :) please note that
1) unsafeRead/Write indexes from 0 and don't checks bounds (just C-like :) 2) generating random values takes about 1.5-2 seconds by itself. Haskell's RNG is very different from C's one
I squeezed a bit more out (see attached). I think the main bottleneck is now the random number generator, in particular it is supplying boxed Doubles which have to be unboxed again before storing in the array. Cheers, Simon import Data.Array.IO import Data.Array.Base import System.Environment (getArgs) import System.Random type Vector = IOUArray Int Double main = do (n:f:m:_) <- getArgs let (nelems,niterations) = case f of "elements" -> (read n, read m) "iterations" -> (read m, read n) x <- newArray (0,nelems-1) 0 :: IO Vector v <- newArray_ (0,nelems-1) :: IO Vector x `seq` v `seq` return () for 0 nelems $ \i -> do r <- randomRIO (-1,1) unsafeWrite v i r for 0 niterations $ \_ -> for 0 nelems $ \i -> do xi <- unsafeRead x i vi <- unsafeRead v i unsafeWrite x i (xi+vi) --for 0 nelems $ \i -> -- do xi <- unsafeRead x i -- putStr (show xi) -- putChar ' ' --putChar '\n' for :: Int -> Int -> (Int -> IO a) -> IO () -- Faster equivalent of "mapM_ action [from..to-1]" for from to action | from `seq` to `seq` False = undefined for from to action = go from where go i | i>=to = return () | otherwise = do action i go $! (i+1)

Hello Simon, Wednesday, January 18, 2006, 5:31:25 PM, you wrote:
2) generating random values takes about 1.5-2 seconds by itself. Haskell's RNG is very different from C's one
SM> I squeezed a bit more out (see attached).
x `seq` v `seq` return ()
it's new trick for me :) now the difference is less than 3x btw, i also use "return $! length xs" trick to ensure that all xs elements will be evaluated
for from to action | from `seq` to `seq` False = undefined
and this changes nothing, at least with 6.4.1/mingw32 btw, using "mapM_ action [n..m]" is very common operation. can it be automatically substituted with my code by using some RULE pragmas in ghc libraries? that will automatically improve many ghc-compiled programs too, i use the following code instead of replicateM: myReplicateM n action = if (n<=5*10^4) then sequence (replicate n action) else goLarge n [] >>= return.reverse where goLarge 0 xs = return xs goLarge n xs = do x <- action (goLarge $! n-1) $! x:xs it doesn't overflow stack and works much faster for the large n. that is my testbed for this function: import Control.Monad main = do a <- replicateM 1 $ myReplicateM (1*10^6) (return 1) return $! sum (map last a) and also, how about adding to GHC strictness annotations? x <- newArray (0,nelems-1) 0 :: IO !Vector v <- newArray_ (0,nelems-1) :: IO !Vector for :: !Int -> !Int -> (!Int -> IO a) -> IO () it's SO common source of performance problems... SM> I think the main bottleneck SM> is now the random number generator, in particular it is supplying boxed SM> Doubles which have to be unboxed again before storing in the array. as i say, it uses 1.5-2 seconds, i.e. only 10% of time when you run 1000 iterations (may be you not noticed that it used only in initialization?). so, while RNG itself runs 150 times slower (!), it doesn't make so much difference when you run 1000 iterations after initial filling the array and about "using Altivec instructions". the code produced for new.hs contains only one `fadd` operation, so it is easy to find entire cycle as it is compiled by GHC. that is one: movl (%ebp), %eax cmpl 12(%esi), %eax jge L81 movl 8(%esi), %edx leal 8(%edx,%eax,8), %eax movl (%eax), %edx movl %edx, 16(%esp) movl 4(%eax), %eax movl %eax, 20(%esp) fldl 16(%esp) fstpl 24(%esp) fldl 24(%esp) fstpl 48(%esp) movl (%ebp), %eax movl 4(%esi), %edx leal 8(%edx,%eax,8), %eax movl (%eax), %edx movl %edx, 8(%esp) movl 4(%eax), %eax movl %eax, 12(%esp) fldl 8(%esp) fstpl 24(%esp) fldl 24(%esp) fstpl 40(%esp) fldl 48(%esp) faddl 40(%esp) fstpl 32(%esp) movl (%ebp), %ecx movl 8(%esi), %eax leal 8(%eax,%ecx,8), %ecx fldl 32(%esp) fstpl 24(%esp) movl 24(%esp), %eax movl 28(%esp), %edx movl %eax, (%ecx) movl %edx, 4(%ecx) incl (%ebp) movl $_s3IY_info, %eax L85: jmp *%eax L81: good work, yes? ;-) the C source is also amateur :) IF_(s3IY_entry) { W_ _c3MF; StgDouble _s3IP; StgDouble _s3IQ; StgDouble _s3IS; W_ _s3IW; FB_ _c3MF = (I_)(*Sp) >= (I_)(R1.p[3]); if (_c3MF >= 0x1U) goto _c3MI; _s3IP = PK_DBL((P_)(((R1.p[2]) + 0x8U) + ((*Sp) << 0x3U))); _s3IQ = PK_DBL((P_)(((R1.p[1]) + 0x8U) + ((*Sp) << 0x3U))); _s3IS = _s3IP + _s3IQ; ASSIGN_DBL((P_)(((R1.p[2]) + 0x8U) + ((*Sp) << 0x3U)),_s3IS); _s3IW = (*Sp) + 0x1U; *Sp = _s3IW; JMP_((W_)&s3IY_info); _c3MI: R1.p = (P_)(W_)&GHCziBase_Z0T_closure; Sp=Sp+1; JMP_(*Sp); FE_ } the only cause that this code is only 3 times slower is that C version is really limited by memory speed. when tested on 1000-element arrays, it is 20 times slower. i'm not yet tried SSE optimization for gcc ;) -- Best regards, Bulat mailto:bulatz@HotPOP.com

Hello Bulat, Wednesday, January 18, 2006, 8:34:54 PM, you wrote: BZ> the only cause that this code is only 3 times slower is that C version BZ> is really limited by memory speed. when tested on 1000-element BZ> arrays, it is 20 times slower. i'm not yet tried SSE optimization for BZ> gcc ;) sorry, with the "gcc -O3 -ffast-math -fstrict-aliasing -funroll-loops" the C version is 50 times faster than best Haskell one... it's the loop from C version: L18: fldl (%edx) faddl (%ecx) fstpl (%edx) fldl 8(%edx) faddl 8(%ecx) fstpl 8(%edx) fldl 16(%edx) faddl 16(%ecx) fstpl 16(%edx) fldl 24(%edx) faddl 24(%ecx) addl $4,%ebx addl $32,%ecx fstpl 24(%edx) addl $32,%edx cmpl -4(%ebp),%ebx jl L18 -- Best regards, Bulat mailto:bulatz@HotPOP.com

On Wed, Jan 18, 2006 at 08:54:43PM +0300, Bulat Ziganshin wrote:
sorry, with the "gcc -O3 -ffast-math -fstrict-aliasing -funroll-loops" the C version is 50 times faster than best Haskell one... it's the loop from C version:
I believe something similar to what I noted here is the culprit: http://www.haskell.org//pipermail/glasgow-haskell-users/2005-October/009174.... it is fixable, but not without modifying ghc. John -- John Meacham - ⑆repetae.net⑆john⑈

John Meacham wrote:
On Wed, Jan 18, 2006 at 08:54:43PM +0300, Bulat Ziganshin wrote:
sorry, with the "gcc -O3 -ffast-math -fstrict-aliasing -funroll-loops" the C version is 50 times faster than best Haskell one... it's the loop from C version:
I believe something similar to what I noted here is the culprit: http://www.haskell.org//pipermail/glasgow-haskell-users/2005-October/009174....
it is fixable, but not without modifying ghc.
Ah, I see what you mean by indirect jumps. Those indirect jumps go away if you compile with -optc-O2 or -fasm, they're droppings left by inadequacies in gcc's standard -O optimisation. Actually, -fasm does better by one instruction than gcc on this example: .globl Test_zdwfac_info Test_zdwfac_info: movq (%rbp),%rax cmpq $1,%rax jne .LcmO movq 8(%rbp),%r13 addq $16,%rbp jmp *(%rbp) .LcmO: leaq -1(%rax),%rcx imulq 8(%rbp),%rax movq %rax,8(%rbp) movq %rcx,(%rbp) jmp Test_zdwfac_info vs. gcc -O2: Test_zdwfac_info: .text .align 8 movq (%rbp), %rdx cmpq $1, %rdx je .L6 .L3: movq 8(%rbp), %rax imulq %rdx, %rax decq %rdx movq %rdx, (%rbp) movq %rax, 8(%rbp) jmp Test_zdwfac_info .p2align 4,,7 .L6: movq 8(%rbp), %r13 addq $16, %rbp jmp *(%rbp) We should probably reverse the sense of that branch, like gcc does. The memory accesses are still there, of course. Hopefully someday I'll get around to trying to use more registers on x86_64 again. Cheers, Simon
participants (5)
-
Bulat Ziganshin
-
John Meacham
-
Malcolm Wallace
-
Simon Marlow
-
Sven Moritz Hallberg