
Don Stewart wrote:
Here's an example that doesn't use floating point:
import Data.Array.Vector
import Data.Bits

main = print . sumU $ zipWith3U (\x y z -> x * y * z)
            (enumFromToU 1 (100000000 :: Int))
            (enumFromToU 2 (100000001 :: Int))
            (enumFromToU 7 (100000008 :: Int))
In core:
main_$s$wfold :: Int# -> Int# -> Int# -> Int# -> Int#
main_$s$wfold =
  \ (sc_s1l1 :: Int#) (sc1_s1l2 :: Int#) (sc2_s1l3 :: Int#) (sc3_s1l4 :: Int#) ->
    case ># sc2_s1l3 100000000 of _ {
      False ->
        case ># sc1_s1l2 100000001 of _ {
          False ->
            case ># sc_s1l1 100000008 of _ {
              False ->
                main_$s$wfold
                  (+# sc_s1l1 1) (+# sc1_s1l2 1) (+# sc2_s1l3 1)
                  (+# sc3_s1l4 (*# (*# sc2_s1l3 sc1_s1l2) sc_s1l1));
              True -> sc3_s1l4 };
          True -> sc3_s1l4 };
      True -> sc3_s1l4 }
Rather nice!
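To see the shape of that fused loop more directly, here is a hand-written Haskell version with the same structure (my own paraphrase of the Core above, not compiler output; the names are made up):

{-# LANGUAGE BangPatterns #-}

-- Illustrative only: three counters marching in lockstep plus a strict
-- accumulator, mirroring the fused worker GHC produced above.
fusedSum :: Int
fusedSum = go 1 2 7 0
  where
    go :: Int -> Int -> Int -> Int -> Int
    go !x !y !z !acc
      | x > 100000000 || y > 100000001 || z > 100000008 = acc
      | otherwise = go (x + 1) (y + 1) (z + 1) (acc + x * y * z)

main :: IO ()
main = print fusedSum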
With -fvia-C -optc-O3:
Main_mainzuzdszdwfold_info:
        cmpq $100000000, %rdi
        jg .L6
        cmpq $100000001, %rsi
        jg .L6
        cmpq $100000008, %r14
        jle .L10
.L6:
        movq %r8, %rbx
        movq (%rbp), %rax
        jmp *%rax
.L10:
        movq %rsi, %r10
        leaq 1(%rsi), %rsi
        imulq %rdi, %r10
        leaq 1(%rdi), %rdi
        imulq %r14, %r10
        leaq 1(%r14), %r14
        leaq (%r10,%r8), %r8
        jmp Main_mainzuzdszdwfold_info
Which looks ok.
$ time ./zipwith3
3541230156834269568
./zipwith3  0.33s user 0.00s system 99% cpu 0.337 total
And with -fasm we get very different code, and a bit of a slowdown:
Main_mainzuzdszdwfold_info:
.Lc1mo:
        cmpq $100000000,%rdi
        jg .Lc1mq
        cmpq $100000001,%rsi
        jg .Lc1ms
        cmpq $100000008,%r14
        jg .Lc1mv
        movq %rsi,%rax
        imulq %r14,%rax
        movq %rdi,%rcx
        imulq %rax,%rcx
        movq %r8,%rax
        addq %rcx,%rax
        leaq 1(%rdi),%rcx
        leaq 1(%rsi),%rdx
        incq %r14
        movq %rdx,%rsi
        movq %rcx,%rdi
        movq %rax,%r8
        jmp Main_mainzuzdszdwfold_info
.Lc1mq:
        movq %r8,%rbx
        jmp *(%rbp)
.Lc1ms:
        movq %r8,%rbx
        jmp *(%rbp)
.Lc1mv:
        movq %r8,%rbx
        jmp *(%rbp)
Slower:
$ time ./zipwith3
3541230156834269568
./zipwith3  0.38s user 0.00s system 98% cpu 0.384 total
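For reference, the two binaries compared above were presumably built along these lines (the exact command lines aren't in the message, so treat the flags as an assumption):

$ ghc -O2 --make zipwith3.hs -fvia-C -optc-O3 -o zipwith3
$ ghc -O2 --make zipwith3.hs -fasm -o zipwith3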
Now maybe we need to wait on the new backend optimizations to get there?
I've been looking at some of these cases and seeing how the LLVM back-end performs. My general impression from benchmarking the LLVM back-end in the past is that it has similar performance characteristics to the C code generator; that is, where the C code generator stood out compared to the NCG, so did LLVM. (These numbers are on x86-32/Mac OS X 10.5; I had some issues getting x64 working at the moment.)

./zipWith3_viac 0.72s
./zipWith3_fasm 0.65s
./zipWith3_llvm 0.38s

Code that LLVM produces:

_Main_mainzuzdszdwfold_entry:           ## BB#0: ## %c1qP
        subl $12, %esp
        jmp LBB2_1
        .align 4, 0x90
LBB2_4:                                 ## %n1re
                                        ## Loop Depth 1
                                        ## Loop Header is BB2_1
                                        ## Inner Loop
        movl %ecx, %esi
        incl %ecx
        imull %eax, %esi
        incl %eax
        imull %edx, %esi
        incl %edx
        addl (%ebp), %esi
        movl %edx, 12(%ebp)
        movl %ecx, 8(%ebp)
        movl %eax, 4(%ebp)
        movl %esi, (%ebp)
LBB2_1:                                 ## %tailrecurse
                                        ## Loop Depth 1
                                        ## Loop Header
                                        ## Inner Loop
        movl 4(%ebp), %eax
        cmpl $100000000, %eax
        jg LBB2_5
## BB#2:                                ## %n1qX
                                        ## Loop Depth 1
                                        ## Loop Header is BB2_1
                                        ## Inner Loop
        movl 8(%ebp), %ecx
        cmpl $100000001, %ecx
        jg LBB2_5
## BB#3:                                ## %n1r5
                                        ## Loop Depth 1
                                        ## Loop Header is BB2_1
                                        ## Inner Loop
        movl 12(%ebp), %edx
        cmpl $100000008, %edx
        jle LBB2_4
LBB2_5:                                 ## %c1qW
        movl 16(%ebp), %eax
        movl (%ebp), %esi
        addl $16, %ebp
        movl (%eax), %eax
        addl $12, %esp
        jmpl *%eax                      # TAILCALL

Which is very nice. (The comments in the code are inserted by LLVM, not me.)

I also ran through some of the programs outlined here:

http://permalink.gmane.org/gmane.comp.lang.haskell.glasgow.user/18151

All were run with: echo '1e-8' | ./$PRG

Loop.hs:
========================================
{-# LANGUAGE BangPatterns #-}
module Main (main) where

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let !mx   = (4/eps)
        !pi14 = pisum mx
    putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (4*pi14)

pisum :: Double -> Double
pisum cut = go True 1 0
  where
    go b n s | cut < n = if b then s+1/(2*n) else s-1/(2*n)
    go True  n !s = go False (n+2) (s + recip n)
    go False n !s = go True  (n+2) (s - recip n)
========================================
./Loops_fasm 4.53s
./Loops_viac 4.22s
./Loops_llvm 2.89s

Fusion.hs (uses the stream-fusion package):
========================================
{-# LANGUAGE BangPatterns #-}   -- needed for the strict let bindings below
module Main (main) where

import qualified Data.List.Stream as S

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let !mx = floor (4/eps)
        !k  = (mx+1) `quot` 2
    putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k)

leibniz n = (4 *) $ S.sum $ S.take n step

step :: [Double]
step = S.unfoldr phi (True,1)
  where
    phi (sig,d) | sig       = Just (1/d, (False,d+2))
                | otherwise = Just (negate (1/d), (True,d+2))
========================================
./Fusion_fasm 4.61s
./Fusion_viac 4.22s
./Fusion_llvm 3.62s

List.hs:
========================================
{-# LANGUAGE BangPatterns #-}   -- needed for the strict let binding below
module Main (main) where

import Data.List (unfoldr)

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let mx = floor (4/eps)
        !k = (mx+1) `quot` 2
    putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k)

leibniz n = (4 *) $ sum $ take n step

step :: [Double]
step = unfoldr phi (True,1)
  where
    phi (sig,d) | sig       = Just (1/d, (False,d+2))
                | otherwise = Just (negate (1/d), (True,d+2))
========================================
./List_fasm 18.21s
./List_viac 16.71s
./List_llvm 16.92s

So with these kinds of results (though obviously I'm biased, since I wrote the LLVM back-end), I think the sentiment that the -fvia-C approach should eventually be removed is the right way to go: with the LLVM back-end and the new code generator there is a promising and much more interesting future.

~ David
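Likewise, the build lines for the _viac/_fasm/_llvm variants above aren't given in the message; assuming standard flags they would look something like this, with the LLVM back-end selected via -fllvm:

$ ghc -O2 --make Loop.hs -fvia-C -optc-O3 -o Loops_viac
$ ghc -O2 --make Loop.hs -fasm -o Loops_fasm
$ ghc -O2 --make Loop.hs -fllvm -o Loops_llvm
$ echo '1e-8' | ./Loops_llvm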