
Am Sonntag 31 Januar 2010 15:52:41 schrieb Stephen Tetley:
Hi Daniel
Thanks - the figures are very impressive for the stream fusion library. I knew the paper, but I hadn't looked at it the implementation.
Just for the record, the loops (my variant) coded in C take 2.7 seconds (icc) resp. 3.0 seconds (gcc) with -O3. Markus' loop takes 5.32s (icc) resp. 6.04s (gcc) with -O3. Well, that's largely due to the fact that that takes two divisions per iteration (and divisions are very slow on my box). Rewriting it so that only one division per iteration is necessary: ================================================ module Main (main) where main :: IO () main = do putStrLn "EPS: " eps <- readLn :: IO Double let pi14 = pisum 1 3 False (eps*0.25) putStrLn $ "PI mit EPS "++(show eps)++" = "++ show(4*pi14) pisum s i b eps | d < eps = s | b = let h = s+d in h `seq` pisum h (i+2) False eps | otherwise = let h = s-d in h `seq` pisum h (i+2) True eps where d = 1/i ================================================ it takes 5.42 seconds (eps = 1e-8) when compiled natively, that loop in C is practically indistinguishable from my loop variant. So stream-fusion or hand-coded loops in Haskell are clearly slower than C loops, but don't stink when compiled with the native code generator. However, I get identical to gcc performance with ghc -O2 -fexcess-precision -fvia-C -optc-O3 [-optc-ffast-math doesn't make any difference then] for the hand-coded loops (all, my variant, Markus' original, Markus' rewritten - of course, each identical to the corresponding C-loop). Also for stream-fusion. Some of the list creating variants also profit from compilation via C, others not. However, none come near the NCG loop performance, let alone C loop performance. To sum up: Although GHC's native code generator does a not too bad job with arithmetic-heavy loops, gcc is better (even without optimisations). But if you compile via C, you can get identical performance - for these loops. In general, things are more complicated, often the native code generator produces better code than the via-C route. It's usually worth a test which produces the better binary, native or via-C.