
Am Montag 15 Februar 2010 17:37:55 schrieb Simon Marlow:
On 14/02/2010 17:58, Don Stewart wrote:
igloo:
Hi all,
We are planning to remove the -fvia-c way of compiling code (unregisterised compilers will continue to compile via C only, but registerised compilers will only use the native code generator). We'll probably deprecate -fvia-c in the 6.14 branch, and remove it in 6.16.
Simon Marlow has recently fixed FP performance for modern x86 chips in the native code generator in the HEAD. That was the last reason we know of to prefer via-C to the native code generators. But before we start the removal process, does anyone know of any other problems with the native code generators that need to be fixed first?
Do we have the blessing of the DPH team, wrt. tight, numeric inner loops?
As recently as last year -fvia-C -optc-O3 was still useful for some microbenchmarks -- what's changed in that time, or is expected to change?
If you have benchmarks that show a significant difference, I'd be interested to see them.
I have a benchmark (or a couple) from the Beginners mailing list two weeks ago (thread starting in January at http://www.haskell.org/pipermail/beginners/2010-January/003356.html and continued in February at http://www.haskell.org/pipermail/beginners/2010-February/003373.html ff) which show a significant difference. Loop.hs: ======================================== {-# LANGUAGE BangPatterns #-} module Main (main) where main :: IO () main = do putStrLn "EPS: " eps <- readLn :: IO Double let !mx = (4/eps) !pi14 = pisum mx putStrLn $ "PI mit EPS "++(show eps)++" = "++ show(4*pi14) pisum :: Double -> Double pisum cut = go True 1 0 where go b n s | cut < n = if b then s+1/(2*n) else s-1/(2*n) go True n !s = go False (n+2) (s+recip n) go False n !s = go True (n+2) (s-recip n) ======================================== $ echo '1e-8' | time ./Loop ghc -O2 --make: 4.53s ghc -O2 -fexcess-precision --make: 4.54s ghc -O2 -fvia-C -optc-O3 --make: 7.52s ghc -O2 -fvia-C -optc-O3 -optc-ffast-math --make: 7.53s ghc -O2 -fvia-C -optc-O3 -optc-ffast-math -optc-fno-float-store --make: 3.02s ghc -O2 -fvia-C -optc-O3 -optc-fno-float-store --make: 3.02s ghc -O2 -fexcess-precision -fvia-C -optc-O3 --make: 3.02s The loop coded in C and compiled with gcc -O3 [-ffast-math, -fno-float- store, -msse2 make no difference there] also takes 3.02s (gcc-4.3.2), 2.70s with icc -O3 (icc 11.0). It is probably worth pointing out, however, that on Markus Böhm's box running Windows XP, the native code generator produced better code than the via-C route (NCG code was faster there than on my box [openSUSE 11.1], while -O2 -fexcess-precision -fvia-C -optc-O3 on his box was slower than NCG on mine). Similar results for Fusion.hs (uses stream-fusion package) ======================================== module Main (main) where import qualified Data.List.Stream as S main :: IO () main = do putStrLn "EPS: " eps <- readLn :: IO Double let !mx = floor (4/eps) !k = (mx+1) `quot` 2 putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k) leibniz n = (4 *) $ S.sum $ S.take n step step :: [Double] step = S.unfoldr phi (True,1) where phi (sig,d) | sig = Just (1/d, (False,d+2)) | otherwise = Just (negate (1/d), (True,d+2)) ======================================== ghc -O2 [-fexcess-precision] --make: 4.22s ghc -O2 -fexcess-precision -fvia-C -optc-O3 --make: 3.02s Using lists instead of loops, List.hs ======================================== module Main (main) where import Data.List (unfoldr) main :: IO () main = do putStrLn "EPS: " eps <- readLn :: IO Double let mx = floor (4/eps) !k = (mx+1) `quot` 2 putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k) leibniz n = (4 *) $ sum $ take n step step :: [Double] step = unfoldr phi (True,1) where phi (sig,d) | sig = Just (1/d, (False,d+2)) | otherwise = Just (negate (1/d), (True,d+2)) ======================================== things are much slower, 23.60s vs. 18.15s, but the via-C route is again significantly faster.
What I've done for 6.14.1 is to add the -msse2 flag to the x86 backend, so where previously we had to use -fvia-C -fexcess-precision -optc-O3 etc. to get reasonable floating point performance, now we can use -msse2 with the native code gen and get about the same results.
Can I test whether I get about the same results as with -fvia-C ... for the above? I.e., is it in the HEAD, and would I have to pass -msse2 on the command line or is it implied by -O2 already?
In the future we have a couple of ways that things could get better:
1. The new back-end, which eventually will incorporate more optimisations at the C-- level, and potentially could produce good loop code. It will also free up some registers.
2. Compiling via LLVM.
Dropping the C backend will give us more flexibility with calling conventions, letting us use more of the x86 registers for passing arguments. We can only make this change by removing -fvia-C, though. There's low hanging fruit here particularly for the x86 backend, as soon as we drop -fvia-C.
There are other reasons to want to get rid of -fvia-C:
- it doubles the testing surface
- it's associated with a bucketload of grotesque Perl 4 code and gcc-specific hacks in the RTS headers.
Cheers, Simon