Re: Removing/deprecating -fvia-c

15 Feb 2010

      Am Montag 15 Februar 2010 17:37:55 schrieb Simon Marlow:
...
On 14/02/2010 17:58, Don Stewart wrote:
...
igloo:
...
Hi all,
We are planning to remove the -fvia-c way of compiling code
(unregisterised compilers will continue to compile via C only, but
registerised compilers will only use the native code generator).
We'll probably deprecate -fvia-c in the 6.14 branch, and remove it in
6.16.
Simon Marlow has recently fixed FP performance for modern x86 chips
in the native code generator in the HEAD. That was the last reason we
know of to prefer via-C to the native code generators. But before we
start the removal process, does anyone know of any other problems
with the native code generators that need to be fixed first?
Do we have the blessing of the DPH team, wrt. tight, numeric inner
loops?
As recently as last year -fvia-C -optc-O3 was still useful for some
microbenchmarks -- what's changed in that time, or is expected to
change?
If you have benchmarks that show a significant difference, I'd be
interested to see them.
I have a benchmark (or a couple) from the Beginners mailing list two weeks 
ago (thread starting in January at 
http://www.haskell.org/pipermail/beginners/2010-January/003356.html and 
continued in February at 
http://www.haskell.org/pipermail/beginners/2010-February/003373.html ff) 
which show a significant difference.

Loop.hs:
========================================
{-# LANGUAGE BangPatterns #-}
module Main (main) where

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let !mx = (4/eps)
        !pi14 = pisum mx
    putStrLn $ "PI mit EPS "++(show eps)++" = "++ show(4*pi14)

pisum :: Double -> Double
pisum cut = go True 1 0
      where
        go b n s | cut < n = if b then s+1/(2*n) else s-1/(2*n)
        go True n !s = go False (n+2) (s+recip n)
        go False n !s = go True (n+2) (s-recip n)
========================================

$ echo '1e-8' | time ./Loop

ghc -O2 --make:
4.53s
ghc -O2 -fexcess-precision --make:
4.54s
ghc -O2 -fvia-C -optc-O3 --make:
7.52s
ghc -O2 -fvia-C -optc-O3 -optc-ffast-math --make:
7.53s
ghc -O2 -fvia-C -optc-O3 -optc-ffast-math -optc-fno-float-store --make:
3.02s
ghc -O2 -fvia-C -optc-O3 -optc-fno-float-store --make:
3.02s
ghc -O2 -fexcess-precision -fvia-C -optc-O3 --make:
3.02s

The loop coded in C and compiled with gcc -O3 [-ffast-math, -fno-float-
store, -msse2 make no difference there] also takes 3.02s (gcc-4.3.2), 2.70s 
with icc -O3 (icc 11.0).

It is probably worth pointing out, however, that on Markus Böhm's box 
running Windows XP, the native code generator produced better code than the 
via-C route (NCG code was faster there than on my box [openSUSE 11.1], 
while -O2 -fexcess-precision -fvia-C -optc-O3 on his box was slower than 
NCG on mine).

Similar results for

Fusion.hs (uses stream-fusion package)
========================================
module Main (main) where

import qualified Data.List.Stream as S

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let !mx = floor (4/eps)
        !k = (mx+1) `quot` 2
    putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k)

leibniz n = (4 *) $ S.sum $ S.take n step

step :: [Double]
step = S.unfoldr phi (True,1) where
   phi (sig,d) | sig         = Just (1/d, (False,d+2))
               | otherwise   = Just (negate (1/d), (True,d+2))
========================================

ghc -O2 [-fexcess-precision] --make:
4.22s
ghc -O2 -fexcess-precision -fvia-C -optc-O3 --make:
3.02s

Using lists instead of loops,

List.hs
========================================
module Main (main) where

import Data.List (unfoldr)

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let mx = floor (4/eps)
        !k = (mx+1) `quot` 2
    putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k)

leibniz n = (4 *) $ sum $ take n step

step :: [Double]
step = unfoldr phi (True,1) where
   phi (sig,d) | sig         = Just (1/d, (False,d+2))
               | otherwise   = Just (negate (1/d), (True,d+2))
========================================

things are much slower, 23.60s vs. 18.15s, but the via-C route is again 
significantly faster.
...
What I've done for 6.14.1 is to add the -msse2 flag to the x86 backend,
so where previously we had to use -fvia-C -fexcess-precision -optc-O3
etc. to get reasonable floating point performance, now we can use -msse2
with the native code gen and get about the same results.
Can I test whether I get about the same results as with -fvia-C ... for the 
above?
I.e., is it in the HEAD, and would I have to pass -msse2 on the command 
line or is it implied by -O2 already?
...
In the future we have a couple of ways that things could get better:
1. The new back-end, which eventually will incorporate more
     optimisations at the C-- level, and potentially could produce
     good loop code.  It will also free up some registers.
2. Compiling via LLVM.
Dropping the C backend will give us more flexibility with calling
conventions, letting us use more of the x86 registers for passing
arguments.  We can only make this change by removing -fvia-C, though.
There's low hanging fruit here particularly for the x86 backend, as soon
as we drop -fvia-C.
There are other reasons to want to get rid of -fvia-C:
- it doubles the testing surface
- it's associated with a bucketload of grotesque Perl 4 code and
    gcc-specific hacks in the RTS headers.
Cheers,
  Simon

Re: Removing/deprecating -fvia-c

Daniel Fischer