Very fast loops. Now!

The following C program was described on #haskell
#include

dons:
The following C program was described on #haskell
#include
int main() { double x = 1.0/3.0; double y = 3.0; int i = 1; for (; i<=1000000000; i++) { x = x*y/3.0; y = x*9.0; } printf("%f\n", x+y); }
Which was translated to the following Haskell:
{-# OPTIONS -fexcess-precision #-}
import Text.Printf
main = go (1/3) 3 1
go :: Double -> Double -> Int -> IO () go !x !y !i | i == 1000000000 = printf "%f\n" (x+y) | otherwise = go (x*y/3) (x*9) (i+1)
To everyone's surprise, GHC 6.6 beats GCC (3.3.5) here, at least the two test machines:
$ ghc -O -fexcess-precision -fbang-patterns -optc-O3 -optc-ffast-math -optc-mfpmath=sse -optc-msse2 A.hs -o a
$ time ./a 3.333333 ./a 0.96s user 0.01s system 99% cpu 0.969 total ^^^^^
Versus gcc 3.3.5:
$ gcc -O3 -ffast-math -mfpmath=sse -msse2 -std=c99 t.c -o c_loop $ time ./c_loop 3.333333 ./c_loop 1.01s user 0.01s system 97% cpu 1.046 total ^^^^^
Note that newer gcc's will statically compute that loop. Note also that -fexcess-precision must currently be provided as a pragma only.
I declare GHC Haskell numerics (with -fexcess-precision) not so shabby!
GCC 4.x seems to do a much better job, turning the inner loop into: .L2: mulsd %xmm3, %xmm0 mulsd %xmm1, %xmm0 movapd %xmm0, %xmm1 mulsd %xmm2, %xmm1 addl $1, %eax cmpl $100000001, %eax jne .L2 -- Don

GCC 4.x gets a pass on this test. :) You can do (much) better than that, of course. But it's what I'd expect without going over board. -- Lennart On Feb 10, 2007, at 16:45 , Donald Bruce Stewart wrote:
dons:
The following C program was described on #haskell
#include
int main() { double x = 1.0/3.0; double y = 3.0; int i = 1; for (; i<=1000000000; i++) { x = x*y/3.0; y = x*9.0; } printf("%f\n", x+y); }
Which was translated to the following Haskell:
{-# OPTIONS -fexcess-precision #-}
import Text.Printf
main = go (1/3) 3 1
go :: Double -> Double -> Int -> IO () go !x !y !i | i == 1000000000 = printf "%f\n" (x+y) | otherwise = go (x*y/3) (x*9) (i+1)
To everyone's surprise, GHC 6.6 beats GCC (3.3.5) here, at least the two test machines:
$ ghc -O -fexcess-precision -fbang-patterns -optc-O3 -optc- ffast-math -optc-mfpmath=sse -optc-msse2 A.hs -o a
$ time ./a 3.333333 ./a 0.96s user 0.01s system 99% cpu 0.969 total ^^^^^
Versus gcc 3.3.5:
$ gcc -O3 -ffast-math -mfpmath=sse -msse2 -std=c99 t.c -o c_loop $ time ./c_loop 3.333333 ./c_loop 1.01s user 0.01s system 97% cpu 1.046 total ^^^^^
Note that newer gcc's will statically compute that loop. Note also that -fexcess-precision must currently be provided as a pragma only.
I declare GHC Haskell numerics (with -fexcess-precision) not so shabby!
GCC 4.x seems to do a much better job, turning the inner loop into:
.L2: mulsd %xmm3, %xmm0 mulsd %xmm1, %xmm0 movapd %xmm0, %xmm1 mulsd %xmm2, %xmm1 addl $1, %eax cmpl $100000001, %eax jne .L2
-- Don _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On 2/10/07, Donald Bruce Stewart
The following C program was described on #haskell
#include
int main() { double x = 1.0/3.0; double y = 3.0; int i = 1; for (; i<=1000000000; i++) { x = x*y/3.0; y = x*9.0; } printf("%f\n", x+y); }
Which was translated to the following Haskell:
{-# OPTIONS -fexcess-precision #-}
import Text.Printf
main = go (1/3) 3 1
go :: Double -> Double -> Int -> IO () go !x !y !i | i == 1000000000 = printf "%f\n" (x+y) | otherwise = go (x*y/3) (x*9) (i+1)
I've compiled the exactly same programs, passing the same parameters to the compiler as you did, but I've got a very different result: $ ghc -O -fexcess-precision -fbang-patterns -optc-O3 -optc-ffast-math\ -optc-mfpmath=sse -optc-msse2 loop.hs -o haskell_loop $ time ./haskell_loop 3.3333333333333335 real 0m14.092s user 0m14.057s sys 0m0.036s $ ghc --version The Glorious Glasgow Haskell Compilation System, version 6.6 While the haskell program took so long, the C program went really faster than the haskell version: $ gcc -O3 -ffast-math -mfpmath=sse -msse2 -std=c99 loop.c -o c_loop $ time ./c_loop 3.333333 real 0m0.001s user 0m0.000s sys 0m0.000s $ gcc --version gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21) I'm using debian etch (linux), my processor is a pentium 4 3.0ghz, which has sse and sse2 (or at least /proc/cpuinfo says so). As for memory, it probably doesn't matter much in this test, but I have 512mB of ram. In a similar thread that was posted at comp.lang.functional (it was actually regarding ocaml vs C++, you can find it on google groups at http://tinyurl.com/292ps6 ) the same kind of discrepancy occurred. Showing that this kind of benchmarking is usually not very accurate or, at least, that my processor is not well suited for functional languages :P.

On 2/10/07, Rafael Almeida
While the haskell program took so long, the C program went really faster than the haskell version: $ gcc -O3 -ffast-math -mfpmath=sse -msse2 -std=c99 loop.c -o c_loop $ time ./c_loop 3.333333
real 0m0.001s user 0m0.000s sys 0m0.000s $ gcc --version gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
Under gcc (GCC) 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5), the following asm code is generated for part of the main function: mov dword ptr [esp+4], 0aaaaaaabh mov dword ptr [esp+8], 400aaaaah mov dword ptr [esp], data_804858c call wrapper_8049688_80482b4 where data_804858c is "%f\n" and wrapper_8049688_80482b4 is printf. Needless to say that the other argument is exactly the double 3.3333333333333335. In the OP's words, "newer gcc's will statically compute that loop". -- Felipe.

On 2/10/07, Felipe Almeida Lessa
Under gcc (GCC) 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5), the following asm code is generated for part of the main function:
mov dword ptr [esp+4], 0aaaaaaabh mov dword ptr [esp+8], 400aaaaah mov dword ptr [esp], data_804858c call wrapper_8049688_80482b4
where data_804858c is "%f\n" and wrapper_8049688_80482b4 is printf. Needless to say that the other argument is exactly the double 3.3333333333333335. In the OP's words, "newer gcc's will statically compute that loop".
What I found impressive was the time taken by the haskell version, 14s, much more than the haskell and C version of Donald's e-mail.

Yes! That's the code I like. On Feb 10, 2007, at 17:46 , Felipe Almeida Lessa wrote:
On 2/10/07, Rafael Almeida
wrote: While the haskell program took so long, the C program went really faster than the haskell version: $ gcc -O3 -ffast-math -mfpmath=sse -msse2 -std=c99 loop.c -o c_loop $ time ./c_loop 3.333333
real 0m0.001s user 0m0.000s sys 0m0.000s $ gcc --version gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
Under gcc (GCC) 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5), the following asm code is generated for part of the main function:
mov dword ptr [esp+4], 0aaaaaaabh mov dword ptr [esp+8], 400aaaaah mov dword ptr [esp], data_804858c call wrapper_8049688_80482b4
where data_804858c is "%f\n" and wrapper_8049688_80482b4 is printf. Needless to say that the other argument is exactly the double 3.3333333333333335. In the OP's words, "newer gcc's will statically compute that loop".
-- Felipe. _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Hello Rafael, Saturday, February 10, 2007, 8:26:38 PM, you wrote:
I'm using debian etch (linux), my processor is a pentium 4 3.0ghz, which has sse and sse2
Showing that this kind of benchmarking is usually not very accurate or, at least, that my processor is not well suited for functional languages
seems that your processor don't supports FPE :) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

I think gcc 4.x have much better optimizations than 3.x since SSA added.I donot know if there are very good IR for functional language optimization. Is it hard for STG language to analysis this kind of code?> go !x !y !i
| i == 1000000000 = printf "%f\n" (x+y) | otherwise = go (x*y/3) (x*9) (i+1)
btw: I always hope to see a C/C++ language compiler writen by haskell.I think it might be much easier to add a new optimization :)
在2007-02-11,Rafael Almeida

On 2/10/07, Donald Bruce Stewart
To everyone's surprise, GHC 6.6 beats GCC (3.3.5) here, at least the two test machines:
$ ghc -O -fexcess-precision -fbang-patterns -optc-O3 -optc-ffast-math -optc-mfpmath=sse -optc-msse2 A.hs -o a
$ time ./a 3.333333 ./a 0.96s user 0.01s system 99% cpu 0.969 total ^^^^^
Versus gcc 3.3.5:
$ gcc -O3 -ffast-math -mfpmath=sse -msse2 -std=c99 t.c -o c_loop $ time ./c_loop 3.333333 ./c_loop 1.01s user 0.01s system 97% cpu 1.046 total ^^^^^
There's no doubt that GHC is doing well here, but is that really a statistically significant difference? Certainly, it's great that GHC and GCC are doing at least about equally well, but I wouldn't conclude just from that data that GHC *beats* GCC. Cheers, Kirsten -- Kirsten Chevalier* chevalier@alum.wellesley.edu *Often in error, never in doubt "Ninety-nine percent of everything that is done in the world, good and bad, is done to pay a mortgage. The world would be a much better place if everyone rented." -- Christopher Buckley

Donald Bruce Stewart wrote:
The following C program was described on #haskell
#include
int main() { double x = 1.0/3.0; double y = 3.0; int i = 1; for (; i<=1000000000; i++) { x = x*y/3.0; y = x*9.0; } printf("%f\n", x+y); }
Which was translated to the following Haskell:
{-# OPTIONS -fexcess-precision #-}
import Text.Printf
main = go (1/3) 3 1
go :: Double -> Double -> Int -> IO () go !x !y !i | i == 1000000000 = printf "%f\n" (x+y) | otherwise = go (x*y/3) (x*9) (i+1)
Do the two programs implement the same algorithm? The C program updates x and y in sequence. The Haskell program updates x and y in parallel and can be easier for the compiler to optimize.

Eric Willigers wrote:
Do the two programs implement the same algorithm? The C program updates x and y in sequence. The Haskell program updates x and y in parallel and can be easier for the compiler to optimize.
Hi Don, Expressing this in other words, do we want the new y to be based on the new x or on the old x? I extracted the code you posted to CS.c and HP.hs (C sequential, Hashell parallel). I made the following minor changes to form CP.c and HS.hs (C parallel, Hashell sequential):- double xn; for (; i<=1000000000; i++) { xn = x*y/3.0; y = x*9.0; x = xn; } | otherwise = go xs (xs*9) (i+1) where xs = x*y/3 Tested on a 2.8 GHz Pentium 4, running XP SP2 and cygwin, using the compiler options from your post. Each program was run once. $ uname -a CYGWIN_NT-5.1 nemo 1.5.21(0.156/4/2) 2006-07-30 14:21 i686 Cygwin $ gcc --version gcc (GCC) 3.4.4 (cygming special) (gdc 0.12, using dmd 0.125) $ ghc --version The Glorious Glasgow Haskell Compilation System, version 6.6 $ gcc -O3 -ffast-math -mfpmath=sse -msse2 -std=c99 CP.c -o CP $ time ./CP 3.333333 real 0m10.560s user 0m10.546s sys 0m0.015s $ gcc -O3 -ffast-math -mfpmath=sse -msse2 -std=c99 CS.c -o CS $ time ./CS 3.333333 real 0m10.788s user 0m10.718s sys 0m0.015s $ ghc -O -fexcess-precision -fbang-patterns -optc-O3 -optc-ffast-math -optc-mfpmath=sse -optc-msse2 HP.hs -o HP $ time ./HP 3.3333333333333335 real 1m8.550s user 0m0.015s sys 0m0.031s $ ghc -O -fexcess-precision -fbang-patterns -optc-O3 -optc-ffast-math -optc-mfpmath=sse -optc-msse2 HS.hs -o HS $ time ./HS 3.3333333333333335 real 1m9.425s user 0m0.015s sys 0m0.046s The differences between the P and S versions turned out to be incidental on my system. I downloaded GHC a month ago. Should I be running a more recent build or using different compiler options? Regards, Eric.

ewilligers:
Eric Willigers wrote:
Do the two programs implement the same algorithm? The C program updates x and y in sequence. The Haskell program updates x and y in parallel and can be easier for the compiler to optimize.
Hi Don,
Expressing this in other words, do we want the new y to be based on the new x or on the old x?
I extracted the code you posted to CS.c and HP.hs (C sequential, Hashell parallel).
I made the following minor changes to form CP.c and HS.hs (C parallel, Hashell sequential):-
double xn; for (; i<=1000000000; i++) { xn = x*y/3.0; y = x*9.0; x = xn; }
| otherwise = go xs (xs*9) (i+1) where xs = x*y/3
Tested on a 2.8 GHz Pentium 4, running XP SP2 and cygwin, using the compiler options from your post. Each program was run once.
$ uname -a CYGWIN_NT-5.1 nemo 1.5.21(0.156/4/2) 2006-07-30 14:21 i686 Cygwin
$ gcc --version gcc (GCC) 3.4.4 (cygming special) (gdc 0.12, using dmd 0.125)
$ ghc --version The Glorious Glasgow Haskell Compilation System, version 6.6
$ gcc -O3 -ffast-math -mfpmath=sse -msse2 -std=c99 CP.c -o CP
$ time ./CP 3.333333
real 0m10.560s user 0m10.546s sys 0m0.015s
$ gcc -O3 -ffast-math -mfpmath=sse -msse2 -std=c99 CS.c -o CS
$ time ./CS 3.333333
real 0m10.788s user 0m10.718s sys 0m0.015s
$ ghc -O -fexcess-precision -fbang-patterns -optc-O3 -optc-ffast-math -optc-mfpmath=sse -optc-msse2 HP.hs -o HP
$ time ./HP 3.3333333333333335
real 1m8.550s user 0m0.015s sys 0m0.031s
$ ghc -O -fexcess-precision -fbang-patterns -optc-O3 -optc-ffast-math -optc-mfpmath=sse -optc-msse2 HS.hs -o HS
$ time ./HS 3.3333333333333335
real 1m9.425s user 0m0.015s sys 0m0.046s
-fexcess-precision *must* be provided as a pragma,
{-# OPTIONS -fexcess-precision #-}
import Text.Printf
main = go (1/3) 3 1
go :: Double -> Double -> Int -> IO ()
go !x !y !i
| i == 1000000000 = printf "%.6f\n" (x+y)
| otherwise = go xn (xn*9) (i+1)
where xn = x*y/3
C source:
#include

On Tue, Feb 13, 2007 at 08:18:25AM +1100, Donald Bruce Stewart wrote:
Now, if we rewrite it to not use the temporary:
go :: Double -> Double -> Int -> IO () go !x !y !i | i == 1000000000 = printf "%.6f\n" (x+y) | otherwise = go (x*y/3) (x*9) (i+1)
for (; i<1000000000; i++) { x = x*y/3.0; y = x*9.0; }
------------------------------------------------------------------------
$ time ./hp 3.333333 ./hp 9.95s user 0.00s system 99% cpu 9.965 total
$ time ./cc 3.333333 ./cc 10.06s user 0.00s system 99% cpu 10.110 total
I'm rather curious (if you're sill interested) how this'll be affected by the removal of the division from the inner loop. e.g. go :: Double -> Double -> Int -> IO () go !x !y !i | i == 1000000000 = printf "%.6f\n" (x+y) | otherwise = go (x*y*(1.0/3)) (x*9) (i+1) for (; i<1000000000; i++) { x = x*y*(1.0/3.0); y = x*9.0; } My guess is that the code will be far faster, and that the differences between C and Haskell will therefore be more pronounced. After all, division is a slow operation... -- David Roundy Department of Physics Oregon State University

David Roundy wrote:
I'm rather curious (if you're sill interested) how this'll be affected by the removal of the division from the inner loop. e.g.
go :: Double -> Double -> Int -> IO () go !x !y !i | i == 1000000000 = printf "%.6f\n" (x+y) | otherwise = go (x*y*(1.0/3)) (x*9) (i+1)
for (; i<1000000000; i++) { x = x*y*(1.0/3.0); y = x*9.0; }
GCC will do the transformation itself if you use -ffast-math. It requires the flag as the results aren't exactly numerically equivalent.

On Mon, Feb 12, 2007 at 02:25:21PM -0800, Bryan O'Sullivan wrote:
David Roundy wrote:
I'm rather curious (if you're sill interested) how this'll be affected by the removal of the division from the inner loop. e.g.
go :: Double -> Double -> Int -> IO () go !x !y !i | i == 1000000000 = printf "%.6f\n" (x+y) | otherwise = go (x*y*(1.0/3)) (x*9) (i+1)
for (; i<1000000000; i++) { x = x*y*(1.0/3.0); y = x*9.0; }
GCC will do the transformation itself if you use -ffast-math. It requires the flag as the results aren't exactly numerically equivalent.
Ah, okay. Never mind, then. -- David Roundy Department of Physics Oregon State University

On Sun, 11 Feb 2007, Donald Bruce Stewart wrote:
The following C program was described on #haskell
#include
int main() { double x = 1.0/3.0; double y = 3.0; int i = 1; for (; i<=1000000000; i++) { x = x*y/3.0; y = x*9.0; } printf("%f\n", x+y); }
Which was translated to the following Haskell:
{-# OPTIONS -fexcess-precision #-}
import Text.Printf
main = go (1/3) 3 1
go :: Double -> Double -> Int -> IO () go !x !y !i | i == 1000000000 = printf "%f\n" (x+y) | otherwise = go (x*y/3) (x*9) (i+1)
No one doubts, that it is possible to write efficient code in Haskell. But how does idiomatic Haskell code perform?
participants (11)
-
Bryan O'Sullivan
-
Bulat Ziganshin
-
David Roundy
-
dons@cse.unsw.edu.au
-
Eric Willigers
-
Felipe Almeida Lessa
-
Henning Thielemann
-
Kirsten Chevalier
-
Lennart Augustsson
-
Ouyang Jian
-
Rafael Almeida