
I'm working through one of Don Stewart's many excellent articles ... http://cgi.cse.unsw.edu.au/~dons/blog/2008/06/04#fast-fusion

I faithfully re-created the source of his initial GHC reference implementation as:

    import System.Environment
    import Text.Printf

    mean :: Double -> Double -> Double
    mean n m = go 0 0 n
      where
        go :: Double -> Int -> Double -> Double
        go s l x
            | x > m     = s / fromIntegral l
            | otherwise = go (s+x) (l+1) (x+1)

    main = do
        [d] <- map read `fmap` getArgs
        printf "%f\n" (mean 1 d)

Then, compiled and executed like this:

    C:\Documents and Settings\Travis\My Documents\Haskell Code>ghc -O2 biglistmean.hs -optc-O2 -fvia-C --make -fforce-recomp
    [1 of 1] Compiling Main             ( biglistmean.hs, biglistmean.o )
    Linking biglistmean.exe ...

    C:\Documents and Settings\Travis\My Documents\Haskell Code>biglistmean 1000000
    500000.5

    C:\Documents and Settings\Travis\My Documents\Haskell Code>biglistmean 10000000
    5000000.5

    C:\Documents and Settings\Travis\My Documents\Haskell Code>biglistmean 100000000
    50000000.5

    C:\Documents and Settings\Travis\My Documents\Haskell Code>biglistmean 1000000000
    500000000.067109

On the final test of 10^9, Don reports that it took 1.76 secs on his machine. In contrast, just 10^8 takes 12.63 secs on my machine (sophisticatedly timed with handheld stopwatch), and on the coup de grace 10^9 test, it takes 2 min 04 secs. Yikes! My hardware is a little old (Win XP on a Pentium 4 3.06GHz with 2 GB RAM) but not THAT old. I'm using the latest Haskell Platform, which includes GHC 6.10.4.

Primary question: What gives here?

Incidental questions: Is there a nice way to time executed code on Windows, a la the "time" command Don shows under Linux? Also, does the ordering of the compiler flags have any impact? (I hope not, but I don't want to be surprised ...)

Thanks,

Travis Erdman

On Saturday, 6 March 2010 00:20:52, Travis Erdman wrote:
> I'm working through one of Don Stewart's many excellent articles ...
> http://cgi.cse.unsw.edu.au/~dons/blog/2008/06/04#fast-fusion
>
> I faithfully re-created the source of his initial GHC reference implementation as: <snip>
>
> Then, compiled and executed like this:
>
> C:\Documents and Settings\Travis\My Documents\Haskell Code>ghc -O2 biglistmean.hs -optc-O2 -fvia-C --make -fforce-recomp
> [1 of 1] Compiling Main             ( biglistmean.hs, biglistmean.o )
> Linking biglistmean.exe ...
Not the best combination of options, for me at least. On my box, that is approximately 35% slower than -O2 with the native code generator.
> On the final test of 10^9, Don reports that it took 1.76 secs on his machine.
Well, Don has a super fast 64-bit thingy; on normal machines, all code runs much slower than on Don's :)
> In contrast, just 10^8 takes 12.63 secs on my machine
But not that much slower, ouch.

On my machine, 10^8 takes
  ~3.8s  compiled with -O2 -fvia-C -optc-O2 [or -optc-O3, doesn't make a difference]
  ~2.8s  compiled with -O2 [with and without -fexcess-precision]
  ~1.18s compiled with -O2 -fexcess-precision -fvia-C -optc-O3

Floating point arithmetic compiled via C profits greatly from -fexcess-precision (well, at least on my system, YMMV). Alas, equivalent gcc-compiled C code takes only 0.35s for 10^8 (0.36s with icc). Multiply all timings by 10 for 10^9.
> (sophisticatedly timed with handheld stopwatch) and on the coup de grace 10^9 test, it takes 2min:04secs. Yikes! My hardware is a little old (Win XP on Pentium 4 3.06GHz w 2 GB RAM) but not THAT old. I'm using the latest Haskell Platform which includes ghc v 6.10.4.
I also have a 3.06GHz P4 (2 cores, 1 GB RAM), running openSuSE 11.1 with ghc-6.12.1 and ghc-6.10.3 (no difference between 6.10 and 6.12 for this loop). The P4 isn't particularly fast, unfortunately.
> Primary question: What gives here?
GCC on XP sucks. Big time, AFAIK. Compile your stuff once via C and once with the native code generator and compare. I think you'll almost always find the NCG faster, sometimes by a lot.
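For instance, with the file from the first message, that comparison might look like this (a minimal sketch; the flags are the ones already used in this thread, and the output names are arbitrary):

    ghc -O2 --make -fforce-recomp biglistmean.hs -o biglistmean-ncg
    ghc -O2 -fvia-C -optc-O2 --make -fforce-recomp biglistmean.hs -o biglistmean-viac

Then run both binaries with the same argument and compare the times.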
> Incidental questions: Is there a nice way to time executed code in Windows ala the "time" command Don shows under Linux?
There's timeit.exe, as linked to in http://channel9.msdn.com/forums/Coffeehouse/258979-Windows-equivalent-of-UnixLinux-time-command/
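If you'd rather not depend on an external tool, a minimal timing wrapper can be written in Haskell itself (a sketch using getCPUTime from base; note it measures CPU time, not wall-clock time):

    import System.CPUTime (getCPUTime)
    import Text.Printf (printf)

    -- Run an IO action and report the CPU time it consumed.
    -- getCPUTime returns picoseconds, hence the division by 10^12.
    timeIt :: IO a -> IO a
    timeIt act = do
        start <- getCPUTime
        r <- act
        end <- getCPUTime
        printf "CPU time: %.3f sec\n"
               (fromIntegral (end - start) / 1e12 :: Double)
        return r

With that, main could be written as, e.g., main = timeIt (printf "%f\n" (mean 1 1e8)).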
> Also, does the ordering of the compiler flags have any impact (I hope not, but I don't want to be surprised ...)
Depends. If you give conflicting options, the last takes precedence (unless some combination gives an error; I don't know if that happens). If the options aren't conflicting, the order doesn't matter.
> Thanks,
> Travis Erdman

For the record, I'm adding my numbers to the pool:

Calling "bigmean1.hs" the first piece of code (the recursive version) and "bigmean2.hs" the second (the one using 'foldU'), I compiled four versions of the two and timed them while they computed the mean of [1..1e9]. Here are the results:

MY SYSTEM (512 MB RAM, Mobile AMD Sempron(tm) 3400+ proc [1 core]) (your run-of-the-mill Ubuntu laptop):

~$ uname -a
Linux dy-book 2.6.31-19-generic #56-Ubuntu SMP Thu Jan 28 01:26:53 UTC 2010 i686 GNU/Linux
~$ ghc -V
The Glorious Glasgow Haskell Compilation System, version 6.12.1

RUN 1 - C generator, without excess-precision

~$ ghc -o bigmean1 --make -fforce-recomp -O2 -fvia-C -optc-O3 bigmean1.hs
~$ ghc -o bigmean2 --make -fforce-recomp -O2 -fvia-C -optc-O3 bigmean2.hs
~$ time ./bigmean1 1e9
500000000.067109

real    0m47.685s
user    0m47.655s
sys     0m0.000s

~$ time ./bigmean2 1e9
500000000.067109

real    1m4.696s
user    1m4.324s
sys     0m0.028s

RUN 2 - default generator, no excess-precision

~$ ghc --make -O2 -fforce-recomp -o bigmean2-noC bigmean2.hs
~$ ghc --make -O2 -fforce-recomp -o bigmean1-noC bigmean1.hs
~$ time ./bigmean1-noC 1e9
500000000.067109

real    0m16.571s
user    0m16.493s
sys     0m0.012s

~$ time ./bigmean2-noC 1e9
500000000.067109

real    0m27.146s
user    0m27.086s
sys     0m0.004s

RUN 3 - C generator, with excess-precision

~$ ghc --make -fforce-recomp -O2 -fvia-C -optc-O3 -fexcess-precision -o bigmean1-precis bigmean1.hs
~$ ghc --make -fforce-recomp -O2 -fvia-C -optc-O3 -fexcess-precision -o bigmean2-precis bigmean2.hs
~$ time ./bigmean1-precis 1e9
500000000.067109

real    0m11.937s
user    0m11.841s
sys     0m0.012s

~$ time ./bigmean2-precis 1e9
500000000.067109

real    0m17.105s
user    0m17.081s
sys     0m0.004s

RUN 4 - default generator, with excess-precision

~$ ghc --make -fforce-recomp -O2 -fexcess-precision -o bigmean1-precis bigmean1.hs
~$ ghc --make -fforce-recomp -O2 -fexcess-precision -o bigmean2-precis bigmean2.hs
~$ time ./bigmean1-precis 1e9
500000000.067109

real    0m16.521s
user    0m16.413s
sys     0m0.008s

~$ time ./bigmean2-precis 1e9
500000000.067109

real    0m27.381s
user    0m27.190s
sys     0m0.016s

CONCLUSIONS:
· Big difference between the two versions (recursive and fusion-oriented). I checked by compiling with -ddump-simpl-stats, and the rule mentioned in Don's article IS being fired (streamU/unstreamU) once. The recursive expression of the algorithm is quite a bit faster.
· Big gain adding the excess-precision flag to the compile step, even if not using the C code generator.
· The best time is achieved compiling through the C generator, with the excess-precision flag; second best (5 seconds away in execution) is adding the same flag to the default generator.

I didn't know of -fexcess-precision. It really makes a BIG difference to number-cruncher modules :D

On Sat, 06-03-2010 at 01:36 +0100, Daniel Fischer wrote:
<snip>
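For reference, bigmean2.hs (the foldU version) never appears in this thread. A minimal sketch of what it plausibly looked like, assuming the uvector-style API from Don's article (Data.Array.Vector with foldlU and enumFromToFracU) and a strict pair accumulator (the exact original may differ):

    import System.Environment
    import Text.Printf
    import Data.Array.Vector   -- assumed: the uvector package

    -- Strict pair accumulator: running sum and element count.
    data P = P !Double !Int

    -- Mean of [n..m], written against the fusible uvector combinators.
    mean :: Double -> Double -> Double
    mean n m = s / fromIntegral l
      where
        P s l = foldlU k (P 0 0) (enumFromToFracU n m)
        k (P s' l') x = P (s' + x) (l' + 1)

    main :: IO ()
    main = do
        [d] <- map read `fmap` getArgs
        printf "%f\n" (mean 1 d)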

On Saturday, 6 March 2010 19:50:46, MAN wrote:
> For the record, I'm adding my numbers to the pool:
>
> Calling "bigmean1.hs" the first piece of code (the recursive version) and "bigmean2.hs" the second (the one using 'foldU'), I compiled four versions of the two and timed them while they computed the mean of [1..1e9]. Here are the results:
>
> MY SYSTEM (512 MB RAM, Mobile AMD Sempron(tm) 3400+ proc [1 core]) (your run-of-the-mill Ubuntu laptop):
> ~$ uname -a
> Linux dy-book 2.6.31-19-generic #56-Ubuntu SMP Thu Jan 28 01:26:53 UTC 2010 i686 GNU/Linux
> ~$ ghc -V
> The Glorious Glasgow Haskell Compilation System, version 6.12.1
>
> RUN 1 - C generator, without excess-precision
>
> ~$ ghc -o bigmean1 --make -fforce-recomp -O2 -fvia-C -optc-O3 bigmean1.hs
> ~$ ghc -o bigmean2 --make -fforce-recomp -O2 -fvia-C -optc-O3 bigmean2.hs
> ~$ time ./bigmean1 1e9
> 500000000.067109
>
> real    0m47.685s
> user    0m47.655s
> sys     0m0.000s
>
> ~$ time ./bigmean2 1e9
> 500000000.067109
>
> real    1m4.696s
> user    1m4.324s
> sys     0m0.028s
>
> RUN 2 - default generator, no excess-precision
>
> ~$ ghc --make -O2 -fforce-recomp -o bigmean2-noC bigmean2.hs
> ~$ ghc --make -O2 -fforce-recomp -o bigmean1-noC bigmean1.hs
> ~$ time ./bigmean1-noC 1e9
> 500000000.067109
>
> real    0m16.571s
> user    0m16.493s
> sys     0m0.012s
That's pretty good (not in comparison to Don's times, but in comparison to the other timings).
> ~$ time ./bigmean2-noC 1e9
> 500000000.067109
>
> real    0m27.146s
> user    0m27.086s
> sys     0m0.004s
That's roughly the time I get with -O2 and the NCG, 27.3s for the explicit recursion, 25.9s for the stream-fusion. However, I can bring the explicit recursion down to 24.8s by reordering the parameters:

    mean :: Double -> Double -> Double
    mean n m = go 0 n 0
      where
        go :: Int -> Double -> Double -> Double
        go l x s
            | x > m     = s / fromIntegral l
            | otherwise = go (l+1) (x+1) (s+x)

(or up to 40.8s by making the Int the last parameter). I had no idea the ordering of the parameters could have such a big impact even in simple cases like this.

Anyway, the difference between NCG and via-C (without excess-precision) on your system is astonishingly large. What version of GCC do you have (mine is 4.3.2)?
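One way to investigate where such a difference comes from (a suggestion, not something from the original thread) is to dump the Core GHC produces for each parameter order and diff the inner loops; -ddump-simpl is a standard GHC flag, and the file names here are arbitrary:

    ghc -O2 -fforce-recomp -ddump-simpl bigmean1.hs > core-original.txt
    ghc -O2 -fforce-recomp -ddump-simpl bigmean1-reordered.hs > core-reordered.txt

At -O2 the worker loop should show up as a local function over unboxed Double# and Int# arguments, so the two dumps make the argument arrangement visible.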
> RUN 3 - C generator, with excess-precision
>
> ~$ ghc --make -fforce-recomp -O2 -fvia-C -optc-O3 -fexcess-precision -o bigmean1-precis bigmean1.hs
> ~$ ghc --make -fforce-recomp -O2 -fvia-C -optc-O3 -fexcess-precision -o bigmean2-precis bigmean2.hs
> ~$ time ./bigmean1-precis 1e9
> 500000000.067109
>
> real    0m11.937s
> user    0m11.841s
> sys     0m0.012s
Roughly the same time here, for both the explicit recursion and the stream-fusion.
> ~$ time ./bigmean2-precis 1e9
> 500000000.067109
>
> real    0m17.105s
> user    0m17.081s
> sys     0m0.004s
>
> RUN 4 - default generator, with excess-precision
>
> ~$ ghc --make -fforce-recomp -O2 -fexcess-precision -o bigmean1-precis bigmean1.hs
> ~$ ghc --make -fforce-recomp -O2 -fexcess-precision -o bigmean2-precis bigmean2.hs
> ~$ time ./bigmean1-precis 1e9
> 500000000.067109
>
> real    0m16.521s
> user    0m16.413s
> sys     0m0.008s
>
> ~$ time ./bigmean2-precis 1e9
> 500000000.067109
>
> real    0m27.381s
> user    0m27.190s
> sys     0m0.016s
NCG, -O2:
  Fusion:             25.86user 0.05system 0:25.91elapsed 100%CPU
  Explicit:           27.34user 0.02system 0:27.48elapsed  99%CPU
  Explicit reordered: 24.84user 0.00system 0:24.91elapsed  99%CPU

NCG, -O2 -fexcess-precision:
  Fusion:             25.84user 0.00system 0:25.86elapsed  99%CPU
  Explicit:           27.32user 0.02system 0:27.41elapsed  99%CPU
  Explicit reordered: 24.86user 0.00system 0:24.86elapsed 100%CPU

-O2 -fvia-C -optc-O3: [1]
  Fusion:             38.44user 0.01system 0:38.45elapsed  99%CPU
                      24.92user 0.00system 0:24.92elapsed 100%CPU
  Explicit:           37.50user 0.02system 0:37.53elapsed  99%CPU
                      26.61user 0.00system 0:26.61elapsed  99%CPU
  Explicit reordered: 38.13user 0.00system 0:38.14elapsed 100%CPU
                      24.94user 0.02system 0:24.96elapsed 100%CPU

-O2 -fexcess-precision -fvia-C -optc-O3:
  Fusion:             11.90user 0.01system 0:11.92elapsed  99%CPU
  Explicit:           11.80user 0.00system 0:11.86elapsed  99%CPU
  Explicit reordered: 11.81user 0.00system 0:11.81elapsed 100%CPU
> CONCLUSIONS:
> · Big difference between the two versions (recursive and fusion-oriented).
Odd. It shouldn't be a big difference, and here it isn't. Both should compile to almost the same machine code. [However, the ordering of the parameters matters; you might try to shuffle them around a bit and see what that gives. If I swap the Int and the Double in the strict pair of the fusion code, I get a drastic performance penalty; perhaps you'll gain performance thus.]
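Relative to the sketch of the fusion code given earlier in the thread (itself a reconstruction), the swap described above would look like this (hypothetical, for illustration only):

    import Data.Array.Vector   -- assumed uvector API, as in the earlier sketch

    -- Strict pair with the fields swapped: count first, sum second.
    data P = P !Int !Double

    mean :: Double -> Double -> Double
    mean n m = s / fromIntegral l
      where
        P l s = foldlU k (P 0 0) (enumFromToFracU n m)
        k (P l' s') x = P (l' + 1) (s' + x)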
> I checked by compiling with -ddump-simpl-stats, and the rule mentioned in Don's article IS being fired (streamU/unstreamU) once. The recursive expression of the algorithm is quite a bit faster.
> · Big gain adding the excess-precision flag to the compile step, even if not using the C code generator.
I think you looked at the wrong numbers there; for the native code generator, the times with and without -fexcess-precision are very close, both for explicit recursion and fusion.
> · The best time is achieved compiling through the C generator, with the excess-precision flag; second best (5 seconds away in execution) is adding
Yes. If you are doing lots of floating-point operations and compile via C, better tell the C compiler that it shouldn't truncate every single intermediate result to 64-bit doubles; that takes time. There are two ways to do that: you can tell GHC that you don't want to truncate (-fexcess-precision), and then GHC tells the C compiler [gcc], or you can tell gcc directly [well, that's via GHC's command line too :)] by using -optc-fno-float-store. (Concrete command lines for both routes are sketched after this message.) For the NCG, -fexcess-precision doesn't seem to make a difference (at least with Doubles; it may make a big difference with Floats).
> the same flag to the default generator.
> I didn't know of -fexcess-precision. It really makes a BIG difference to number-cruncher modules :D
Via C.

[1] This is really irritating. These timings come from the very same binaries, and I haven't noticed such behaviour from any of my other programmes. Normally, these programmes take ~38s, but every now and then there's a run taking ~25/26s. The times for the slower runs are pretty stable, and the times for the fast runs are pretty stable (a few hundredths of a second difference). Of course, the running time of a programme (for the same input) depends on processor load, how many other processes want how many of the registers and such, but I would expect much less regular timings from those effects. Baffling.
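Concretely, the two routes mentioned above would be (a sketch; both flag sets come straight from this thread, and the file name is just an example):

    ghc -O2 -fexcess-precision -fvia-C -optc-O3 --make -fforce-recomp bigmean1.hs
    ghc -O2 -fvia-C -optc-O3 -optc-fno-float-store --make -fforce-recomp bigmean1.hs

Either way, gcc should stop truncating intermediate results to 64-bit doubles.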

To answer your question, I run gcc-4.4.1 (the default with Ubuntu 9.10).

I took your advice and made a few more tests. After reordering both the recursive and the stream-fusion oriented versions, I compiled and tested as follows:

FOR THE NCG, WITH excess-precision ON:

~$ ghc --make -fforce-recomp -O2 -fexcess-precision -o bm1-reordered bigmean1-reordered.hs
~$ ghc --make -fforce-recomp -O2 -fexcess-precision -o bm2-reordered bigmean2-reordered.hs
~$ time ./bm1-reordered 10e8
500000000.067109

real    0m13.330s
user    0m13.285s
sys     0m0.004s

~$ time ./bm2-reordered 10e8
500000000.067109

real    0m23.473s
user    0m23.433s
sys     0m0.008s

[[ Recall that the previous times were:

~$ time ./bigmean1-precis 1e9
500000000.067109

real    0m16.521s
user    0m16.413s
sys     0m0.008s

~$ time ./bigmean2-precis 1e9
500000000.067109

real    0m27.381s
user    0m27.190s
sys     0m0.016s
]]

TURNING excess-precision OFF, TO SEE ITS IMPACT ON THE IMPROVEMENT:

~$ ghc --make -fforce-recomp -O2 -o bm1-reordered-noEP bigmean1-reordered.hs
~$ ghc --make -fforce-recomp -O2 -o bm2-reordered-noEP bigmean2-reordered.hs
~$ time ./bm1-reordered-noEP 10e8
500000000.067109

real    0m13.306s
user    0m13.277s
sys     0m0.004s

~$ time ./bm2-reordered-noEP 10e8
500000000.067109

real    0m23.523s
user    0m23.441s
sys     0m0.000s

Which is great! This way of compiling is much more comfortable. It is still odd that swapping the types around has such an impact on the performance of both... Any ideas?

I then tried the same code, compiling with '-fvia-C -optc-O3 -fexcess-precision', and obtained the following (smoking hot) results:

~$ time ./bm1-reord-C 10e8
500000000.067109

real    0m9.630s
user    0m9.617s
sys     0m0.000s

~$ time ./bm2-reord-C 10e8
500000000.067109

real    0m17.837s
user    0m17.769s
sys     0m0.028s

[[ Recall that the previous times for this same compilation run were:

~$ time ./bigmean1-precis 1e9
500000000.067109

real    0m11.937s
user    0m11.841s
sys     0m0.012s

~$ time ./bigmean2-precis 1e9
500000000.067109

real    0m17.105s
user    0m17.081s
sys     0m0.004s
]]

So the improvement is not so evident here on the fusion code, but the recursive implementation is notably faster. It seems every time Daniel suggests some little change, times drop a couple of secs... so... any more ideas? :D

Seriously now, why does argument order matter so much? More importantly: is this common and predictable? Should I start putting all my Int params at the front of the type signature?

Thanks for the tips, btw; I've learned a couple of very important things as I re-read this thread.

Elvio.

On Sat, 06-03-2010 at 23:25 +0100, Daniel Fischer wrote:
<snip>
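One more thing worth trying while experimenting with argument orders (an aside, with no guarantee it changes anything here, since GHC's strictness analyser usually finds this on its own at -O2): make the strictness of the accumulators explicit with bang patterns, so the loop's unboxing does not depend on the analyser at all:

    {-# LANGUAGE BangPatterns #-}

    mean :: Double -> Double -> Double
    mean n m = go 0 0 n
      where
        go :: Double -> Int -> Double -> Double
        go !s !l !x
            | x > m     = s / fromIntegral l
            | otherwise = go (s + x) (l + 1) (x + 1)

This doesn't answer why the argument order matters, but it rules out strictness as a variable when shuffling the parameters around.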

A little something I should have done before: I just compiled Don's C code (modified for my 32-bit laptop) with all flavors of precision. All the timings were similar:

~$ gcc bigmean.c -o bgC
~$ chmod +x bgC
~$ ./bgC 1000000000
~$ time ./bgC 1000000000
500000000.067109

real    0m8.585s
user    0m8.553s
sys     0m0.000s

So the times I've been getting approximate C speed quite well. There's still the difference between the recursive and the fusion code that I haven't been able to close...

I'm sort of jumping into this conversation late, and I'm definitely a Haskell newbie, but I have to wonder if the speed differences don't have something to do with C argument-passing conventions. I know there's some rule that says if your first couple of args are ints, they're passed in CPU registers, which might explain some of the speed boost from putting them first. I need to go dig through my x86 reference manuals to get the exact rules, though.
-R. Kyle Murphy
--
Curiosity was framed, Ignorance killed the cat.
On Sun, Mar 7, 2010 at 22:40, MAN wrote:
<snip>
participants (4)

- Daniel Fischer
- Kyle Murphy
- MAN
- Travis Erdman