
On Friday, 20 February 2009 at 18:10, Bulat Ziganshin wrote:
Hello Don,
Friday, February 20, 2009, 7:41:33 PM, you wrote:
main = print $ sum[1..10^9::Int]
This won't be comparable to your loop below, as 'sum' is a left fold (which doesn't fuse under build/foldr).
You should use the list implementation from the stream-fusion package (or uvector) if you're expecting it to fuse to the following loop:
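For reference, a minimal sketch of the kind of strict accumulator loop that the fused (or hand-written low-level) version is expected to boil down to; this is illustrative only, not the exact loop quoted in the thread:

{-# LANGUAGE BangPatterns #-}

-- strict accumulator loop; GHC compiles this to a tight unboxed worker
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go !acc !i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)

main :: IO ()
main = print (sumTo (10^9))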
it was a comparison of native Haskell, low-level Haskell (which is harder to write than native C)
Um, not always, and certainly not in cases like this, at least if you call the simple worker loop "low-level Haskell".
and native C. stream-fusion and other packages provide libraries for some tasks, but they can't make maps faster, for example, so I used plain lists
Which is of course comparable with a tight loop in C, isn't it? Really, you hurt your cause by including that. You said you wanted to compare ghc to gcc, then compare what they do to comparable code.
Which seems ... OK.
really? :D
Well, that's a bit different. It doesn't print the result, and it returns a different result on 64-bit....
doesn't matter for testing speed
I don't get anything near the 0.062s, which is interesting.
it was a beautiful gcc optimisation - it added 8 values at once. with xor the results are:
xor.hs        12.605
xor-fast.hs    1.856
xor.cpp        0.339
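For context, one plausible form of the list-based xor benchmark (the exact xor.hs is not shown in the thread, so this is a hedged reconstruction); folding with xor defeats the add-eight-at-once vectorisation gcc applied to the sum:

import Data.Bits (xor)
import Data.List (foldl')

main :: IO ()
main = print $ foldl' xor 0 [1 .. 10^9 :: Int]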
The print statement slows things down, I guess...
do you really believe that printing one number needs so much time? :)
So we have:
ghc -fvia-C -O2           1.127
ghc -fasm                 1.677
gcc -O0                   4.500
gcc -O3 -funroll-loops    0.318
why not compare to ghc -O0? you can also disable loop unrolling in gcc and unroll loops manually in Haskell, or generate asm code on the fly. there are plenty of tricks to "prove" that gcc generates bad code :D
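As an aside, manual unrolling in Haskell is straightforward if tedious; a minimal sketch (illustrative, not code from the thread), stepping the accumulator loop four elements at a time:

{-# LANGUAGE BangPatterns #-}

sumUnrolled :: Int -> Int
sumUnrolled n = go 0 1
  where
    go !acc !i
      | i + 3 <= n = go (acc + i + (i+1) + (i+2) + (i+3)) (i + 4)  -- unrolled step
      | i <= n     = go (acc + i) (i + 1)                          -- remaining tail
      | otherwise  = acc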
That's not what he's doing at all. Sure, he's not comparing Haskell code compiled without optimisations, but he also includes gcc at its highest optimisation level. Read the gcc -O0 figure as an indication of what optimisations can achieve.
So, some lessons: GHC is around 3-4x slower on this tight loop (which isn't as bad as it used to be).
really? what I see is that low-level Haskell code is usually 3 times harder to write and 3 times slower than gcc code.
I deny that low-level Haskell code is three times harder to write than ordinary C code, at least if we consider the worker/wrapper idiom low-level Haskell. It is also my experience that gcc usually creates faster executables from good C code than ghc does from corresponding ordinary Haskell code (not using #-magic), but the margin does vary wildly.
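To make the terminology concrete, here is a sketch of the sort of #-magic alluded to above, i.e. dropping to unboxed Int# by hand; this is illustrative only, and it assumes a modern GHC where (>#) returns Int# and GHC.Exts provides isTrue#:

{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (I#), Int#, isTrue#, (+#), (>#))

-- hand-written unboxed worker loop
sumToU :: Int -> Int
sumToU (I# n) = I# (go 0# 1#)
  where
    go :: Int# -> Int# -> Int#
    go acc i
      | isTrue# (i ># n) = acc
      | otherwise        = go (acc +# i) (i +# 1#)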
native Haskell code is tens to thousands of times slower than C code (just recall that real programs use type classes and monads in addition to laziness)
Okay, tens is realistic, but thousands? Of course, if you compare a tight loop that doesn't allocate to creating thousands of millions of cons cells... Just because lists are easier to use in Haskell than in any other language I know doesn't mean it's necessary to use lists when writing Haskell if other ways are more fitting for the goal.

Just for the record, timings on my machine, gcc-3.3 vs. ghc-6.8.3:

$ ./runtests
Sums in C, first counting up, then down
with -O0
-243309312    real 0m6.751s    user 0m6.660s    sys 0m0.020s
-243309312    real 0m6.318s    user 0m6.190s    sys 0m0.000s
with -O1
-243309312    real 0m2.533s    user 0m2.530s    sys 0m0.010s
-243309312    real 0m1.744s    user 0m1.700s    sys 0m0.000s
with -O2
-243309312    real 0m1.744s    user 0m1.710s    sys 0m0.000s
-243309312    real 0m1.687s    user 0m1.680s    sys 0m0.000s
with -O3
-243309312    real 0m1.753s    user 0m1.720s    sys 0m0.000s
-243309312    real 0m1.701s    user 0m1.700s    sys 0m0.000s

Sums in Haskell
First compiled with -O2, then with -O2 -fvia-C -optc-O3
Using uvector
-243309312    real 0m7.412s    user 0m7.290s    sys 0m0.000s
-243309312    real 0m5.726s    user 0m5.650s    sys 0m0.000s
Loop down with BangPatterns
-243309312    real 0m4.789s    user 0m4.750s    sys 0m0.010s
-243309312    real 0m4.561s    user 0m4.470s    sys 0m0.000s
Loop down without BangPatterns
-243309312    real 0m5.092s    user 0m4.890s    sys 0m0.000s
-243309312    real 0m4.747s    user 0m4.540s    sys 0m0.010s
Loop up (with BangPatterns)
-243309312    real 0m5.511s    user 0m5.320s    sys 0m0.000s
-243309312    real 0m4.449s    user 0m4.410s    sys 0m0.000s
Using strict left fold
-243309312    real 2m45.625s   user 2m41.930s   sys 0m0.260s
-243309312    real 2m43.890s   user 2m41.550s   sys 0m0.280s
Fully naive
-243309312    real 2m45.657s   user 2m42.980s   sys 0m0.250s
-243309312    real 2m42.403s   user 2m40.160s   sys 0m0.370s
Done
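For completeness, hedged reconstructions of two of the slow list-based variants timed above (the exact test programs are not included in the thread, so take these as plausible forms only):

import Data.List (foldl')

-- "Using strict left fold"
strictFold :: Int
strictFold = foldl' (+) 0 [1 .. 10^9]

-- "Fully naive"
naive :: Int
naive = sum [1 .. 10^9]

main :: IO ()
main = print strictFold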