
There is a new combined benchmark, "partial sums" that subsumes several earlier benchmarks and runs 9 different numerical calculations: http://haskell.org/hawiki/PartialSumsEntry I took the Clean entry and, since there were so many recursive computations, I made compute and altCompute higher order functions. No speed optimization work has been done yet, but it seems really fast. A new http://haskell.org/hawiki/KnucleotideEntry has been accepted, replacing the one with the memory leak. The speed is also better, but still poor, due to Data.Hashtable. GHC 6.6.x should eventually help. Also on the topic of poor speed: http://haskell.org/hawiki/NbodyEntry At the bottom of the page is Einar's really cool existing entry (open mutable records). At the top of the page is best effort to replace it, but it runs at almost the same speed. Which is very much slower (10x) than other entries, such as OCaml. It is odd that small scale numerics, such as the partial-sums functions, can be made fast but not medium scale numerics like the n-body integration. There are 5 fixed mass parameters and 30 variables (5*3 positions + 5*3 velocities). Addition, Subtraction, Multiplication, Division. And we can't get it going nearly as fast as it ought to run. Note: The garbage collector is not a factor (%GC time 0.9%). Like the Hashtable glitch, if this were part of real program then a small bit of c-code could be attached via FFI. Are there little known Haskell idioms for doing math? Do we have to write a recursive function with 30 separate parameters? Do we have to peek and poke unboxed doubles? Is there an undocumented -funleash-fortran parameter? Will this be fixable in Haskell' ("Haskell-Prime")?

haskell:
There is a new combined benchmark, "partial sums" that subsumes several earlier benchmarks and runs 9 different numerical calculations:
Ah! I had an entry too. I've posted it on the wiki. I was careful to watch that all loops are compiled into nice unboxed ones in the Core. It seems to run a little bit faster than your more abstracted code. Timings on the page. Also, -fasm seems to only be a benefit on the Mac, as you've pointed out previously. Maybe you could check the times on the Mac too? -- Don

Donald Bruce Stewart wrote:
haskell:
There is a new combined benchmark, "partial sums" that subsumes several earlier benchmarks and runs 9 different numerical calculations:
Ah! I had an entry too. I've posted it on the wiki. I was careful to watch that all loops are compiled into nice unboxed ones in the Core. It seems to run a little bit faster than your more abstracted code.
Timings on the page.
Also, -fasm seems to only be a benefit on the Mac, as you've pointed out previously. Maybe you could check the times on the Mac too?
-- Don
Yeah. I had not tried all the compiler options. Using -fasm is slower on this for me as well. I suspect that since your code will beat the entries that have been posted so far, so I thin you should submit it. Also, could you explain how to check the Core (un)boxing in a note on the (new?) wiki? I would be interested in learning that trick. -- Chris

haskell:
Donald Bruce Stewart wrote:
haskell:
There is a new combined benchmark, "partial sums" that subsumes several earlier benchmarks and runs 9 different numerical calculations:
Ah! I had an entry too. I've posted it on the wiki. I was careful to watch that all loops are compiled into nice unboxed ones in the Core. It seems to run a little bit faster than your more abstracted code.
Timings on the page.
Also, -fasm seems to only be a benefit on the Mac, as you've pointed out previously. Maybe you could check the times on the Mac too?
-- Don
Yeah. I had not tried all the compiler options. Using -fasm is slower on this for me as well. I suspect that since your code will beat the entries that have been posted so far, so I thin you should submit it.
ok, I'll submit it.
Also, could you explain how to check the Core (un)boxing in a note on the (new?) wiki? I would be interested in learning that trick.
Ah, i just do: ghc A.hs -O2 -ddump-simpl | less and then read the Core, keeping an eye on the functions I'm interested in, and checking they're compiling to the kind of loops I'd write by hand. This is particularly useful for the kinds of tight numeric loops used in some of the shootout entries. Cheers, Don

On 1/26/06, Donald Bruce Stewart
Ah, i just do: ghc A.hs -O2 -ddump-simpl | less and then read the Core, keeping an eye on the functions I'm interested in, and checking they're compiling to the kind of loops I'd write by hand. This is particularly useful for the kinds of tight numeric loops used in some of the shootout entries.
Cheers, Don
In that case could you describe the kind of loops you'd write by hand? Seriously. And perhaps typical problems/fixes when the compiler doesn't produce what you want. Thanks, Joel

Joel Koerwer wrote:
On 1/26/06, *Donald Bruce Stewart*
mailto:dons@cse.unsw.edu.au> wrote: Ah, i just do: ghc A.hs -O2 -ddump-simpl | less and then read the Core, keeping an eye on the functions I'm interested in, and checking they're compiling to the kind of loops I'd write by hand. This is particularly useful for the kinds of tight numeric loops used in some of the shootout entries.
Cheers, Don
In that case could you describe the kind of loops you'd write by hand?
See below for the pseudo-code loop and the Haskell version.
Seriously. And perhaps typical problems/fixes when the compiler doesn't produce what you want.
We don't have any fixes.
Thanks, Joel
More discussion and code is at http://haskell.org/hawiki/NbodyEntry The compiler produces code that runs 4 times slower than OCaml in our current best attempt at programming against a 40 element (IOUArray Int Double). The final programs speed is very architecture dependent, but more frustrating is that small referentially transparent changes to the source code produce up to factor-of-two fluctuations in run time. The small numeric functions in the shootout, where there is a recursive function with 1 or 2 parameters (Double's), perform quite well. But manipulating this medium number of Double's to model the solar system has been too slow. The main loop for the 5 planets looks quite simple in pseudo-c: deltaTime = 0.01 for (i=0 ; i<5; ++i) { "get mass m, position (x,y,z), velocity (vx,vy,vz) of particle number i" for (j=(i+1); j<5; ++j) { "get mass, position, velocity of particle j" dxyx = "position of i" - "position of j" mag = deltaTime /(length of dxyz)^3 "velocity of j" += "mass of i" * mag * dxyz "velocity of i" -= "mass of j" * mag * dxyz } "position of i" += deltaTime * "velocity of i" } Note that the inner loop "for j" starts a "j=(i+1)". The best performing Haskell code, for this loop, so far is: -- Offsets for each field x = 0; y = 1; z = 2; vx= 3; vy= 4; vz= 5; m = 6 -- This is the main code. Essentially all the time is spent here advance n = when (n > 0) $ updateVel 0 >> advance (pred n) where updateVel i = when (i <= nbodies) $ do let i' = (.|. shift i 3) im <- unsafeRead b (i' m) ix <- unsafeRead b (i' x) iy <- unsafeRead b (i' y) iz <- unsafeRead b (i' z) ivx <- unsafeRead b (i' vx) ivy <- unsafeRead b (i' vy) ivz <- unsafeRead b (i' vz) let updateVel' ivx ivy ivz j = ivx `seq` ivy `seq` ivz `seq` if j > nbodies then do unsafeWrite b (i' vx) ivx unsafeWrite b (i' vy) ivy unsafeWrite b (i' vz) ivz else do let j' = (.|. shiftL j 3) jm <- unsafeRead b (j' m) dx <- liftM (ix-) (unsafeRead b (j' x)) dy <- liftM (iy-) (unsafeRead b (j' y)) dz <- liftM (iz-) (unsafeRead b (j' z)) let distance = sqrt (dx*dx+dy*dy+dz*dz) mag = 0.01 / (distance * distance * distance) addScaled3 (3 .|. (shiftL j 3)) ( im*mag) dx dy dz let a = -jm*mag ivx' = ivx+a*dx ivy' = ivy+a*dy ivz' = ivz+a*dz updateVel' ivx' ivy' ivz' $! (j+1) updateVel' ivx ivy ivz $! (i+1) addScaled (shiftL i 3) 0.01 (3 .|. (shiftL i 3)) updateVel (i+1) -- Helper functions addScaled i a j | i `seq` a `seq` j `seq` False = undefined -- stricitfy addScaled i a j = do set i1 =<< liftM2 scale (unsafeRead b i1) (unsafeRead b j1) set i2 =<< liftM2 scale (unsafeRead b i2) (unsafeRead b j2) set i3 =<< liftM2 scale (unsafeRead b i3) (unsafeRead b j3) where scale old new = old + a * new i1 = i; i2 = succ i1; i3 = succ i2; j1 = j; j2 = succ j1; j3 = succ j2; addScaled3 i a jx jy jz | i `seq` a `seq` jx `seq` jy `seq` jz `seq` False = undefined addScaled3 i a jx jy jz = do set i1 =<< liftM (scale jx) (unsafeRead b i1) set i2 =<< liftM (scale jy) (unsafeRead b i2) set i3 =<< liftM (scale jz) (unsafeRead b i3) where scale new old = a * new + old i1 = i; i2 = succ i1; i3 = succ i2;

Thanks Chris. I was actually asking about analyzing Core output in general. I'm well aware of the problems we're having with the nbody entry. I'm convinced my list based version can go faster than it is now. That's why I was asking if Don could put together a few notes on how to optimize inner loops using -ddump-simpl and the resulting Core code. So, I guess my request is along the same lines as your earlier one:
Also, could you explain how to check the Core (un)boxing in a note on the (new?) wiki? I would be interested in learning that trick.
I would love to have a tutorial about common situations one comes across when looking at the Core output.

joelkoerwer:
Thanks Chris. I was actually asking about analyzing Core output in general. I'm well aware of the problems we're having with the nbody entry. I'm convinced my list based version can go faster than it is now. That's why I was asking if Don could put together a few notes on how to optimize inner loops using -ddump-simpl and the resulting Core code.
Here's a brief introduction. I intend to write up (on the performance page on the wiki) a list of things we've done to improve the shootout entries. N.B we're now the 3rd *fastest* language, behind C and only a little behind D (a C varient) !! Consider the partial sums problem: wiki: http://www.haskell.org/hawiki/PartialSumsEntry site: http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=ghc&id=2 What follows is a discussion of the steps I took to improve the performance of this code. Here's the naive translation of the Clean entry (which was fairly quick): Lots of math in a tight loop.
import System; import Numeric
main = do n <- getArgs >>= readIO . head let sums = loop 1 n 1 0 0 0 0 0 0 0 0 0 fn (s,t) = putStrLn $ (showFFloat (Just 9) s []) ++ "\t" ++ t mapM_ (fn :: (Double, String) -> IO ()) (zip sums names)
names = ["(2/3)^k", "k^-0.5", "1/k(k+1)", "Flint Hills", "Cookson Hills" , "Harmonic", "Riemann Zeta", "Alternating Harmonic", "Gregory"]
loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9 | k > n = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ] | otherwise = loop (k+1) n (-alt) (a1 + (2/3) ** (k-1)) (a2 + k ** (-0.5)) (a3 + 1 / (k * (k + 1))) (a4 + 1 / (k*k*k * sin k * sin k)) (a5 + 1 / (k*k*k * cos k * cos k)) (a6 + 1 / k) (a7 + 1 / (k*k)) (a8 + alt / k) (a9 + alt / (2 * k - 1))
Compiled with "-O2". However, the performance is _really_ bad :/ Somewhere greater than 128M heap, in fact eventually running out of memory on my laptop. A classic space leak. (2) So look at the generated core. "ghc -o naive Naive.hs -O2 -ddump-simpl | less" And we find that our loop has the following type:
$sloop_r2U6 :: GHC.Prim.Double# -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Prim.Double# -> [GHC.Float.Double]
Hmm. Ok, I certainly don't want boxed doubles in such a tight loop. (3) My next step is to encourage GHC to unbox this loop, by providing some strictness annotations. Now the loop looks like this:
loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9 | () !k !n !False = undefined | k > n = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ] | otherwise = loop (k+1) n (-alt) (a1 + (2/3) ** (k-1)) (a2 + k ** (-0.5)) (a3 + 1 / (k * (k + 1))) (a4 + 1 / (k*k*k * sin k * sin k)) (a5 + 1 / (k*k*k * cos k * cos k)) (a6 + 1 / k) (a7 + 1 / (k*k)) (a8 + alt / k) (a9 + alt / (2 * k - 1)) where x ! y = x `seq` y
I've played a little game here, using ! for `seq`, reminiscent of the new !-pattern proposal for strictness. Let's see how this compiles. Here's the Core:
$sloop_r2Vh :: GHC.Prim.Double# -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Float.Double -> GHC.Prim.Double# -> [GHC.Float.Double]
Ok, so it unboxed one extra argument. Let's see if we can get them all unboxed. Strictify all args, and GHC produces an inner loop of:
$sloop_r2WS :: GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Prim.Double# -> GHC.Prim.Double# -> [GHC.Float.Double]
Ah! perfect. Let's see how that runs: $ ghc Naive.hs -O2 -no-recomp $ time ./a.out 2500000 3.000000000 (2/3)^k 3160.817621887 k^-0.5 0.999999600 1/k(k+1) 30.314541510 Flint Hills 42.995233998 Cookson Hills 15.309017155 Harmonic 1.644933667 Riemann Zeta 0.693146981 Alternating Harmonic 0.785398063 Gregory ./a.out 2500000 4.45s user 0.02s system 99% cpu 4.482 total (4) Not too bad. No space leak and quite zippy. But let's see what more can be done. First, double arithmetic usually benefits from -fexcess-precision, and cranking up the flags to gcc: paprika$ ghc Naive.hs -O2 -fexcess-precision -optc-O3 -optc-ffast-math -no-recomp paprika$ time ./a.out 2500000 3.000000000 (2/3)^k 3160.817621887 k^-0.5 0.999999600 1/k(k+1) 30.314541510 Flint Hills 42.995233998 Cookson Hills 15.309017155 Harmonic 1.644933667 Riemann Zeta 0.693146981 Alternating Harmonic 0.785398063 Gregory ./a.out 2500000 3.71s user 0.01s system 99% cpu 3.726 total Even better. Now, let's dive into the Core to see if there are any optimisation opportunites that GHC missed. So add -ddump-simpl and peruse the output. (5) Looking at the Core, I see firstly that some of the common subexpressions haven't been factored out: case [GHC.Float.Double] GHC.Prim./## 1.0 (GHC.Prim.*## (GHC.Prim.*## (GHC.Prim.*## (GHC.Prim.*## sc10_s2VS sc10_s2VS) sc10_s2VS) (GHC.Prim.sinDouble# sc10_s2VS)) (GHC.Prim.sinDouble# sc10_s2VS)) Multiple calls to sin. Hmm :/ And similar for cos, as well as k*k. Not sure why GHC isn't removing these (SimonM?), so let's do that by hand, and the inner loop looks like:
loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9 | () !k !n !alt !a1 !a2 !a3 !a4 !a5 !a6 !a7 !a8 !a9 !False = undefined | k > n = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ] | otherwise = loop (k+1) n (-alt) (a1 + (2/3) ** (k-1)) (a2 + k ** (-0.5)) (a3 + 1 / (k * (k + 1))) (a4 + 1 / (k3 * sk * sk)) (a5 + 1 / (k3 * ck * ck)) (a6 + 1 / k) (a7 + 1 / k2) (a8 + alt / k) (a9 + alt / (2 * k - 1)) where sk = sin k ; ck = cos k; k2 = k * k; k3 = k2 * k; x ! y = x `seq` y
looking at the Core shows the sins are now allocated and shared:
let a9_s2MI :: GHC.Prim.Double# a9_s2MI = GHC.Prim.sinDouble# sc10_s2Xa
So the common expressions are floated out, and it now runs: paprika$ time ./a.out 2500000 3.000000000 (2/3)^k 3160.817621887 k^-0.5 0.999999600 1/k(k+1) 30.314541510 Flint Hills 42.995233998 Cookson Hills 15.309017155 Harmonic 1.644933667 Riemann Zeta 0.693146981 Alternating Harmonic 0.785398063 Gregory ./a.out 2500000 3.29s user 0.00s system 99% cpu 3.290 total Faster. So we gained 12% by floating out those common expressions. (6) Finally, another trick. When I checked the C entry, it used an integer for the k parameter to the loop, and cast it to a double for the math each time around, so perhaps we can make it an Int parameter. And secondly, the alt parameter only has it's sign flipped each time, so perhaps we can factor out the alt / k arg (it's either 1 / k or -1 on k), saving a division. So the final loop looks like:
loop i n alt a1 a2 a3 a4 a5 a6 a7 a8 a9 | () !i !n !alt !a1 !a2 !a3 !a4 !a5 !a6 !a7 !a8 !a9 !False = undefined | k > n = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ] | otherwise = loop (i+1) n (-alt) (a1 + (2/3) ** (k-1)) (a2 + k ** (-0.5)) (a3 + 1 / (k * (k + 1))) (a4 + 1 / (k3 * sk * sk)) (a5 + 1 / (k3 * ck * ck)) (a6 + dk) (a7 + 1 / k2) (a8 + alt * dk) (a9 + alt / (2 * k - 1)) where k = fromIntegral i dk = 1/k sk = sin k ck = cos k k2 = k * k k3 = k2 * k x ! y = x `seq` y
Checking the generated C code shows that the same operations are generated as the C entry. And it runs: $ time ./a.out 2500000 3.000000000 (2/3)^k 3160.817621887 k^-0.5 0.999999600 1/k(k+1) 30.314541510 Flint Hills 42.995233998 Cookson Hills 15.309017155 Harmonic 1.644933667 Riemann Zeta 0.693146981 Alternating Harmonic 0.785398063 Gregory ./a.out 2500000 3.17s user 0.01s system 99% cpu 3.206 total This entry in fact runs faster than the original (though not the new vectorised entry) optimised C entry (and faster than all other languages): http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=all So, by carefully tweaking things, we first squished a space leak, and then gained another 30%. In summary: * Check the Core that is generated * Watch out for optimisations that are missed * Read the generated C for the tight loops. * Make sure tight loops are unboxed * Use -fexcess-precision and -optc-ffast-math for doubles This is roughly the process I used for the other shootout entries. Cheers, Don

Donald Bruce Stewart wrote:
This entry in fact runs faster than the original (though not the new vectorised entry) optimised C entry (and faster than all other languages): http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=all
So, by carefully tweaking things, we first squished a space leak, and then gained another 30%.
In summary: * Check the Core that is generated * Watch out for optimisations that are missed * Read the generated C for the tight loops. * Make sure tight loops are unboxed * Use -fexcess-precision and -optc-ffast-math for doubles
This is roughly the process I used for the other shootout entries.
Cheers, Don
I just looked hard at the "new vectorised entry" and the original entry for C. In both, the last two functions, which use the alt-ernating sign, are *not* done in the required naive fashion:
sum = 0.0; for (k = 1; k <= n-1; k += 2) sum += 1.0/kd; for (k = 2; k <= n; k += 2) sum -= 1.0/kd; printf("%.9f\tAlternating Harmonic\n", sum);
sum = 0.0; for (k = 1; k <= 2*n-1; k += 4) sum += 1.0/kd; for (k = 3; k <= 2*n; k += 4) sum -= 1.0/kd; printf("%.9f\tGregory\n", sum);
As you can see, all the positive terms are added to sum, then all the negative terms. The double precision math comes to a different result, but this is hidden by printing only 9 digits. I just modified the g++ entry and the Haskell entry which do it right and the c entry to print more digits (e.g. "show sum" in Haskell). The Haskell entry and g++ entry agree, as expected. The c entry does *not* agree: AltHarm 0.6931469805600938 Haskell 0.69314698056009382831632592569803819060325622558593750000... g++ 0.69314698056038037687898167860112152993679046630859375000... gcc Gregory 0.7853980633974356 Haskell 0.7853980633974355640702924574725329875946044921875000000... g++ 0.7853980633973864922126040255534462630748748779296875000... gcc The gcc entry is computing a different answer since it uses the wrong order for making the partial sum. -- Chris

Hello Donald, Wednesday, February 01, 2006, 8:00:04 AM, you wrote: DBS> Here's a brief introduction. I intend to write up (on the performance page on DBS> the wiki) a list of things we've done to improve the shootout entries. N.B DBS> we're now the 3rd *fastest* language, behind C and only a little behind D (a C DBS> varient) !! 3rd fastest or 3rd overall, counting program lines and so on? -- Best regards, Bulat mailto:bulatz@HotPOP.com

Bulat Ziganshin wrote:
Hello Donald,
Wednesday, February 01, 2006, 8:00:04 AM, you wrote: DBS> Here's a brief introduction. I intend to write up (on the performance page on DBS> the wiki) a list of things we've done to improve the shootout entries. N.B DBS> we're now the 3rd *fastest* language, behind C and only a little behind D (a C DBS> varient) !!
3rd fastest or 3rd overall, counting program lines and so on?
That's the unconceivable thing. It is 3rd fastest. Looking at just Full CPU Time: C gcc 35.90 3 D Digital Mars 32.72 3 Haskell GHC 30.25 0 SML MLton 28.72 3 OCaml 27.92 1 Eiffel Smart 26.17 6 C++ g++ 25.73 3 Nice 24.43 4 Ada 95 GNAT 23.45 4 Clean 23.32 7 Java JDK 1.4 -server 22.69 5 Java JDK -server 22.39 5 Java JDK -client 19.19 5 C# Mono 16.99 2 Only C gcc and D Digital Mars are ahead. Looking at Just Memory Use, Haskell is 8th language score missing C gcc 39.00 3 D Digital Mars 29.21 3 Forth GForth 28.63 2 Ada 95 GNAT 27.12 4 Pascal Free 26.53 7 Eiffel Smart 24.53 6 C++ g++ 24.46 3 Haskell GHC 24.28 0 OCaml 21.55 1 Fortran G95 20.21 6 Lua 19.63 2 SML MLton 17.80 3 Looking at Just Lines Of Code, Haskell is 1st by a mile: Haskell GHC 41.84 0 SML MLton 34.47 3 Forth GForth 32.50 2 OCaml 30.86 1 Tcl 30.83 3 Python Psyco 30.49 0 Python 30.33 1 Lua 29.17 2 Ruby 27.69 4 Perl 25.50 5 Nice 25.09 4 C# Mono 24.59 2 D Digital Mars 22.79 3 C++ g++ 22.60 3 Java JDK -client 21.12 5 Java JDK -Xint 21.12 5 Java JDK 1.4 -server 21.12 5 Java JDK -server 21.12 5 C gcc 20.98 3 Where I had to include lots of languages to get down to C gcc. Lookat at the 1:1:1 even balance of the above three, Haskell is 1st: Haskell GHC 96.37 0 C gcc 95.87 3 D Digital Mars 84.72 3 SML MLton 81.00 3 OCaml 80.33 1 Forth GForth 75.17 2 C++ g++ 72.79 3

--- Chris Kuklewicz
It is 3rd fastest. Looking at Just Memory Use, Haskell is 8th Looking at Just Lines Of Code, Haskell is 1st Lookat at the 1:1:1 even balance Haskell is 1st
Programmer skill and effort really does matter ;-) Congratulations. __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com

Isaac Gouy
Programmer skill and effort really does matter ;-)
Yes, more so, than any inherent language disadvantage, perhaps, which happens to be the general lesson from the ICFP contests as well. Any idea if other languages have seen similar efforts? -k -- If I haven't seen further, it is by standing in the footprints of giants

--- Ketil Malde
Isaac Gouy
writes: Programmer skill and effort really does matter ;-)
Yes, more so, than any inherent language disadvantage, perhaps, which happens to be the general lesson from the ICFP contests as well. Any idea if other languages have seen similar efforts?
FreePascal and Smart Effiel, somewhat - and there have been excellent individual efforts with Lua and Tcl and ... imo the Haskell Cafe discussions and wiki have been a more open and shared learning experience than we usually see, and maybe some of the success stems from that collaboration and competition. __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com

Don, that's a great little mini tutorial, exactly what I was hoping for. I'm looking forward to learning more tricks. On an unrelated note, I have an STUArray nbody. I haven't really looked closely at the chris+dons version, but I suspect they amount to doing the same thing. I get commensurate runtimes at least. But I'll post it on the wiki in a while in case there is some optimization I missed. On an even more unrelated note, I get slower runtimes with -optc-O3 and -optc-ffast-math than without. Individually or in tandem. Odd. This is ghc 6.4.1 and gcc 4.0.3 on a (Banias) Pentium M.

Joel Koerwer wrote:
Don, that's a great little mini tutorial, exactly what I was hoping for. I'm looking forward to learning more tricks.
On an unrelated note, I have an STUArray nbody. I haven't really looked closely at the chris+dons version, but I suspect they amount to doing the same thing. I get commensurate runtimes at least. But I'll post it on the wiki in a while in case there is some optimization I missed.
On an even more unrelated note, I get slower runtimes with -optc-O3 and -optc-ffast-math than without. Individually or in tandem. Odd. This is ghc 6.4.1 and gcc 4.0.3 on a (Banias) Pentium M.
More architecture benchmarking: On a powerbook G4, Joel Koerwer's entry on http://haskell.org/hawiki/NbodyEntry runs faster than dons+chris. And I edited it to make it smaller, but by hoisting 'size' and 'dt' I also made it run quite a bit faster. It now takes 1.7x less time than dons+chris. Don, could you check the speed on your architecture? -- Chris

haskell:
Joel Koerwer wrote:
Don, that's a great little mini tutorial, exactly what I was hoping for. I'm looking forward to learning more tricks.
On an unrelated note, I have an STUArray nbody. I haven't really looked closely at the chris+dons version, but I suspect they amount to doing the same thing. I get commensurate runtimes at least. But I'll post it on the wiki in a while in case there is some optimization I missed.
On an even more unrelated note, I get slower runtimes with -optc-O3 and -optc-ffast-math than without. Individually or in tandem. Odd. This is ghc 6.4.1 and gcc 4.0.3 on a (Banias) Pentium M.
More architecture benchmarking:
On a powerbook G4, Joel Koerwer's entry on http://haskell.org/hawiki/NbodyEntry runs faster than dons+chris.
And I edited it to make it smaller, but by hoisting 'size' and 'dt' I also made it run quite a bit faster. It now takes 1.7x less time than dons+chris.
Don, could you check the speed on your architecture?
Yes! I was missing the -funbox-strict-fields, it makes a huge difference. Check the wiki page, these stuarrays are the fastest yet. Cheers, Don

Hey this is great. Chris your improvements are awesome. I mean the speed is nice, but you really cleaned up the code. There's an extraneous call to energy in the second runST block, but it should be insignificant. Also, -fglasgow-exts is necessary for the left-hand-side type declarations of size and dt. One question, in: calcMomentum (i+1) $! (px+vx*m,py+vy*m,pz+vz*m) that $! doesn't actually do much for a tuple, does it? Of course, there's not much point in further optimizing the initialization routine, as we'd never be able to detect the difference in runtime. The shootout has been a great learning tool for me :-) Thanks to Chris, Don, and the rest of the Haskell community. Joel

Donald Bruce Stewart wrote:
haskell:
Donald Bruce Stewart wrote:
haskell:
There is a new combined benchmark, "partial sums" that subsumes several earlier benchmarks and runs 9 different numerical calculations:
Ah! I had an entry too. I've posted it on the wiki. I was careful to watch that all loops are compiled into nice unboxed ones in the Core. It seems to run a little bit faster than your more abstracted code.
Timings on the page.
Also, -fasm seems to only be a benefit on the Mac, as you've pointed out previously. Maybe you could check the times on the Mac too?
-- Don
Yeah. I had not tried all the compiler options. Using -fasm is slower on this for me as well. I suspect that since your code will beat the entries that have been posted so far, so I thin you should submit it.
ok, I'll submit it.
Also, could you explain how to check the Core (un)boxing in a note on the (new?) wiki? I would be interested in learning that trick.
Ah, i just do: ghc A.hs -O2 -ddump-simpl | less and then read the Core, keeping an eye on the functions I'm interested in, and checking they're compiling to the kind of loops I'd write by hand. This is particularly useful for the kinds of tight numeric loops used in some of the shootout entries.
Some comments on this: I couldn't get it to go any faster (1-2% is all, with some really ugly hacks). It comes down to good low-level loop optimisation, which GHC doesn't do. You could improve things by passing the array around rather than having it as a global, because then it can be unpacked - make sure you seq the array in the right places, check the Core to be sure. I didn't try this, and it might only improve things marginally. -fexcess-precision is required when compiling via C. It should only be necessary on x86, but 6.4.1 and earlier require it on all platforms (we fixed that recently). gcc -O2 is about 15% better than -fasm on x86_64 here. Cheers, Simon
participants (7)
-
Bulat Ziganshin
-
Chris Kuklewicz
-
dons@cse.unsw.edu.au
-
Isaac Gouy
-
Joel Koerwer
-
Ketil Malde
-
Simon Marlow