Re: [Haskell-cafe] Haskell version of ray tracer code is much slower than the original ML

On Thu, Jun 21, 2007 at 12:25:44PM +0100, Sebastian Sylvan wrote:
Try using floats for the vector, and strict fields (add a ! to the fields in the data declaration).
Because the optimisation page on the haskell wiki is very explicit about never using Float when you can use Double, that's why. An older revision used Float and it was slower than the current one. Making the datatypes strict also makes no difference. I have tried the obvious things :)
That's the simplest possible thing I can think of after about two seconds of looking anyway. It appears that the ML version uses float so I don't understand why you would use Double for the Haskell version at all, and then think you could do any sort of valid comparisons between them...
OCaML floats are Doubles, at least on x86. cheers, Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

Hello Philip, Thursday, June 21, 2007, 3:36:27 PM, you wrote:
revision used Float and it was slower than the current one. Making the datatypes strict also makes no difference.
don't forget to use either -funpack-strict-fields or {-# UNPACK #-} pragma -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Thu, Jun 21, 2007 at 04:23:37PM +0400, Bulat Ziganshin wrote:
Thursday, June 21, 2007, 3:36:27 PM, you wrote:
revision used Float and it was slower than the current one. Making the datatypes strict also makes no difference.
don't forget to use either -funpack-strict-fields or {-# UNPACK #-} pragma
ahem: http://www.kantaka.co.uk/cgi-bin/darcsweb.cgi?r=ray;a=headblob;f=/Makefile (Assuming you mean -funbox-strict-fields; -funpack-strict-fields doesn't appear in the ghc 6.6.1 manual as far as I can see.) As I said, I've tried the obvious things & they didn't make any difference. Now I could go sprinkling $!, ! and seq around like confetti but that seems like giving up really. Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt
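For anyone following along, here is a minimal sketch of the two spellings being discussed (the Vec type and dot function are made up for illustration; they are not the thread's actual ray.hs): plain ! marks a field strict, and a per-field {-# UNPACK #-} pragma additionally asks GHC to unbox it regardless of whether -funbox-strict-fields is on.

```haskell
-- Hypothetical vector type: strict, unpacked Double fields.
-- With only the !s (no pragmas), -funbox-strict-fields achieves
-- the same unboxing for the whole module.
data Vec = Vec {-# UNPACK #-} !Double
               {-# UNPACK #-} !Double
               {-# UNPACK #-} !Double

dot :: Vec -> Vec -> Double
dot (Vec x1 y1 z1) (Vec x2 y2 z2) = x1*x2 + y1*y2 + z1*z2

main :: IO ()
main = print (dot (Vec 1 2 3) (Vec 4 5 6))  -- prints 32.0
```

With the fields unboxed, each Vec is a single heap object holding three raw Doubles instead of three pointers to boxed Doubles.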

On Thu, Jun 21, 2007 at 01:45:04PM +0100, Philip Armstrong wrote:
As I said, I've tried the obvious things & they didn't make any difference. Now I could go sprinkling $!, ! and seq around like confetti but that seems like giving up really.
OK. Looks like I was mistaken. Strictness annotations *do* make a difference! Humph. Wonder what I was doing wrong yesterday? Anyway, timings follow, with all strict datatypes in the Haskell version:

  Language  File     Time in seconds
  Haskell   ray.hs   38.2
  OCaml     ray.ml   23.8
  g++-4.1   ray.cpp  12.6

(ML & C++ code from http://www.ffconsultancy.com/languages/ray_tracer/comparison.html) Gcc seems to have got quite a bit better since Jon last benchmarked this code. Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

Hello Philip, Friday, June 22, 2007, 7:36:51 PM, you wrote:
Language  File     Time in seconds
Haskell   ray.hs   38.2
OCaml     ray.ml   23.8
g++-4.1   ray.cpp  12.6
can you share sourcecode of this variant? i'm interested to see how much it is obfuscated btw, *their* measurement said that ocaml is 7% faster :) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Fri, Jun 22, 2007 at 10:11:27PM +0400, Bulat Ziganshin wrote:
Friday, June 22, 2007, 7:36:51 PM, you wrote:
Language  File     Time in seconds
Haskell   ray.hs   38.2
OCaml     ray.ml   23.8
g++-4.1   ray.cpp  12.6
can you share sourcecode of this variant? i'm interested to see how much it is obfuscated
http://www.kantaka.co.uk/darcs/ray The cpp & ml versions are precisely those available from the download links on http://www.ffconsultancy.com/languages/ray_tracer/comparison.html The optimisation options I used can be seen in the makefile.
btw, *their* measurement said that ocaml is 7% faster :)
Indeed. The gcc-4.0 compiled binary runs at about 15s IIRC, but it's still much better than 7% faster than the ocaml binary. cheers, Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

try making ray_sphere and intersect' local to intersect, then drop their constant ray parameter. saves me 25%. claus

On Sat, Jun 23, 2007 at 12:42:33AM +0100, Claus Reinke wrote:
try making ray_sphere and intersect' local to intersect, then drop their constant ray parameter. saves me 25%. claus
I see: I guess I'm paying for laziness in the first parameter to intersect' which I don't need. 25% brings it within spitting distance of the OCaml binary too. Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt
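The transformation Claus suggests can be sketched like this (toy types only; Ray, Sphere and the "intersection test" here are hypothetical stand-ins, not the thread's actual ray.hs definitions): move the per-sphere worker into a where-clause so it closes over the constant ray instead of receiving it as an argument on every recursive call.

```haskell
import Data.List (foldl')

type Ray    = Double   -- hypothetical stand-ins for the real types
type Sphere = Double
type Hit    = Double

-- Before: the worker is top-level-style and re-receives ray each step.
intersect0 :: Ray -> Hit -> [Sphere] -> Hit
intersect0 ray hit scene = foldl' (intersect' ray) hit scene
  where intersect' r h s = min h (r + s)   -- toy "intersection test"

-- After: the loop is local and ray is free in its body, so the
-- compiler can specialise the whole traversal for the fixed ray.
intersect1 :: Ray -> Hit -> [Sphere] -> Hit
intersect1 ray hit scene = go hit scene
  where
    go h []     = h
    go h (s:ss) = go (min h (ray + s)) ss

main :: IO ()
main = print (intersect0 1 10 [2,3], intersect1 1 10 [2,3])  -- (3.0,3.0)
```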

On Saturday 23 June 2007 08:54:11 Philip Armstrong wrote:
On Sat, Jun 23, 2007 at 12:42:33AM +0100, Claus Reinke wrote:
try making ray_sphere and intersect' local to intersect, then drop their constant ray parameter. saves me 25%. claus
I see: I guess I'm paying for laziness in the first parameter to intersect' which I don't need. 25% brings it within spitting distance of the OCaml binary too.
Can you post the code for this? -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. The OCaml Journal http://www.ffconsultancy.com/products/ocaml_journal/?e

try making ray_sphere and intersect' local to intersect, then drop their constant ray parameter. saves me 25%. claus
also try replacing that (foldl' intersect') with (foldr (flip intersect'))! using a recent ghc head instead of ghc-6.6.1 also seems to make a drastic difference (wild guess, seeing the unroll 1000 for ocaml: has there been a change to default unrolling in ghc?). claus
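The replacement Claus proposes can be sketched as below (with a toy operator standing in for intersect'). One caveat: foldr (flip f) consumes the list from the other end, so it only computes the same answer as foldl' f when the operator is insensitive to the order of combination, as a nearest-hit min is.

```haskell
import Data.List (foldl')

-- Toy stand-in for intersect': keep the nearer of two hit distances.
nearer :: Double -> Double -> Double
nearer acc x = min acc x

main :: IO ()
main = print ( foldl' nearer 10 [5,7,3]
             , foldr (flip nearer) 10 [5,7,3] )  -- (3.0,3.0)
```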

On Saturday 23 June 2007 12:05:01 Claus Reinke wrote:
try making ray_sphere and intersect' local to intersect, then drop their constant ray parameter. saves me 25%. claus
also try replacing that (foldl' intersect') with (foldr (flip intersect'))!
using a recent ghc head instead of ghc-6.6.1 also seems to make a drastic difference (wild guess, seeing the unroll 1000 for ocaml: has there been a change to default unrolling in ghc?).
Wow! Now I get:

  GHC:   15.8s
  OCaml: 14s
  g++:   10s

That's very impressive, and more than I was expecting. I think it is worth noting that you are now comparing optimized Haskell to unoptimized OCaml though, so the OCaml has more "low-hanging optimization fruit". :-) At this point, the single most productive optimization for the OCaml is to evade the polymorphism and closure in the "intersect" function by inlining the call to "List.fold_left". This makes the OCaml basically as fast as the C++ on my machine:

  GHC:   15.8s
  OCaml: 10.6s
  g++:   10s

let delta = sqrt epsilon_float

type vec = {x:float; y:float; z:float}
let zero = {x=0.; y=0.; z=0.}
let ( *| ) s r = {x = s *. r.x; y = s *. r.y; z = s *. r.z}
let ( +| ) a b = {x = a.x +. b.x; y = a.y +. b.y; z = a.z +. b.z}
let ( -| ) a b = {x = a.x -. b.x; y = a.y -. b.y; z = a.z -. b.z}
let dot a b = a.x *. b.x +. a.y *. b.y +. a.z *. b.z
let length r = sqrt(dot r r)
let unitise r = 1. /. length r *| r

let rec intersect orig dir (l, _ as hit) (center, radius, scene) =
  let l' =
    let v = center -| orig in
    let b = dot v dir in
    let disc = sqrt(b *. b -. dot v v +. radius *. radius) in
    let t1 = b -. disc and t2 = b +. disc in
    if t2>0. then if t1>0. then t1 else t2 else infinity in
  if l' >= l then hit else match scene with
  | [] -> l', unitise (orig +| l' *| dir -| center)
  | scenes -> intersects orig dir hit scenes
and intersects orig dir hit = function
  | [] -> hit
  | scene::scenes -> intersects orig dir (intersect orig dir hit scene) scenes

let light = unitise {x=1.; y=3.; z= -2.} and ss = 4

let rec ray_trace dir scene =
  let l, n = intersect zero dir (infinity, zero) scene in
  let g = dot n light in
  if g <= 0. then 0. else
    let p = l *| dir +| sqrt epsilon_float *| n in
    if fst (intersect p light (infinity, zero) scene) < infinity then 0. else g

let rec create level c r =
  let obj = c, r, [] in
  if level = 1 then obj else
    let a = 3. *. r /. sqrt 12. in
    let aux x' z' = create (level - 1) (c +| {x=x'; y=a; z=z'}) (0.5 *. r) in
    c, 3. *. r, [obj; aux (-.a) (-.a); aux a (-.a); aux (-.a) a; aux a a]

let level, n =
  try int_of_string Sys.argv.(1), int_of_string Sys.argv.(2) with _ -> 6, 512

let scene = create level {x=0.; y= -1.; z=4.} 1.;;

Printf.printf "P5\n%d %d\n255\n" n n;;
for y = n - 1 downto 0 do
  for x = 0 to n - 1 do
    let g = ref 0. in
    for dx = 0 to ss - 1 do
      for dy = 0 to ss - 1 do
        let aux x d = float x -. float n /. 2. +. float d /. float ss in
        let dir = unitise {x=aux x dx; y=aux y dy; z=float n} in
        g := !g +. ray_trace dir scene
      done;
    done;
    let g = 0.5 +. 255. *. !g /. float (ss*ss) in
    Printf.printf "%c" (char_of_int (int_of_float g))
  done;
done

-- Dr Jon D Harrop, Flying Frog Consultancy Ltd. The OCaml Journal http://www.ffconsultancy.com/products/ocaml_journal/?e

On Sat, Jun 23, 2007 at 12:05:01PM +0100, Claus Reinke wrote:
try making ray_sphere and intersect' local to intersect, then drop their constant ray parameter. saves me 25%. claus
also try replacing that (foldl' intersect') with (foldr (flip intersect'))!
Thanks guys, this is exactly the kind of advice I was seeking. OK, next question: Given that I'm using all the results from intersect', why is the lazy version better than the strict one? Is ghc managing to do some loop fusion?
using a recent ghc head instead of ghc-6.6.1 also seems to make a drastic difference (wild guess, seeing the unroll 1000 for ocaml: has there been a change to default unrolling in ghc?).
Um. I tried ghc head on the current version and it was about 15% *slower* than 6.6.1. Perhaps it does better on the (slightly) optimised version? Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

On Sat, Jun 23, 2007 at 07:07:49PM +0100, Philip Armstrong wrote:
On Sat, Jun 23, 2007 at 12:05:01PM +0100, Claus Reinke wrote:
try making ray_sphere and intersect' local to intersect, then drop their constant ray parameter. saves me 25%. claus
also try replacing that (foldl' intersect') with (foldr (flip intersect'))!
Thanks guys, this is exactly the kind of advice I was seeking.
OK, next question: Given that I'm using all the results from intersect', why is the lazy version better than the strict one? Is ghc managing to do some loop fusion?
Incidentally, replacing foldl' (intersect ray) hit scene with foldr (flip (intersect ray)) hit scene makes the current version (without the lifting of ray out of intersect & ray_scene) almost exactly as fast as the OCaml version on my hardware. That's almost a 40% speedup!
using a recent ghc head instead of ghc-6.6.1 also seems to make a drastic difference (wild guess, seeing the unroll 1000 for ocaml: has there been a change to default unrolling in ghc?).
Um. I tried ghc head on the current version and it was about 15% *slower* than 6.6.1
Perhaps it does better on the (slightly) optimised version?
Nope, just tried it on the foldr version. It's still slower than 6.6.1 (although not by as much: 26s vs 24s for the 6.6.1 binary). This is ghc head from this Friday.

Jon is probably correct that hoisting ray out is a code level transformation that makes the Haskell version different to the OCaml one, but personally I'd suggest that replacing foldl with foldr is not: the end result is the same & both have to walk the entire list in order to get a result. So, on my hardware at least, the Haskell and OCaml versions have equivalent performance. I think that's pretty impressive. Getting close to the C++ is probably going to require rather more effort!

Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

also try replacing that (foldl' intersect') with (foldr (flip intersect'))! OK, next question: Given that I'm using all the results from intersect', why is the lazy version better than the strict one? Is ghc managing to do some loop fusion?
haskell tends to prefer foldr where mls prefer foldl, be it for laziness and short-circuiting operators, or because a tail-recursive function with a lazy accumulator is only an efficient way to construct inefficient expressions. so, the very first thing i tend to look for when someone ports a program from ml to haskell are tail recursions with non-strict accumulators. even using foldl', when constructing pairs in the accumulator, there's no guarantee that the pair components will be evaluated early even if the pairs themselves are forced. so replacing foldl with foldr when porting from ml to haskell tends to be a useful habit, unless there are good reasons for foldl.

however, things seem to be a little bit more involved in this example: intersect' forces the first component, and ignores the second, never building nasty delayed combinations of old accumulator and list head in the new accumulator. but if you compare the outputs of -ddump-simpl, or if you export all definitions from main and compare the outputs of --show-iface, you'll find differences related to the result of intersect': whether or not that result can be passed unboxed.

Constructed Product Result Analysis for Haskell (2000) http://citeseer.ist.psu.edu/baker-finch00constructed.html

i don't know the details, but in the foldr case, the result of intersect' seems to be passed unboxed, in the foldl' case, it isn't. i'll leave it to the experts to explain whether that has to be the case or whether it is an omission in the optimizer. claus
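The "pairs in the accumulator" pitfall Claus mentions can be shown in miniature (a toy example, not the thread's code): foldl' forces the accumulator pair itself, but its components stay as growing thunks unless you force them too.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- Running (sum, count) over a list. foldl' forces the pair at each
-- step, but the components a and b are rebound lazily, so they
-- accumulate as chains of thunks.
sumsLazyPair :: [Int] -> (Int, Int)
sumsLazyPair = foldl' step (0, 0)
  where step (a, b) x = (a + x, b + 1)

-- Bang patterns force the components themselves at each step.
sumsStrictPair :: [Int] -> (Int, Int)
sumsStrictPair = foldl' step (0, 0)
  where step (!a, !b) x = (a + x, b + 1)

main :: IO ()
main = print (sumsStrictPair [1..1000])  -- prints (500500,1000)
```

Both versions compute the same result here; on large inputs the lazy-component version costs extra heap for the deferred additions.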
using a recent ghc head instead of ghc-6.6.1 also seems to make a drastic difference
$ uname -a
CYGWIN_NT-5.1 cr3-lt 1.5.19(0.150/4/2) 2006-01-20 13:28 i686 Cygwin
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.6.1
$ gcc --version
gcc.exe (GCC) 3.4.2 (mingw-special)
Copyright (C) 2004 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ ghc --make Ray1.hs
[1 of 1] Compiling Main             ( Ray1.hs, Ray1.o )
Linking Ray1.exe ...
$ time ./Ray1.exe >out

real    0m55.705s
user    0m0.015s
sys     0m0.031s

$ /cygdrive/c/fptools/ghc/ghc-6.7.20070613/bin/ghc --make Ray1.hs -o Ray1head.exe
[1 of 1] Compiling Main             ( Ray1.hs, Ray1.o )
Linking Ray1head.exe ...
$ time ./Ray1head.exe >out.head

real    0m24.989s
user    0m0.031s
sys     0m0.015s

$ diff -q --binary out out.head
$

On Friday 22 June 2007 19:54:16 Philip Armstrong wrote:
On Fri, Jun 22, 2007 at 10:11:27PM +0400, Bulat Ziganshin wrote:
btw, *their* measurement said that ocaml is 7% faster :)
Indeed. The gcc-4.0 compilied binary runs at about 15s IIRC, but it's still much better than 7% faster than the ocaml binary.
What architecture, platform, compiler versions and compile lines are you using? On my 2x 2.2GHz Athlon64 running x64 Debian I now get:

  GHC 6.6.1:    26.5s   ghc -funbox-strict-fields -O3 ray.hs -o ray
  OCaml 3.10.0: 14.158s ocamlopt -inline 1000 ray.ml -o ray
  g++ 4.1.3:    8.056s  g++ -O3 -ffast-math ray.cpp -o ray

Also, the benchmarks and results that I cited before are more up to date than the ones you're using. In particular, you might be interested in these faster versions:

http://www.ffconsultancy.com/languages/ray_tracer/code/5/ray.ml
http://www.ffconsultancy.com/languages/ray_tracer/code/5/ray.cpp

For "./ray 6 512", I get:

  OCaml: 3.140s ocamlopt -inline 1000 ray.ml -o ray
  C++:   2.970s g++ -O3 -ffast-math ray.cpp -o ray

-- Dr Jon D Harrop, Flying Frog Consultancy Ltd. The OCaml Journal http://www.ffconsultancy.com/products/ocaml_journal/?e

jon:
On Friday 22 June 2007 19:54:16 Philip Armstrong wrote:
On Fri, Jun 22, 2007 at 10:11:27PM +0400, Bulat Ziganshin wrote:
btw, *their* measurement said that ocaml is 7% faster :)
Indeed. The gcc-4.0 compilied binary runs at about 15s IIRC, but it's still much better than 7% faster than the ocaml binary.
What architecture, platform, compiler versions and compile lines are you using?
On my 2x 2.2GHz Athlon64 running x64 Debian I now get:
GHC 6.6.1: 26.5s ghc -funbox-strict-fields -O3 ray.hs -o ray
Don't use -O3, it's *worse* than -O2, and somewhere between -Onot and -O iirc.

  ghc -O2 -funbox-strict-fields -fvia-C -optc-O2 -optc-ffast-math -fexcess-precision

are usually fairly good.
OCaml 3.10.0: 14.158s ocamlopt -inline 1000 ray.ml -o ray
g++ 4.1.3: 8.056s g++ -O3 -ffast-math ray.cpp -o ray
Also, the benchmarks and results that I cited before are more up to date than the ones you're using. In particular, you might be interested in these faster versions:
http://www.ffconsultancy.com/languages/ray_tracer/code/5/ray.ml http://www.ffconsultancy.com/languages/ray_tracer/code/5/ray.cpp
For "./ray 6 512", I get:
OCaml: 3.140s ocamlopt -inline 1000 ray.ml -o ray
C++: 2.970s g++ -O3 -ffast-math ray.cpp -o ray
-- Dr Jon D Harrop, Flying Frog Consultancy Ltd. The OCaml Journal http://www.ffconsultancy.com/products/ocaml_journal/?e

Donald Bruce Stewart wrote:
Don't use -O3, it's *worse* than -O2, and somewhere between -Onot and -O iirc,
ghc -O2 -funbox-strict-fields -fvia-C -optc-O2 -optc-ffast-math -fexcess-precision
Are usually fairly good.
Is this likely to be fixed ever?

On Sat, Jun 23, 2007 at 08:49:15AM +0100, Andrew Coppin wrote:
Donald Bruce Stewart wrote:
Don't use -O3, it's *worse* than -O2, and somewhere between -Onot and -O iirc,
Is this likely to be fixed ever?
There is at least a bug report for it IIRC. Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

Philip Armstrong wrote:
On Sat, Jun 23, 2007 at 08:49:15AM +0100, Andrew Coppin wrote:
Donald Bruce Stewart wrote:
Don't use -O3, it's *worse* than -O2, and somewhere between -Onot and -O iirc,
Is this likely to be fixed ever?
There is at least a bug report for it IIRC.
It was fixed yesterday. Cheers, Simon

On Sat, Jun 23, 2007 at 03:28:53AM +0100, Jon Harrop wrote:
What architecture, platform, compiler versions and compile lines are you using?
32-bit x86, Debian unstable, gcc version 4.1.2, OCaml version 3.09.2-9, GHC version 6.6.1, compile line in the Makefile at http://www.kantaka.co.uk/darcs/ray/Makefile
http://www.ffconsultancy.com/languages/ray_tracer/code/5/ray.ml http://www.ffconsultancy.com/languages/ray_tracer/code/5/ray.cpp
I gather these use algorithmic optimisations. cheers, Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

On Saturday 23 June 2007 08:58:10 Philip Armstrong wrote:
On Sat, Jun 23, 2007 at 03:28:53AM +0100, Jon Harrop wrote:
What architecture, platform, compiler versions and compile lines are you using?
32-bit x86...
Intel or AMD?
http://www.ffconsultancy.com/languages/ray_tracer/code/5/ray.ml http://www.ffconsultancy.com/languages/ray_tracer/code/5/ray.cpp
I gather these use algorithmic optimisations.
Both. Versions 1-4 are progressively algorithmic, version 5 includes low-level optimizations (mostly manual inlining and unrolling). Implementing version 1 is probably also interesting: it is the most concise version. BTW, the ray tracer leverages the semantics of the float infinity (as the parameter when there is no intersection). Shouldn't be a problem but some compiler options might break it. -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. The OCaml Journal http://www.ffconsultancy.com/products/ocaml_journal/?e

On Sat, Jun 23, 2007 at 10:32:31AM +0100, Jon Harrop wrote:
On Saturday 23 June 2007 08:58:10 Philip Armstrong wrote:
On Sat, Jun 23, 2007 at 03:28:53AM +0100, Jon Harrop wrote:
What architecture, platform, compiler versions and compile lines are you using?
32-bit x86...
Intel or AMD?
AMD. Athlon 64 3000+ to be precise.
http://www.ffconsultancy.com/languages/ray_tracer/code/5/ray.ml http://www.ffconsultancy.com/languages/ray_tracer/code/5/ray.cpp
I gather these use algorithmic optimisations.
Both. Versions 1-4 are progressively algorithmic, version 5 includes low-level optimizations (mostly manual inlining and unrolling). Implementing version 1 is probably also interesting: it is the most concise version.
BTW, the ray tracer leverages the semantics of the float infinity (as the parameter when there is no intersection). Shouldn't be a problem but some compiler options might break it.
Thus far the Haskell version generates identical images to the OCaml one. Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

Awesome stuff! On Thursday 21 June 2007 12:36:27 Philip Armstrong wrote:
On Thu, Jun 21, 2007 at 12:25:44PM +0100, Sebastian Sylvan wrote:
Try using floats for the vector, and strict fields (add a ! to the fields in the data declaration).
Because the optimisation page on the haskell wiki is very explicit about never using Float when you can use Double, that's why. An older revision used Float and it was slower than the current one. Making the datatypes strict also makes no difference.
Where exactly do the !s go and what do they do?
That's the simplest possible thing I can think of after about two seconds of looking anyway. It appears that the ML version uses float so I don't understand why you would use Double for the Haskell version at all, and then think you could do any sort of valid comparisons between them...
OCaML floats are Doubles, at least on x86.
Yes. OCaml doesn't have a 32-bit float storage format, apart from an entirely-float big array. Also, the ML in OCaml doesn't stand for metalanguage. ;-)

There is probably some benefit to laziness in this example because many of the spheres are occluded in the final image, so parts of the scene tree that are eagerly generated in the other implementations may actually never be traversed by the renderer.

I take it you saw the whole language comparison: http://www.ffconsultancy.com/languages/ray_tracer/

I'll be uploading concurrent implementations ASAP. Haskell should do well there... :-)

PS: You spelled my name wrong!

-- Dr Jon D Harrop, Flying Frog Consultancy Ltd. The OCaml Journal http://www.ffconsultancy.com/products/ocaml_journal/?e
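Jon's point about laziness and occluded geometry can be illustrated with a toy sketch (the names and tree type here are made up, not the ray tracer's actual scene representation): a lazily built scene tree is only constructed as far as the traversal actually looks, so subtrees the renderer never visits are never built.

```haskell
data Tree = Leaf Int | Node Tree Tree

-- An enormous "scene" that costs nothing to define lazily:
-- the full tree would have 2^n leaves if built eagerly.
scene :: Int -> Tree
scene 0 = Leaf 0
scene n = Node (scene (n - 1)) (scene (n - 1))

-- A traversal that, like a renderer skipping occluded spheres,
-- only walks one spine; the right subtrees are never forced.
leftmost :: Tree -> Int
leftmost (Leaf x)   = x
leftmost (Node l _) = leftmost l

main :: IO ()
main = print (leftmost (scene 100000))  -- prints 0
```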

On Thu, Jun 21, 2007 at 01:39:24PM +0100, Jon Harrop wrote:
Awesome stuff!
On Thursday 21 June 2007 12:36:27 Philip Armstrong wrote:
On Thu, Jun 21, 2007 at 12:25:44PM +0100, Sebastian Sylvan wrote:
Try using floats for the vector, and strict fields (add a ! to the fields in the data declaration).
Because the optimisation page on the haskell wiki is very explicit about never using Float when you can use Double, that's why. An older revision used Float and it was slower than the current one. Making the datatypes strict also makes no difference.
Where exactly do the !s go and what do they do?
On the datatypes:

  data Vector = V !Double !Double !Double

for instance. They tell the compiler to make those fields strict rather than lazy. This may or may not help things...
OCaML floats are Doubles, at least on x86.
Yes. OCaml doesn't have a 32-bit float storage format, apart from an entirely-float big array. Also, the ML in OCaml doesn't stand for metalanguage. ;-)
Point!
I take it you saw the whole language comparison:
Yup. I'd been meaning to run off a haskell version for a while.
I'll be uploading concurrent implementations ASAP. Haskell should do well there... :-)
I'll be on the lookout.
PS: You spelled my name wrong!
Sorry! Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

On Thu, 2007-06-21 at 13:39 +0100, Jon Harrop wrote:
Awesome stuff!
On Thursday 21 June 2007 12:36:27 Philip Armstrong wrote:
On Thu, Jun 21, 2007 at 12:25:44PM +0100, Sebastian Sylvan wrote:
Try using floats for the vector, and strict fields (add a ! to the fields in the data declaration).
Because the optimisation page on the haskell wiki is very explicit about never using Float when you can use Double, that's why. An older revision used Float and it was slower than the current one. Making the datatypes strict also makes no difference.
Where exactly do the !s go and what do they do?
That's the simplest possible thing I can think of after about two seconds of looking anyway. It appears that the ML version uses float so I don't understand why you would use Double for the Haskell version at all, and then think you could do any sort of valid comparisons between them...
OCaML floats are Doubles, at least on x86.
Yes. OCaml doesn't have a 32-bit float storage format, apart from an entirely-float big array. Also, the ML in OCaml doesn't stand for metalanguage. ;-)
To be technical, it should be OCAML.

Philip Armstrong writes:
Because the optimisation page on the haskell wiki is very explicit about never using Float when you can use Double, that's why. (snip)
Is that still true if you use -fexcess-precision ? -- Mark

On Thu, Jun 21, 2007 at 01:29:56PM -0400, Mark T.B. Carroll wrote:
Philip Armstrong
writes: (snip) Because the optimisation page on the haskell wiki is very explicit about never using Float when you can use Double, that's why. (snip)
Is that still true if you use -fexcess-precision ?
Why on earth would you use -fexcess-precision if you're using Floats? The excess precision only applies to Doubles held in registers on x86 IIRC. (If you spill a Double from a register to memory, then you lose the extra precision bits in the process). Unless -fexcess-precision with ghc does something completely different to the analogous gcc setting that is. Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

Philip Armstrong
Why on earth would you use -fexcess-precision if you're using Floats? The excess precision only applies to Doubles held in registers on x86 IIRC. (If you spill a Double from a register to memory, then you lose the extra precision bits in the process).
Some googling suggests that point 2 on http://www.haskell.org/hawiki/FasterFloatingPointWithGhc might have been what I was thinking of. -- Mark

On Thu, Jun 21, 2007 at 03:29:17PM -0400, Mark T.B. Carroll wrote:
Philip Armstrong
writes: (snip) Why on earth would you use -fexcess-precision if you're using Floats? The excess precision only applies to Doubles held in registers on x86 IIRC. (If you spill a Double from a register to memory, then you lose the extra precision bits in the process).
Some googling suggests that point 2 on http://www.haskell.org/hawiki/FasterFloatingPointWithGhc might have been what I was thinking of.
That's the old wiki. The new one gives the opposite advice! (As does the ghc manual): http://www.haskell.org/ghc/docs/latest/html/users_guide/faster.html http://www.haskell.org/haskellwiki/Performance/Floating_Point Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

On Thu, Jun 21, 2007 at 08:42:57PM +0100, Philip Armstrong wrote:
On Thu, Jun 21, 2007 at 03:29:17PM -0400, Mark T.B. Carroll wrote:
Philip Armstrong
writes: (snip) Why on earth would you use -fexcess-precision if you're using Floats? The excess precision only applies to Doubles held in registers on x86 IIRC. (If you spill a Double from a register to memory, then you lose the extra precision bits in the process).
Some googling suggests that point 2 on http://www.haskell.org/hawiki/FasterFloatingPointWithGhc might have been what I was thinking of.
That's the old wiki. The new one gives the opposite advice! (As does the ghc manual):
http://www.haskell.org/ghc/docs/latest/html/users_guide/faster.html http://www.haskell.org/haskellwiki/Performance/Floating_Point
Incidentally, the latter page implies that ghc is being overly pessimistic when compiling FP code without -fexcess-precision:

"On x86 (and other platforms with GHC prior to version 6.4.2), use the -fexcess-precision flag to improve performance of floating-point intensive code (up to 2x speedups have been seen). This will keep more intermediates in registers instead of memory, at the expense of occasional differences in results due to unpredictable rounding."

IIRC, it is possible to issue an instruction to the x86 FP unit which makes all operations work on 64-bit Doubles, even though there are 80 bits available internally. Which then means there's no requirement to spill intermediate results to memory in order to get the rounding correct. Ideally, -fexcess-precision should just affect whether the FP unit uses 80 or 64 bit Doubles. It shouldn't make any performance difference, although obviously the generated results may be different.

As an aside, if you use the -optc-mfpmath=sse option, then you only get 64-bit Doubles anyway (on x86). cheers, Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

Philip Armstrong wrote:
On Thu, Jun 21, 2007 at 08:42:57PM +0100, Philip Armstrong wrote:
On Thu, Jun 21, 2007 at 03:29:17PM -0400, Mark T.B. Carroll wrote:
That's the old wiki. The new one gives the opposite advice! (As does the ghc manual):
http://www.haskell.org/ghc/docs/latest/html/users_guide/faster.html http://www.haskell.org/haskellwiki/Performance/Floating_Point
Incidentally, the latter page implies that ghc is being overly pessimistic when compiling FP code without -fexcess-precision:
"On x86 (and other platforms with GHC prior to version 6.4.2), use the -fexcess-precision flag to improve performance of floating-point intensive code (up to 2x speedups have been seen). This will keep more intermediates in registers instead of memory, at the expense of occasional differences in results due to unpredictable rounding."
IIRC, it is possible to issue an instruction to the x86 FP unit which makes all operations work on 64-bit Doubles, even though there are 80-bits available internally. Which then means there's no requirement to spill intermediate results to memory in order to get the rounding correct.
For some background on why GHC doesn't do this, see the comment "MORE FLOATING POINT MUSINGS..." in http://darcs.haskell.org/ghc/compiler/nativeGen/MachInstrs.hs

The main problem is floats: even if you put the FPU into 64-bit mode, your float operations will be done at 64-bit precision. There are other technical problems that we found with doing this; the comment above elaborates.

GHC passes -ffloat-store to GCC, unless you give the flag -fexcess-precision. The idea is to try to get reproducible floating-point results. The native code generator is unaffected by -fexcess-precision, but it produces rubbish floating-point code on x86 anyway.
Ideally, -fexcess-precision should just affect whether the FP unit uses 80 or 64 bit Doubles. It shouldn't make any performance difference, although obviously the generated results may be different.
As an aside, if you use the -optc-mfpmath=sse option, then you only get 64-bit Doubles anyway (on x86).
You probably want SSE2. If I ever get around to finishing it, the GHC native code generator will be able to generate SSE2 code on x86 someday, like it currently does for x86-64. For now, to get good FP performance on x86, you probably want

  -fvia-C -fexcess-precision -optc-mfpmath=sse2

Cheers, Simon

On Fri, Jun 22, 2007 at 01:16:54PM +0100, Simon Marlow wrote:
Philip Armstrong wrote:
IIRC, it is possible to issue an instruction to the x86 FP unit which makes all operations work on 64-bit Doubles, even though there are 80-bits available internally. Which then means there's no requirement to spill intermediate results to memory in order to get the rounding correct.
For some background on why GHC doesn't do this, see the comment "MORE FLOATING POINT MUSINGS..." in
http://darcs.haskell.org/ghc/compiler/nativeGen/MachInstrs.hs
Twisty. I guess 'slow, but correct, with switches to go faster at the price of correctness' is about the best option.
You probably want SSE2. If I ever get around to finishing it, the GHC native code generator will be able to generate SSE2 code on x86 someday, like it currently does for x86-64. For now, to get good FP performance on x86, you probably want
-fvia-C -fexcess-precision -optc-mfpmath=sse2
Reading the gcc manpage, I think you mean -optc-msse2 -optc-mfpmath=sse. -mfpmath=sse2 doesn't appear to be an option. (I note in passing that the ghc darcs head produces binaries from ray.hs which are about 15% slower than ghc 6.6.1 ones btw. Same optimisation options used both times.) cheers, Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

-fvia-C -fexcess-precision -optc-mfpmath=sse2
is there, or should there be, a way to define -O "profiles" for ghc? so that -O would refer to the standard profile, -Ofp would refer to the combination above as a floating point optimisation profile, other profiles might include things like -funbox-strict-fields, and -Omy42 would refer to my own favourite combination of flags.. perhaps this should be generalised to ghc flag profiles, to cover things like '-fno-monomorphism-restriction -fno-mono-pat-binds' or '-fglasgow-exts -fallow-undecidable-instances' and the like? just a thought, claus

On 22/06/07, Claus Reinke
perhaps this should be generalised to ghc flag profiles, to cover things like '-fno-monomorphism-restriction -fno-mono-pat-binds' or '-fglasgow-exts -fallow-undecidable-instances' and the like?
You just *know* someone's gonna abuse that to make a genuine -funroll-loops, right? ;-) D.

on second thought, user-defined profiles are a two-edged sword, negating the documentation advantages of in-source flags. better to handle that in the editor/ide. but predefined flag profiles would still seem to make sense?

there is something wrong about this wealth of options. it is great that one has all that control over details, but it also makes it more difficult to get things right (eg, i was surprised that -O doesn't unbox strict fields by default). even a formula one driver doesn't control every lever himself, that's up to the team.

for optimisations, i used to have a simple picture in mind (from my c days, i guess?), when ghci is no longer fast enough, that is:

no -O: standard executables are fast enough, thank you
-O: standard executables aren't fast enough, do something about it, but don't bother me with the details
-O2: i need your best _safe_ optimisation efforts, and i'm prepared to pay for that with longer compilation times
-O3: i need your absolute best optimisation efforts, and i'm prepared to verify myself that optimisations that cannot automatically be checked for safety have no serious negative effect on the results (it would be nice if you told me which potentially unsafe optimisations you used in compilation)

on top of that, as an alternative to -O3, specific tradeoffs would be useful, where i specify whether i want to optimise for space or for time, or which kinds of optimisation opportunities the compiler should pay attention to, such as strictness, unboxing, floating point ops, etc.. but even here i wouldn't want to give platform-specific options, i'd want the compiler to choose the most appropriate options, given my specified tradeoffs and emphasis, taking into account platform and self-knowledge. so, i'd say -Ofp, and the compiler might pick:
-fvia-C -fexcess-precision -optc-mfpmath=sse2
if i'm on a platform and compiler version where that is an appropriate selection of flags to get the best floating point performance. and it might pick a different selection of flags on a different platform, or with a different compiler version.
perhaps this should be generalised to ghc flag profiles, to cover things like '-fno-monomorphism-restriction -fno-mono-pat-binds' or '-fglasgow-exts -fallow-undecidable-instances' and the like?
that is a slightly different story, and it might be useful (a) to provide flag groups (-fno-mono*) and (b) to specify implication (just about every language extension flag implies -fglasgow-exts, so there's no need to specify that again, and there might be other opportunities for replacing a group of options with its single maximum in the implication order; one might even introduce pseudo-flags for grouping, such as -fhaskell2;-). claus
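[The documentation advantage of in-source flags that claus mentions can be made concrete with an OPTIONS_GHC pragma. This is only a sketch: the Vec type and dot function are invented for illustration, not taken from the ray tracer, and the flag set is one plausible "floating point" combination, not a blessed profile.]

```haskell
{-# OPTIONS_GHC -O2 -funbox-strict-fields -fexcess-precision #-}
module Main where

-- Strict fields: with -funbox-strict-fields (not implied by plain -O),
-- the three Doubles are stored unboxed inside the constructor.
data Vec = Vec !Double !Double !Double

dot :: Vec -> Vec -> Double
dot (Vec a b c) (Vec x y z) = a*x + b*y + c*z

main :: IO ()
main = print (dot (Vec 1 2 3) (Vec 4 5 6))  -- prints 32.0
```

The flags travel with the module that needs them, which is the documentation advantage; the downside, as noted above, is that ad-hoc user-defined combinations end up scattered through the source.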

So float math is *slower* than double math in Haskell? That is interesting. Why is that? BTW, does Haskell support 80-bit "long double"s? The Intel CPU seems to use that format internally.

On Thu, Jun 21, 2007 at 08:15:36PM +0200, peterv wrote:
So float math is *slower* than double math in Haskell? That is interesting. Why is that?
BTW, does Haskell support 80-bit "long double"s? The Intel CPU seems to use that format internally.
As I understand things, that is the effect of using -fexcess-precision. Obviously this means that the behaviour of your program can change with seemingly trivial code rearrangements, but if you're messing with floating point numbers then you ought to know what you're doing anyway. Phil -- http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt
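[As an aside on the Float versus Double question upthread: the two types round the same expression differently under ordinary IEEE single and double precision, quite apart from anything -fexcess-precision does. A small demonstration, not taken from the ray tracer:]

```haskell
module Main where

-- 0.1 and 0.2 have no exact binary floating point representation, so
-- the sum accumulates different rounding error at the two precisions.
main :: IO ()
main = do
  print (0.1 + 0.2 :: Float)   -- prints 0.3
  print (0.1 + 0.2 :: Double)  -- prints 0.30000000000000004
```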

Philip Armstrong wrote:
On Thu, Jun 21, 2007 at 08:15:36PM +0200, peterv wrote:
So float math is *slower* than double math in Haskell? That is interesting. Why is that?
BTW, does Haskell support 80-bit "long double"s? The Intel CPU seems to use that format internally.
As I understand things, that is the effect of using -fexcess-precision.
Obviously this means that the behaviour of your program can change with seemingly trivial code rearrangements,
Not just code rearrangements: your program will give different results depending on the optimisation settings, whether you compile with -fvia-C or -fasm, and the results will be different from those on a machine using fixed 32-bit or 64-bit precision floating point operations. Cheers, Simon
participants (11)
- Andrew Coppin
- Bulat Ziganshin
- Claus Reinke
- Derek Elkins
- dons@cse.unsw.edu.au
- Dougal Stanton
- Jon Harrop
- mark@ixod.org
- peterv
- Philip Armstrong
- Simon Marlow