fast Eucl. dist. - Haskell vs C

Hello, For a while now, I've been trying to come up with a fast Haskell-only function which implements Euclidean distance in n-dimensional space. So far, I've been disappointed by the performance of my best effort in Haskell, compared to C. I'm hoping some of the Haskell experts and/or performance gurus on this list can help me out on resolving this, and also maybe shed some light on some strange (imho) things I've run into. My current best try uses the uvector package, has two 'vectors' of type (UArr Double) as input, and relies on the sumU and zipWithU functions which use streaming to compute the result: dist_fast :: UArr Double -> UArr Double -> Double dist_fast p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds = zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y I've been benchmarking this function against various alternatives using the MicroBench [1] package, which allows to get accurate timings of single function calls. I've also mimicked the MicroBench approach in pure C, to get comparable timings for a C-only implementation. The C-only function is quite straightforward: double dist(int dim, double* p1, double* p2){ int i; double d = 0.0; for(i=0; i < dim; i++){ d += (p1[i] - p2[i])*(p1[i] - p2[i]); } return sqrt(d); } (Note that the optimizer takes care of the p1-p2 operation appearing twice in the code). All code is attached if you'd like to play around with it. All numbers reported below are using GHC 6.10.2 and gcc 4.3.3 on Linux/x86. The compilation details can be found in the Makefile attached, but in short, I'm using -O2 -fexcess-precision or -O2 -fexcess-precision -fvia-C -optc-O3 with GHC, and -O3 with gcc. Now the bad news: for small dimensions, e.g. 2D/3D-space, the dist_fast function is 70-240x slower than a pure C implementation, depending on the architecture. For example, in 2D-space on an Intel Pentium 4 (3.0GHz, 1M L2 cache), a single call to dist_fast takes about 1.75 microseconds (or 0.00000175s), while a call to dist_c (C implementation of Eucl. dist), takes about 0.025 microseconds (70x slowdown). On a Core 2 2.5GHz with 6MB L2 this maps to 1.9 and 0.008 microseconds, resp. (i.e. 240x slower), while on a Core i7 2.66GHz with 8MB L2 the numbers are 1.53 and 0.011 microseconds (i.e. 140x slower). For larger dimensions, the gap becomes less big, but is still worrying: 10D: 40-110x; 100D: 10-17x; >1kD: 2.5x-6x. I'm mostly interested in the range 10D to 100D, so seeing that Haskell is over 10x and up to 100x slower than C is kind of making me cry. I've tried some things to improve on this without much luck, on the contrary: *) Marking dist_fast for inlining makes things worse; in general the inlined version is 2x slower for low dimensionality, and even 5x slower for larger dimensionality. This was somewhat surprising to me... *) In a moment of weakness, I used the Foreign Function Interface to call the dist_c C-only implementation from Haskell. Unfortunately, there seems to be a lot of overhead in calling dist_c from Haskell. Most of the performance gain from using C melts away, and sometimes the performance of the FFI'd dist_c is 15-30% worse than the native dist_fast version (especially at low dimensionality). Only for the largest dimensionalities (10k-100kD), the FFI'd version reaches the performance of the native C approach. But, since I'm mostly interested in the 10-100D range, this is of little use to me. One thing I noticed is that compiling through C using -fvia-C -optc-O3 might be a bad idea, depending on your system. On an Intel Pentium 4 system, -fvia-C -optc-O3 was giving me a speedup of up 70% (large dim.), while on Core 2 and Core i7 it resulted in a slowdown of 15-20% ! I was using roughly equal versions of GCC with this, i.e. a self-bootstrapped GCC 4.3.x. So, my question to the list if simple: how can I get better performance out of a Haskell-only approach? Any comments/suggestions are highly appreciated. I'd prefer a Haskell-only approach, but my main concern is speed. The Euclidean distance function will be used quite heavily in various tools. I currently have a C-version of some of the tools, but the amount of code that is needed for those tools is becoming ridiculously big. I believe using Haskell will allow me to come up with a more easy to maintain code base. However, I don't want to pay a huge price for this in terms of performance. greetings, Kenneth [1] MicroBench: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/microbench -- Kenneth Hoste Paris research group - ELIS - Ghent University, Belgium email: kenneth.hoste@elis.ugent.be website: http://www.elis.ugent.be/~kehoste blog: http://boegel.kejo.be

My current best try uses the uvector package, has two 'vectors' of type (UArr Double) as input, and relies on the sumU and zipWithU functions which use streaming to compute the result:
dist_fast :: UArr Double -> UArr Double -> Double dist_fast p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds = zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y
You'll probably want to make sure that 'euclidian' is specialized to the types you need (here 'Double'), not used overloaded for 'Num a=>a' (check -ddump-tc, or -ddump-simpl output). After that, unrolling the fused fold loop (uvector internal) might help a bit, but isn't there yet: http://hackage.haskell.org/trac/ghc/ticket/3123 http://hackage.haskell.org/trac/ghc/wiki/Inlining And even if that gets implemented, it doesn't apply directly to your case, where the loop is in a library, but you might want to control its unrolling in your client code. Having the loop unrolled by a default factor (8x or so) should help for loops like this, with little computation. Claus

dist_fast :: UArr Double -> UArr Double -> Double dist_fast p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds = zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y
You'll probably want to make sure that 'euclidian' is specialized to the types you need (here 'Double'), not used overloaded for 'Num a=>a' (check -ddump-tc, or -ddump-simpl output).
Sorry about that misdirection - as it happened, I was looking at the tc output for 'dist_fast' (euclidean :: forall a. (Num a) => a -> a -> a), but the simpl output for 'dist_fast_inline' .., which uses things like __inline_me .. case Dist.sumU (Dist.$wzipWithU .. GHC.Num.- @ GHC.Types.Double GHC.Float.$f9 x_aLt y_aLv Once I actually add a 'dist_fast_inline_caller', that indirection disappears in the inlined code, just as it does for dist_fast itself. dist_fast_inlined_caller :: UArr Double -> UArr Double -> Bool dist_fast_inlined_caller p1 p2 = dist_fast_inlined p1 p2 > 2 However, in the simpl output for 'dist_fast_inline_caller', the 'sumU' and 'zipWithU' still don't seem to be fused - Don? Claus

claus.reinke:
dist_fast :: UArr Double -> UArr Double -> Double dist_fast p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds = zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y
You'll probably want to make sure that 'euclidian' is specialized to the types you need (here 'Double'), not used overloaded for 'Num a=>a' (check -ddump-tc, or -ddump-simpl output).
Sorry about that misdirection - as it happened, I was looking at the tc output for 'dist_fast' (euclidean :: forall a. (Num a) => a -> a -> a), but the simpl output for 'dist_fast_inline' .., which uses things like
__inline_me .. case Dist.sumU (Dist.$wzipWithU .. GHC.Num.- @ GHC.Types.Double GHC.Float.$f9 x_aLt y_aLv
Once I actually add a 'dist_fast_inline_caller', that indirection disappears in the inlined code, just as it does for dist_fast itself.
dist_fast_inlined_caller :: UArr Double -> UArr Double -> Bool dist_fast_inlined_caller p1 p2 = dist_fast_inlined p1 p2 > 2
However, in the simpl output for 'dist_fast_inline_caller', the 'sumU' and 'zipWithU' still don't seem to be fused - Don?
All the 'seq's and so on should be unnecessary, and even so, I still get the expected fusion: import Control.Monad import System.Environment import System.IO import Data.Array.Vector {- dist :: UArr Double -> UArr Double -> Double dist p1 p2 = sumU (zipWithU euclidean p1 p2) where euclidean x y = d*d where d = x-y -} main = do [dim] <- map read `fmap` getArgs print $ dist_fast_inlined (enumFromToFracU 1.0 dim) (enumFromToFracU 1.0 dim) dist_fast_inlined :: UArr Double -> UArr Double -> Double {-# INLINE dist_fast_inlined #-} dist_fast_inlined p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds = zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y {- 19 RuleFired 2 /## 3 SC:$wfold0 5 int2Double# 1 map 1 mapList 3 streamU/unstreamU 2 truncate/Double->Int 1 unpack 1 unpack-list $s$wfold_s1TB :: Double# -> Double# -> Double# -> Double# -}

Once I actually add a 'dist_fast_inline_caller', that indirection disappears in the inlined code, just as it does for dist_fast itself.
dist_fast_inlined_caller :: UArr Double -> UArr Double -> Bool dist_fast_inlined_caller p1 p2 = dist_fast_inlined p1 p2 > 2
However, in the simpl output for 'dist_fast_inline_caller', the 'sumU' and 'zipWithU' still don't seem to be fused - Don?
All the 'seq's and so on should be unnecessary, and even so, I still get the expected fusion:
As I said, I don't get the fusion if I just add the function above to the original Dist.hs, export it and compile the module with '-c -O2 -ddump-simpl': Dist.dist_fast_inlined_caller = \ (w1_s1nb :: Data.Array.Vector.UArr.UArr GHC.Types.Double) (w2_s1nc :: Data.Array.Vector.UArr.UArr GHC.Types.Double) -> case (Dist.$wzipWithU Dist.lvl2 w1_s1nb w2_s1nc) `cast` (trans Data.Array.Vector.UArr.TFCo:R56:UArr Data.Array.Vector.UArr.NTCo:R56:UArr :: Data.Array.Vector.UArr.UArr GHC.Types.Double ~ Data.Array.Vector.Prim.BUArr.BUArr GHC.Types.Double) of _ { Data.Array.Vector.Prim.BUArr.BUArr ipv_s1lb ipv1_s1lc ipv2_s1ld -> letrec { $wfold_s1nN :: GHC.Prim.Double# -> GHC.Prim.Int# -> GHC.Prim.Double# LclId [Arity 2 Str: DmdType LL] $wfold_s1nN = \ (ww_s1mZ :: GHC.Prim.Double#) (ww1_s1n3 :: GHC.Prim.Int#) -> case GHC.Prim.==# ww1_s1n3 ipv1_s1lc of _ { GHC.Bool.False -> $wfold_s1nN (GHC.Prim.+## ww_s1mZ (GHC.Prim.indexDoubleArray# ipv2_s1ld (GHC.Prim.+# ipv_s1lb ww1_s1n3))) (GHC.Prim.+# ww1_s1n3 1); GHC.Bool.True -> ww_s1mZ }; } in case $wfold_s1nN 0.0 0 of ww_s1n7 { __DEFAULT -> GHC.Prim.>## (GHC.Prim.sqrtDouble# ww_s1n7) 2.0 } } Claus

claus.reinke:
Once I actually add a 'dist_fast_inline_caller', that indirection disappears in the inlined code, just as it does for dist_fast itself.
dist_fast_inlined_caller :: UArr Double -> UArr Double -> Bool dist_fast_inlined_caller p1 p2 = dist_fast_inlined p1 p2 > 2
However, in the simpl output for 'dist_fast_inline_caller', the 'sumU' and 'zipWithU' still don't seem to be fused - Don?
All the 'seq's and so on should be unnecessary, and even so, I still get the expected fusion:
As I said, I don't get the fusion if I just add the function above to the original Dist.hs, export it and compile the module with '-c -O2 -ddump-simpl':
I can't reproduce this. Does the complete program fragment I posted earlier yield the desired result? -- Don

As I said, I don't get the fusion if I just add the function above to the original Dist.hs, export it and compile the module with '-c -O2 -ddump-simpl':
I can't reproduce this.
Interesting. I'm using ghc 6.11.20090320 (windows), uvector-0.1.0.3. I attach the modified Dist.hs and its simpl output, created via: ghc -c Dist.hs -O2 -ddump-tc -ddump-simpl-stats -ddump-simpl > Dist.dumps Perhaps others can confirm the effect? Note that the 'dist_fast' in the same module does get fused, so it is not likely an options issue. I still suspect that the inlining of the 'Dist.zipWith' wrapper in the 'dist_fast_inlined' '__inline_me' has some significance - it is odd to see inlined code in an '__inline_me' and the fusion rule won't trigger on 'Dist.sumU . Dist.$wzipWithU', right?
Does the complete program fragment I posted earlier yield the desired result?
Yes. Note that the original poster also reported slowdown from use of 'dist_fast_inlined'. Claus

On May 18, 2009, at 20:54 , Claus Reinke wrote:
As I said, I don't get the fusion if I just add the function above to the original Dist.hs, export it and compile the module with '-c -O2 -ddump-simpl': I can't reproduce this.
Interesting. I'm using ghc 6.11.20090320 (windows), uvector-0.1.0.3. I attach the modified Dist.hs and its simpl output, created via:
ghc -c Dist.hs -O2 -ddump-tc -ddump-simpl-stats -ddump-simpl > Dist.dumps
Perhaps others can confirm the effect? Note that the 'dist_fast' in the same module does get fused, so it is not likely an options issue. I still suspect that the inlining of the 'Dist.zipWith' wrapper in the 'dist_fast_inlined' '__inline_me' has some significance - it is odd to see inlined code in an '__inline_me' and the fusion rule won't trigger on 'Dist.sumU . Dist. $wzipWithU', right?
As far as I can tell, the dist_fast_inlined doesn't get fused, i.e. I'm seeing zipWithU and sumU being used in it, which is not the case in dist_fast. This is on OS X/PowerPC, using GHC 6.10.1.
Does the complete program fragment I posted earlier yield the desired result?
Yes. Note that the original poster also reported slowdown from use of 'dist_fast_inlined'.
Don, you were defining dist inside the main module, while in our case the dist functions are defined in a seperate Dist.hs module... Would that matter? K. -- Kenneth Hoste Paris research group - ELIS - Ghent University, Belgium email: kenneth.hoste@elis.ugent.be website: http://www.elis.ugent.be/~kehoste blog: http://boegel.kejo.be

On May 18, 2009, at 15:28 , Claus Reinke wrote:
My current best try uses the uvector package, has two 'vectors' of type (UArr Double) as input, and relies on the sumU and zipWithU functions which use streaming to compute the result: dist_fast :: UArr Double -> UArr Double -> Double dist_fast p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds = zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y
You'll probably want to make sure that 'euclidian' is specialized to the types you need (here 'Double'), not used overloaded for 'Num a=>a' (check -ddump-tc, or -ddump-simpl output).
I understand from your later post that is was in fact specialized, but how do I make sure it _is_ specialized? Can I just add a type signature in the dist_fast definition for euclidean, or should I define euclidean outside of dist_fast, with an explicit type signature? If the latter, won't that hurt performance? Or should marking it INLINE take care of that?
After that, unrolling the fused fold loop (uvector internal) might help a bit, but isn't there yet:
http://hackage.haskell.org/trac/ghc/ticket/3123 http://hackage.haskell.org/trac/ghc/wiki/Inlining
And even if that gets implemented, it doesn't apply directly to your case, where the loop is in a library, but you might want to control its unrolling in your client code. Having the loop unrolled by a default factor (8x or so) should help for loops like this, with little computation.
This seems rather serious, and might be one of the bigger reasons why I'm getting nowhere close to C in terms of performance... The loop body is ridiculously small, so it would make sense to unroll it somewhat to help avoid the loop overhead. However, it seems like GHC isn't able to do that now. Is there any way to unroll the loop myself, to speed things up? Seems hard, because I'm using uvector... K. -- Kenneth Hoste Paris research group - ELIS - Ghent University, Belgium email: kenneth.hoste@elis.ugent.be website: http://www.elis.ugent.be/~kehoste blog: http://boegel.kejo.be

I understand from your later post that is was in fact specialized, but how do I make sure it _is_ specialized?
-ddump-tc seems to give the generalized type, so it seems you'd need to look at the -ddump-simpl output if you want to know whether a local function is specialized. http://www.haskell.org/haskellwiki/Performance/GHC#Looking_at_the_Core
Can I just add a type signature in the dist_fast definition for euclidean,
If you need it at exactly one type, that is the simplest way. There's also the SPECIALIZE pragma http://www.haskell.org/ghc/docs/latest/html/users_guide/pragmas.html#special... and for local and non-exported functions '-O' should enable auto-specialization http://www.haskell.org/ghc/docs/latest/html/users_guide/faster.html
After that, unrolling the fused fold loop (uvector internal) might help a bit, but isn't there yet:
http://hackage.haskell.org/trac/ghc/ticket/3123 http://hackage.haskell.org/trac/ghc/wiki/Inlining
And even if that gets implemented, it doesn't apply directly to your case, where the loop is in a library, but you might want to control its unrolling in your client code. Having the loop unrolled by a default factor (8x or so) should help for loops like this, with little computation.
This seems rather serious, and might be one of the bigger reasons why I'm getting nowhere close to C in terms of performance...
You should be able to get near (a small factor) without it, but not having it leaves a substantial gap in performance, especially for simple loop bodies (there is also the case of enabling fusion over multiple loop iterations, but that may call for proper fix-point fusion).
The loop body is ridiculously small, so it would make sense to unroll it somewhat to help avoid the loop overhead. However, it seems like GHC isn't able to do that now.
Apart from the links above, the new backend also has relevant TODO items: http://hackage.haskell.org/trac/ghc/wiki/BackEndNotes
Is there any way to unroll the loop myself, to speed things up? Seems hard, because I'm using uvector...
You could 'cabal unpack' uvector, hand-unroll the core loop all its operations get fused into, then reinstall the modified package.. (perhaps that should be a package configuration option, at least until GHC gets loop or recursion unrolling - since this would be a temporary workaround, it would probably not be worth it to have multiple package versions with different unroll factors; if this actually helps uvector users with performance in practice, they could suggest it as a patch). Claus

kenneth.hoste:
Hello,
For a while now, I've been trying to come up with a fast Haskell-only function which implements Euclidean distance in n-dimensional space.
So far, I've been disappointed by the performance of my best effort in Haskell, compared to C. I'm hoping some of the Haskell experts and/or performance gurus on this list can help me out on resolving this, and also maybe shed some light on some strange (imho) things I've run into.
My current best try uses the uvector package, has two 'vectors' of type (UArr Double) as input, and relies on the sumU and zipWithU functions which use streaming to compute the result:
dist_fast :: UArr Double -> UArr Double -> Double dist_fast p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds = zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y
The problem in your uvector code is the use of lists, rather than uvector generators. Replace [1..n] with enumFromTo: import Control.Monad import System.Environment import System.IO import Data.Array.Vector dist :: UArr Double -> UArr Double -> Double dist p1 p2 = sumU (zipWithU euclidean p1 p2) where euclidean x y = d*d where d = x-y main = do [dim] <- map read `fmap` getArgs print $ dist (enumFromToFracU 1.0 dim) (enumFromToFracU 1.0 dim) Now the entire thiing will fuse to a loop. $s$wfold_s1RR :: Double# -> Double# -> Double# -> Double# $s$wfold_s1RR = \ (sc_s1RH :: Double#) (sc1_s1RI :: Double#) (sc2_s1RJ :: Double#) -> case >## sc1_s1RI a5_s1QR of wild4_a1tn { False -> case >## sc_s1RH a5_s1QR of wild5_X1vx { False -> let { x1_a1Jg [ALWAYS Just L] :: Double# x1_a1Jg = -## sc1_s1RI sc_s1RH } in $s$wfold_s1RR (+## sc_s1RH 1.0) (+## sc1_s1RI 1.0) (+## sc2_s1RJ (*## x1_a1Jg x1_a1Jg)); True -> sc2_s1RJ }; True -> sc2_s1RJ }; } in case $s$wfold_s1RR 1.0 1.0 0.0 of ww_s1QH { __DEFAULT -> a19 (D# ww_s1QH) and this assembly: $ ghc -O2 -fvia-C -optc-O3 s1T9_info: movsd 5(%rbx), %xmm7 ucomisd %xmm7, %xmm6 ja .L15 ucomisd %xmm7, %xmm5 jbe .L18 .L14: .L15: movsd (%rbp), %xmm5 leaq 8(%rbp), %rbp jmp *(%rbp) .L18: movapd %xmm6, %xmm7 subsd %xmm5, %xmm7 mulsd %xmm7, %xmm7 addsd (%rbp), %xmm7 movsd %xmm7, (%rbp) movsd .LC0(%rip), %xmm7 addsd %xmm7, %xmm5 addsd %xmm7, %xmm6 jmp s1T9_info Which I'd wager will match the C, which has to allocate the two arrays (GHC essentially decides it doesn't need the arrays any more. -- Don

Hi Don (and cafe), Given the example you just posted, is there a simple way to generate the de-sugared haskell / core / STG / labelled-assembly versions of a piece of haskell code? For instance, how did you generate the content below? I guess this is the core language version? I'm a C/C++ coder and looking for the equivalent of "Show Disassembly". Cheers, Sam -----Original Message----- From: haskell-cafe-bounces@haskell.org on behalf of Don Stewart Sent: Mon 18/05/2009 14:50 To: Kenneth Hoste Cc: Haskell Cafe mailing list Subject: Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C kenneth.hoste:
Hello,
For a while now, I've been trying to come up with a fast Haskell-only function which implements Euclidean distance in n-dimensional space.
So far, I've been disappointed by the performance of my best effort in Haskell, compared to C. I'm hoping some of the Haskell experts and/or performance gurus on this list can help me out on resolving this, and also maybe shed some light on some strange (imho) things I've run into.
My current best try uses the uvector package, has two 'vectors' of type (UArr Double) as input, and relies on the sumU and zipWithU functions which use streaming to compute the result:
dist_fast :: UArr Double -> UArr Double -> Double dist_fast p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds = zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y
The problem in your uvector code is the use of lists, rather than uvector generators. Replace [1..n] with enumFromTo: import Control.Monad import System.Environment import System.IO import Data.Array.Vector dist :: UArr Double -> UArr Double -> Double dist p1 p2 = sumU (zipWithU euclidean p1 p2) where euclidean x y = d*d where d = x-y main = do [dim] <- map read `fmap` getArgs print $ dist (enumFromToFracU 1.0 dim) (enumFromToFracU 1.0 dim) Now the entire thiing will fuse to a loop. $s$wfold_s1RR :: Double# -> Double# -> Double# -> Double# $s$wfold_s1RR = \ (sc_s1RH :: Double#) (sc1_s1RI :: Double#) (sc2_s1RJ :: Double#) -> case >## sc1_s1RI a5_s1QR of wild4_a1tn { False -> case >## sc_s1RH a5_s1QR of wild5_X1vx { False -> let { x1_a1Jg [ALWAYS Just L] :: Double# x1_a1Jg = -## sc1_s1RI sc_s1RH } in $s$wfold_s1RR (+## sc_s1RH 1.0) (+## sc1_s1RI 1.0) (+## sc2_s1RJ (*## x1_a1Jg x1_a1Jg)); True -> sc2_s1RJ }; True -> sc2_s1RJ }; } in case $s$wfold_s1RR 1.0 1.0 0.0 of ww_s1QH { __DEFAULT -> a19 (D# ww_s1QH) and this assembly: $ ghc -O2 -fvia-C -optc-O3 s1T9_info: movsd 5(%rbx), %xmm7 ucomisd %xmm7, %xmm6 ja .L15 ucomisd %xmm7, %xmm5 jbe .L18 .L14: .L15: movsd (%rbp), %xmm5 leaq 8(%rbp), %rbp jmp *(%rbp) .L18: movapd %xmm6, %xmm7 subsd %xmm5, %xmm7 mulsd %xmm7, %xmm7 addsd (%rbp), %xmm7 movsd %xmm7, (%rbp) movsd .LC0(%rip), %xmm7 addsd %xmm7, %xmm5 addsd %xmm7, %xmm6 jmp s1T9_info Which I'd wager will match the C, which has to allocate the two arrays (GHC essentially decides it doesn't need the arrays any more. -- Don _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Yes, I use the ghc-core tool: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/ghc-core sam.martin:
Hi Don (and cafe),
Given the example you just posted, is there a simple way to generate the de-sugared haskell / core / STG / labelled-assembly versions of a piece of haskell code? For instance, how did you generate the content below? I guess this is the core language version?
I'm a C/C++ coder and looking for the equivalent of "Show Disassembly".
Cheers, Sam
-----Original Message----- From: haskell-cafe-bounces@haskell.org on behalf of Don Stewart Sent: Mon 18/05/2009 14:50 To: Kenneth Hoste Cc: Haskell Cafe mailing list Subject: Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C
kenneth.hoste:
Hello,
For a while now, I've been trying to come up with a fast Haskell-only function which implements Euclidean distance in n-dimensional space.
So far, I've been disappointed by the performance of my best effort in Haskell, compared to C. I'm hoping some of the Haskell experts and/or performance gurus on this list can help me out on resolving this, and also maybe shed some light on some strange (imho) things I've run into.
My current best try uses the uvector package, has two 'vectors' of type (UArr Double) as input, and relies on the sumU and zipWithU functions which use streaming to compute the result:
dist_fast :: UArr Double -> UArr Double -> Double dist_fast p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds = zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y
The problem in your uvector code is the use of lists, rather than uvector generators. Replace [1..n] with enumFromTo:
import Control.Monad import System.Environment import System.IO import Data.Array.Vector
dist :: UArr Double -> UArr Double -> Double dist p1 p2 = sumU (zipWithU euclidean p1 p2) where euclidean x y = d*d where d = x-y
main = do [dim] <- map read `fmap` getArgs
print $ dist (enumFromToFracU 1.0 dim) (enumFromToFracU 1.0 dim)
Now the entire thiing will fuse to a loop.
$s$wfold_s1RR :: Double# -> Double# -> Double# -> Double#
$s$wfold_s1RR = \ (sc_s1RH :: Double#) (sc1_s1RI :: Double#) (sc2_s1RJ :: Double#) -> case >## sc1_s1RI a5_s1QR of wild4_a1tn { False -> case >## sc_s1RH a5_s1QR of wild5_X1vx { False -> let { x1_a1Jg [ALWAYS Just L] :: Double#
x1_a1Jg = -## sc1_s1RI sc_s1RH } in $s$wfold_s1RR (+## sc_s1RH 1.0) (+## sc1_s1RI 1.0) (+## sc2_s1RJ (*## x1_a1Jg x1_a1Jg)); True -> sc2_s1RJ }; True -> sc2_s1RJ }; } in case $s$wfold_s1RR 1.0 1.0 0.0 of ww_s1QH { __DEFAULT -> a19 (D# ww_s1QH)
and this assembly:
$ ghc -O2 -fvia-C -optc-O3
s1T9_info: movsd 5(%rbx), %xmm7 ucomisd %xmm7, %xmm6 ja .L15 ucomisd %xmm7, %xmm5 jbe .L18 .L14: .L15: movsd (%rbp), %xmm5 leaq 8(%rbp), %rbp jmp *(%rbp) .L18: movapd %xmm6, %xmm7 subsd %xmm5, %xmm7 mulsd %xmm7, %xmm7 addsd (%rbp), %xmm7 movsd %xmm7, (%rbp) movsd .LC0(%rip), %xmm7 addsd %xmm7, %xmm5 addsd %xmm7, %xmm6 jmp s1T9_info
Which I'd wager will match the C, which has to allocate the two arrays (GHC essentially decides it doesn't need the arrays any more.
-- Don _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Brilliant! Thanks. -----Original Message----- From: Don Stewart [mailto:dons@galois.com] Sent: Mon 18/05/2009 16:21 To: Sam Martin Cc: Kenneth Hoste; Haskell Cafe mailing list Subject: Re: de-sugared code? (branch from: fast Eucl. dist. - Haskell vs C) Yes, I use the ghc-core tool: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/ghc-core sam.martin:
Hi Don (and cafe),
Given the example you just posted, is there a simple way to generate the de-sugared haskell / core / STG / labelled-assembly versions of a piece of haskell code? For instance, how did you generate the content below? I guess this is the core language version?
I'm a C/C++ coder and looking for the equivalent of "Show Disassembly".
Cheers, Sam
-----Original Message----- From: haskell-cafe-bounces@haskell.org on behalf of Don Stewart Sent: Mon 18/05/2009 14:50 To: Kenneth Hoste Cc: Haskell Cafe mailing list Subject: Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C
kenneth.hoste:
Hello,
For a while now, I've been trying to come up with a fast Haskell-only function which implements Euclidean distance in n-dimensional space.
So far, I've been disappointed by the performance of my best effort in Haskell, compared to C. I'm hoping some of the Haskell experts and/or performance gurus on this list can help me out on resolving this, and also maybe shed some light on some strange (imho) things I've run into.
My current best try uses the uvector package, has two 'vectors' of type (UArr Double) as input, and relies on the sumU and zipWithU functions which use streaming to compute the result:
dist_fast :: UArr Double -> UArr Double -> Double dist_fast p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds = zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y
The problem in your uvector code is the use of lists, rather than uvector generators. Replace [1..n] with enumFromTo:
import Control.Monad import System.Environment import System.IO import Data.Array.Vector
dist :: UArr Double -> UArr Double -> Double dist p1 p2 = sumU (zipWithU euclidean p1 p2) where euclidean x y = d*d where d = x-y
main = do [dim] <- map read `fmap` getArgs
print $ dist (enumFromToFracU 1.0 dim) (enumFromToFracU 1.0 dim)
Now the entire thiing will fuse to a loop.
$s$wfold_s1RR :: Double# -> Double# -> Double# -> Double#
$s$wfold_s1RR = \ (sc_s1RH :: Double#) (sc1_s1RI :: Double#) (sc2_s1RJ :: Double#) -> case >## sc1_s1RI a5_s1QR of wild4_a1tn { False -> case >## sc_s1RH a5_s1QR of wild5_X1vx { False -> let { x1_a1Jg [ALWAYS Just L] :: Double#
x1_a1Jg = -## sc1_s1RI sc_s1RH } in $s$wfold_s1RR (+## sc_s1RH 1.0) (+## sc1_s1RI 1.0) (+## sc2_s1RJ (*## x1_a1Jg x1_a1Jg)); True -> sc2_s1RJ }; True -> sc2_s1RJ }; } in case $s$wfold_s1RR 1.0 1.0 0.0 of ww_s1QH { __DEFAULT -> a19 (D# ww_s1QH)
and this assembly:
$ ghc -O2 -fvia-C -optc-O3
s1T9_info: movsd 5(%rbx), %xmm7 ucomisd %xmm7, %xmm6 ja .L15 ucomisd %xmm7, %xmm5 jbe .L18 .L14: .L15: movsd (%rbp), %xmm5 leaq 8(%rbp), %rbp jmp *(%rbp) .L18: movapd %xmm6, %xmm7 subsd %xmm5, %xmm7 mulsd %xmm7, %xmm7 addsd (%rbp), %xmm7 movsd %xmm7, (%rbp) movsd .LC0(%rip), %xmm7 addsd %xmm7, %xmm5 addsd %xmm7, %xmm6 jmp s1T9_info
Which I'd wager will match the C, which has to allocate the two arrays (GHC essentially decides it doesn't need the arrays any more.
-- Don _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

That's the output of ghc -ddump-simpl.
2009/5/18 Sam Martin
Hi Don (and cafe),
Given the example you just posted, is there a simple way to generate the de-sugared haskell / core / STG / labelled-assembly versions of a piece of haskell code? For instance, how did you generate the content below? I guess this is the core language version?
I'm a C/C++ coder and looking for the equivalent of "Show Disassembly".
Cheers, Sam
-----Original Message----- From: haskell-cafe-bounces@haskell.org on behalf of Don Stewart Sent: Mon 18/05/2009 14:50 To: Kenneth Hoste Cc: Haskell Cafe mailing list Subject: Re: [Haskell-cafe] fast Eucl. dist. - Haskell vs C
kenneth.hoste:
Hello,
For a while now, I've been trying to come up with a fast Haskell-only function which implements Euclidean distance in n-dimensional space.
So far, I've been disappointed by the performance of my best effort in Haskell, compared to C. I'm hoping some of the Haskell experts and/or performance gurus on this list can help me out on resolving this, and also maybe shed some light on some strange (imho) things I've run into.
My current best try uses the uvector package, has two 'vectors' of type (UArr Double) as input, and relies on the sumU and zipWithU functions which use streaming to compute the result:
dist_fast :: UArr Double -> UArr Double -> Double dist_fast p1 p2 = sumDs `seq` sqrt sumDs where sumDs = sumU ds ds = zipWithU euclidean p1 p2 euclidean x y = d*d where d = x-y
The problem in your uvector code is the use of lists, rather than uvector generators. Replace [1..n] with enumFromTo:
import Control.Monad import System.Environment import System.IO import Data.Array.Vector
dist :: UArr Double -> UArr Double -> Double dist p1 p2 = sumU (zipWithU euclidean p1 p2) where euclidean x y = d*d where d = x-y
main = do [dim] <- map read `fmap` getArgs
print $ dist (enumFromToFracU 1.0 dim) (enumFromToFracU 1.0 dim)
Now the entire thiing will fuse to a loop.
$s$wfold_s1RR :: Double# -> Double# -> Double# -> Double#
$s$wfold_s1RR = \ (sc_s1RH :: Double#) (sc1_s1RI :: Double#) (sc2_s1RJ :: Double#) -> case >## sc1_s1RI a5_s1QR of wild4_a1tn { False -> case >## sc_s1RH a5_s1QR of wild5_X1vx { False -> let { x1_a1Jg [ALWAYS Just L] :: Double#
x1_a1Jg = -## sc1_s1RI sc_s1RH } in $s$wfold_s1RR (+## sc_s1RH 1.0) (+## sc1_s1RI 1.0) (+## sc2_s1RJ (*## x1_a1Jg x1_a1Jg)); True -> sc2_s1RJ }; True -> sc2_s1RJ }; } in case $s$wfold_s1RR 1.0 1.0 0.0 of ww_s1QH { __DEFAULT -> a19 (D# ww_s1QH)
and this assembly:
$ ghc -O2 -fvia-C -optc-O3
s1T9_info: movsd 5(%rbx), %xmm7 ucomisd %xmm7, %xmm6 ja .L15 ucomisd %xmm7, %xmm5 jbe .L18 .L14: .L15: movsd (%rbp), %xmm5 leaq 8(%rbp), %rbp jmp *(%rbp) .L18: movapd %xmm6, %xmm7 subsd %xmm5, %xmm7 mulsd %xmm7, %xmm7 addsd (%rbp), %xmm7 movsd %xmm7, (%rbp) movsd .LC0(%rip), %xmm7 addsd %xmm7, %xmm5 addsd %xmm7, %xmm6 jmp s1T9_info
Which I'd wager will match the C, which has to allocate the two arrays (GHC essentially decides it doesn't need the arrays any more.
-- Don _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
-- Eugene Kirpichov Web IR developer, market.yandex.ru
participants (5)
-
Claus Reinke
-
Don Stewart
-
Eugene Kirpichov
-
Kenneth Hoste
-
Sam Martin