
hi, i've attached an example program which seems to indicate that the magnitude function from Data.Complex is very slow compared to a more naive implementation (for Complex Float). on my machine (intel core2 duo, osx 10.4) the CPU time using the library function is about 6-7 times as much as when using the other function. any ideas what might be going on? any flaws in my measurement code? thanks, <sk>

On Thu, 17 Jul 2008, stefan kersten wrote:
i've attached an example program which seems to indicate that the magnitude function from Data.Complex is very slow compared to a more naive implementation (for Complex Float). on my machine (intel core2 duo, osx 10.4) the CPU time using the library function is about 6-7 times as much as when using the other function. any ideas what might be going on? any flaws in my measurement code?
Complex.magnitude must prevent overflows, that is, if you just square 1e200::Double you get an overflow, although the end result may be also around 1e200. I guess, that to this end Complex.magnitude will separate mantissa and exponent, but this is done via Integers, I'm afraid.
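To make the overflow concrete, here is a small self-contained check; naiveMag is a made-up helper for illustration, not code from this thread:

    import Data.Complex (Complex((:+)), magnitude)

    -- With x = y = 1e200 :: Double, x*x overflows to Infinity, so the
    -- naive formula returns Infinity, while the scaling magnitude from
    -- Data.Complex returns roughly 1.4142e200.
    naiveMag :: Complex Double -> Double
    naiveMag (x :+ y) = sqrt (x*x + y*y)

    main :: IO ()
    main = do
        let z = 1e200 :+ 1e200 :: Complex Double
        print (naiveMag z)    -- Infinity
        print (magnitude z)   -- ~1.4142135623730951e200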

On Thu, Jul 17, 2008 at 05:18:01PM +0200, Henning Thielemann wrote:
On Thu, 17 Jul 2008, stefan kersten wrote:
i've attached an example program which seems to indicate that the magnitude function from Data.Complex is very slow compared to a more naive implementation (for Complex Float). on my machine (intel core2 duo, osx 10.4) the CPU time using the library function is about 6-7 times as much as when using the other function. any ideas what might be going on? any flaws in my measurement code?
Complex.magnitude must prevent overflows, that is, if you just square 1e200::Double you get an overflow, although the end result may be also around 1e200. I guess, that to this end Complex.magnitude will separate mantissa and exponent, but this is done via Integers, I'm afraid.
Here's the code:

    {-# SPECIALISE magnitude :: Complex Double -> Double #-}
    magnitude :: (RealFloat a) => Complex a -> a
    magnitude (x:+y) = scaleFloat k
                         (sqrt ((scaleFloat mk x)^(2::Int) +
                                (scaleFloat mk y)^(2::Int)))
        where k  = max (exponent x) (exponent y)
              mk = - k

So the slowdown may be due to the scaling, presumably to prevent overflow as you say. However, the e^(2 :: Int) may also be causing a slowdown, as (^) is lazy in its first argument; I'm not sure if there is a rule that will rewrite that to e*e. Stefan, perhaps you can try timing with the above code, and also with:

    {-# SPECIALISE magnitude :: Complex Double -> Double #-}
    magnitude :: (RealFloat a) => Complex a -> a
    magnitude (x:+y) = scaleFloat k
                         (sqrt (sqr (scaleFloat mk x) +
                                sqr (scaleFloat mk y)))
        where k  = max (exponent x) (exponent y)
              mk = - k
              sqr x = x * x

and let us know what the results are?

Thanks
Ian

On 17.07.2008, at 17:42, Ian Lynagh wrote:
On Thu, Jul 17, 2008 at 05:18:01PM +0200, Henning Thielemann wrote:
Complex.magnitude must prevent overflows, that is, if you just square 1e200::Double you get an overflow, although the end result may be also around 1e200. I guess, that to this end Complex.magnitude will separate mantissa and exponent, but this is done via Integers, I'm afraid.
Here's the code:
    {-# SPECIALISE magnitude :: Complex Double -> Double #-}
    magnitude :: (RealFloat a) => Complex a -> a
    magnitude (x:+y) = scaleFloat k
                         (sqrt ((scaleFloat mk x)^(2::Int) +
                                (scaleFloat mk y)^(2::Int)))
        where k  = max (exponent x) (exponent y)
              mk = - k
So the slowdown may be due to the scaling, presumably to prevent overflow as you say. However, the e^(2 :: Int) may also be causing a slowdown, as (^) is lazy in its first argument; I'm not sure if there is a rule that will rewrite that to e*e. Stefan, perhaps you can try timing with the above code, and also with:
    {-# SPECIALISE magnitude :: Complex Double -> Double #-}
    magnitude :: (RealFloat a) => Complex a -> a
    magnitude (x:+y) = scaleFloat k
                         (sqrt (sqr (scaleFloat mk x) +
                                sqr (scaleFloat mk y)))
        where k  = max (exponent x) (exponent y)
              mk = - k
              sqr x = x * x
and let us know what the results are?
thanks ian, here are the absolute runtimes (non-instrumented code) and the corresponding entries in the profile:

    c_magnitude0 (Data.Complex.magnitude)          0m7.249s
    c_magnitude1 (non-scaling version)             0m1.176s
    c_magnitude2 (scaling version, strict square)  0m3.278s

                   %time  %alloc (inherited)
    c_magnitude0    91.6   90.2
    c_magnitude1    41.7   49.6
    c_magnitude2    81.5   71.1

interestingly, just pasting the original ghc library implementation seems to slow things down considerably (0m12.264s) when compiling with

    -O2 -funbox-strict-fields -fvia-C -optc-O2 -fdicts-cheap -fno-method-sharing -fglasgow-exts

when leaving out -fdicts-cheap and -fno-method-sharing, the execution time for the pasted library code drops to 0m6.873s. it seems like some options that are useful (or even necessary?) for stream fusion rule reduction may produce non-optimal code in other cases?

<sk>
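The measurement program Stefan attached is not included in the archive; a minimal harness in the same spirit might look like the sketch below (module layout, names, and the input data are assumptions, not the attached code):

    import Data.Complex (Complex((:+)), magnitude)
    import Data.List (foldl')
    import System.CPUTime (getCPUTime)
    import Text.Printf (printf)

    -- Sum the magnitudes of a large list of complex numbers and report
    -- the CPU time taken.  The strict left fold keeps the accumulator
    -- evaluated, so the loop mostly measures the magnitude call itself.
    timeIt :: String -> (Complex Float -> Float) -> [Complex Float] -> IO ()
    timeIt name f zs = do
        start <- getCPUTime
        let s = foldl' (\acc z -> acc + f z) 0 zs
        end   <- s `seq` getCPUTime
        printf "%-18s sum = %g, cpu = %.3f s\n" name s
               (fromIntegral (end - start) * 1e-12 :: Double)

    main :: IO ()
    main = do
        let zs = [ fromIntegral i :+ fromIntegral (i + 1)
                 | i <- [1 .. 1000000 :: Int] ]
        timeIt "Data.Complex"    magnitude zs
        timeIt "naive magnitude" (\(x :+ y) -> sqrt (x*x + y*y)) zs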

On Thu, Jul 17, 2008 at 06:56:52PM +0200, stefan kersten wrote:
    c_magnitude0 (Data.Complex.magnitude)          0m7.249s
    c_magnitude1 (non-scaling version)             0m1.176s
    c_magnitude2 (scaling version, strict square)  0m3.278s
Thanks! I've filed a bug here:

    http://hackage.haskell.org/trac/ghc/ticket/2450

Ian

On Thu, 17 Jul 2008, Ian Lynagh wrote:
On Thu, Jul 17, 2008 at 05:18:01PM +0200, Henning Thielemann wrote:
Complex.magnitude must prevent overflows, that is, if you just square 1e200::Double you get an overflow, although the end result may be also around 1e200. I guess, that to this end Complex.magnitude will separate mantissa and exponent, but this is done via Integers, I'm afraid.
Here's the code:
    {-# SPECIALISE magnitude :: Complex Double -> Double #-}
    magnitude :: (RealFloat a) => Complex a -> a
    magnitude (x:+y) = scaleFloat k
                         (sqrt ((scaleFloat mk x)^(2::Int) +
                                (scaleFloat mk y)^(2::Int)))
        where k  = max (exponent x) (exponent y)
              mk = - k
So the slowdown may be due to the scaling, presumably to prevent overflow as you say. However, the e^(2 :: Int) may also be causing a slowdown, as (^) is lazy in its first argument; I'm not sure if there is a rule that will rewrite that to e*e.
I guess the rule should be INLINE.

Indeed, here you can see
    http://darcs.haskell.org/packages/base/GHC/Float.lhs
that scaleFloat calls decodeFloat and encodeFloat, both of which go through Integer. I expect that most FPUs are able to split a floating point number into exponent and mantissa, but GHC seems not to have a primitive for it?

As a quick work-around, Complex.magnitude could check whether the arguments are too big, then use scaleFloat, and otherwise use the naive algorithm. C's math library has the function 'frexp'
    http://www.codecogs.com/reference/c/math.h/frexp.php?alias=
but calling an external function will be slower than a comparison that can be performed by a single FPU instruction.
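A rough sketch of that work-around, guarding only against overflow (the cutoff below is an illustrative guess, not a tuned constant, and underflow of very small components is not handled):

    import Data.Complex (Complex((:+)))

    -- Hybrid magnitude: use the cheap formula when the components are
    -- small enough that squaring cannot overflow, and fall back to the
    -- scaling version otherwise.
    magnitudeHybrid :: (RealFloat a) => Complex a -> a
    magnitudeHybrid (x :+ y)
        | k < cutoff = sqrt (x*x + y*y)     -- naive fast path
        | otherwise  = scaleFloat k         -- scaled safe path
                         (sqrt (sqr (scaleFloat mk x) + sqr (scaleFloat mk y)))
        where
            k      = max (exponent x) (exponent y)
            mk     = negate k
            sqr z  = z * z
            -- stay well below half the maximum exponent so x*x cannot overflow
            cutoff = snd (floatRange x) `div` 2 - 1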

On 17.07.2008, at 17:18, Henning Thielemann wrote:
i've attached an example program which seems to indicate that the magnitude function from Data.Complex is very slow compared to a more naive implementation (for Complex Float). on my machine (intel core2 duo, osx 10.4) the CPU time using the library function is about 6-7 times as much as when using the other function. any ideas what might be going on? any flaws in my measurement code?
Complex.magnitude must prevent overflows, that is, if you just square 1e200::Double you get an overflow, although the end result may be also around 1e200. I guess, that to this end Complex.magnitude will separate mantissa and exponent, but this is done via Integers, I'm afraid.
very enlightening, thanks! it might be possible to (almost) get the best of two worlds (ported from dejagnu's libm):

    c_magnitude4 :: Complex Float -> Float
    c_magnitude4 (x:+y) = if x' < y' then mag y' x' else mag x' y'
        where
            x' = abs x
            y' = abs y
            sqr x = x * x
            mag a 0 = a
            mag a b = a * sqrt (1 + sqr (b/a))

is fast and doesn't overflow intermediate results but accuracy isn't so great ...

<sk>

If scaleFloat and exponent are implemented with bit twiddling they can be quite fast.
I have a feeling that they involve slow FFI calls in GHC (being the original author of a lot of the code involved).
On Thu, Jul 17, 2008 at 8:21 PM, stefan kersten wrote:
On 17.07.2008, at 17:18, Henning Thielemann wrote:
i've attached an example program which seems to indicate that the magnitude function from Data.Complex is very slow compared to a more naive implementation (for Complex Float). on my machine (intel core2 duo, osx 10.4) the CPU time using the library function is about 6-7 times as much as when using the other function. any ideas what might be going on? any flaws in my measurement code?
Complex.magnitude must prevent overflows, that is, if you just square 1e200::Double you get an overflow, although the end result may be also around 1e200. I guess, that to this end Complex.magnitude will separate mantissa and exponent, but this is done via Integers, I'm afraid.
very enlightening, thanks! it might be possible to (almost) get the best of two worlds (ported from dejagnu's libm):
    c_magnitude4 :: Complex Float -> Float
    c_magnitude4 (x:+y) = if x' < y' then mag y' x' else mag x' y'
        where
            x' = abs x
            y' = abs y
            sqr x = x * x
            mag a 0 = a
            mag a b = a * sqrt (1 + sqr (b/a))
is fast and doesn't overflow intermediate results but accuracy isn't so great ...
<sk>

On Fri, 18 Jul 2008, stefan kersten wrote:
On 17.07.2008, at 21:46, Lennart Augustsson wrote:
If scaleFloat and exponent are implemented with bit twiddling they can be quite fast.
is there a way in ghc to 'cast' between float/int32 and double/int64 (without going through memory)?
What speaks against going through memory? If you want to write a large chunk of Doubles to disk, you may write it to a StorableVector and write that to disk. However, shipping internal bit patterns to disk may hurt portability, because different machines can use different floating point formats. What other situations are there where you want to cast Float to Int32 and so on?

On 18.07.2008, at 13:05, Henning Thielemann wrote:
On Fri, 18 Jul 2008, stefan kersten wrote:
On 17.07.2008, at 21:46, Lennart Augustsson wrote:
If scaleFloat and exponent are implemented with bit twiddling they can be quite fast.
is there a way in ghc to 'cast' between float/int32 and double/int64 (without going through memory)?
What speaks against going through memory? If you want to write a large chunk of Doubles to disk, you may write it to a StorableVector and write that to disk. However, shipping internal bit patterns to disk may hurt portability, because different machines can use different floating point formats. What other situations are there where you want to cast Float to Int32 and so on?
for implementing scaleFloat and exponent at the bit level as lennart suggested, it would be preferable if the cast compiled to a register move/reinterpretation, rather than a store/load through several layers of abstraction (such as Data.Binary). i'm just curious ;) <sk>

On Fri, 18 Jul 2008, stefan kersten wrote:
for implementing scaleFloat and exponent at the bit level as lennart suggested, it would be preferable if the cast compiled to a register move/reinterpretation, rather than a store/load through several layers of abstraction (such as Data.Binary). i'm just curious ;)
Ok, but then you need a check whether the numbers on your machine are in IEEE format. This check could be done when configuring the 'base' package; the current implementation in GHC.Float could be used as a fall-back.
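For what it's worth, a runtime version of that check already exists as RealFloat's isIEEE method; a sketch of the dispatch Henning describes, with the fast path left as a placeholder:

    -- Dispatch on isIEEE and keep the portable code as a fall-back.
    -- bitTwiddledExponent is a hypothetical IEEE-only fast path; here it
    -- is just a stand-in that defers to the ordinary exponent.
    fastExponent :: (RealFloat a) => a -> Int
    fastExponent x
        | isIEEE x  = bitTwiddledExponent x   -- would inspect the bits directly
        | otherwise = exponent x              -- current GHC.Float behaviour
        where
            bitTwiddledExponent = exponent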

On Fri, 18 Jul 2008, stefan kersten wrote:
On 17.07.2008, at 21:46, Lennart Augustsson wrote:
If scaleFloat and exponent are implemented with bit twiddling they can be quite fast.
is there a way in ghc to 'cast' between float/int32 and double/int64 (without going through memory)?
I read this as "Is there any way to take a 32-bit float in a register and end up with a 32-bit int in a register, without going through memory, in GHC? How about the other way around? How about 64-bit floats and integers?"

It is rather hard to do portably in GHC what some hardware does not do. A long time ago hardware architects decided that separating the integer and floating point register files was a Good Thing. The machine I know best is SPARC, where there are no instructions to move data directly between integer registers and floating point registers.

On the other hand, there are other pressures, notably graphics. While integer and floating point registers may be kept separate, there are often "vector" instructions that let you operate on integers in the floating-point registers (such as the VIS instructions on SPARCs, which are technically optional). Exploiting such instructions tends to be rather non-portable.

All things considered, I wouldn't worry about it too much: the memory in question should be at the top of the stack, so should be in L1 cache, so should be pretty fast to access, and other things may be more significant.

On Jul 20, 2008, at 8:11 PM, Richard A. O'Keefe wrote:
I read this as
"Is there any way to take a 32-bit float in a register and end up with a 32-bit int in a register, without going through memory, in GHC? How about the other way around? How about 64-bit floats and integers?"
It is rather hard to do portably in GHC what some hardware does not do.
The reason to provide this as a primitive is so that code like Complex magnitude doesn't have to go through a completely opaque-to-the-compiler interface in order to extract bits from the underlying IEEE float representation. Sure, code generation can't assign it to a register, but it could be kept on the stack. Once we're peeking and poking through a Ptr (the trick that was suggested the last time this came up, too, if I remember rightly) we're sunk: GHC doesn't reason well at all about this stuff, and we need to allocate a Ptr to peek and poke. If instead we use a foreign call, again, it's completely compiler-opaque.
A long time ago hardware architects decided that separating the integer and floating point register files was a Good Thing. The machine I know best is SPARC, where there are no instructions to move data directly between integer registers and floating point registers.
Ironically, this is no longer true on Niagara-class (T1 and T2) SPARC machines. So nowadays this may be the only architecture that *does* provide this ability. But again, the really important thing is that the compiler can see that bits is bits and that the conversion involves no magic and no extra storage beyond a spill location on the stack.
... [stuff about SIMD snipped]...
All things considered, I wouldn't worry about it too much: the memory in question should be at the top of the stack, so should be in L1 cache, so should be pretty fast to access, and other things may be more significant.
But only if GHC can treat them transparently, which at the moment it evidently can't! All it'd take is a primitive from Float# -> Int32# and another from Double# -> Int64#...

-Jan-Willem Maessen
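For reference, the Ptr-based trick mentioned above (going through memory) looks roughly like this; the function names are made up, and the alloca/poke/peek round-trip is exactly the compiler-opaque step being complained about:

    import Data.Word (Word32, Word64)
    import Foreign.Marshal.Alloc (alloca)
    import Foreign.Ptr (castPtr)
    import Foreign.Storable (peek, poke)
    import System.IO.Unsafe (unsafePerformIO)

    -- Reinterpret the bits of a Float as a Word32 by writing the float
    -- to a temporary buffer and reading it back at a different type.
    floatToWord32 :: Float -> Word32
    floatToWord32 f = unsafePerformIO $ alloca $ \p ->
        poke p f >> peek (castPtr p)

    doubleToWord64 :: Double -> Word64
    doubleToWord64 d = unsafePerformIO $ alloca $ \p ->
        poke p d >> peek (castPtr p)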

sk:
On 17.07.2008, at 21:46, Lennart Augustsson wrote:
If scaleFloat and exponent are implemented with bit twiddling they can be quite fast.
is there a way in ghc to 'cast' between float/int32 and double/int64 (without going through memory)?
Yeah, "fromIntegral/Int->Float" fromIntegral = int2Float "truncate/Float->Int" truncate = float2Int "truncate/Double->Int" truncate = double2Int with appropriate no-op fromIntegrals for the 32 and 64 variants.

On 18.07.2008, at 19:47, Don Stewart wrote:
sk:
On 17.07.2008, at 21:46, Lennart Augustsson wrote:
If scaleFloat and exponent are implemented with bit twiddling they can be quite fast.
is there a way in ghc to 'cast' between float/int32 and double/int64 (without going through memory)?
Yeah,
"fromIntegral/Int->Float" fromIntegral = int2Float "truncate/Float->Int" truncate = float2Int
"truncate/Double->Int" truncate = double2Int
with appropriate no-op fromIntegrals for the 32 and 64 variants.
i actually meant something like *(uint32_t*)fptr, but thanks anyway ;) <sk>

sk:
On 18.07.2008, at 19:47, Don Stewart wrote:
sk:
On 17.07.2008, at 21:46, Lennart Augustsson wrote:
If scaleFloat and exponent are implemented with bit twiddling they can be quite fast.
is there a way in ghc to 'cast' between float/int32 and double/int64 (without going through memory)?
Yeah,
"fromIntegral/Int->Float" fromIntegral = int2Float "truncate/Float->Int" truncate = float2Int
"truncate/Double->Int" truncate = double2Int
with appropriate no-op fromIntegrals for the 32 and 64 variants.
i actually meant something like *(uint32_t*)fptr, but thanks anyway ;)
ah well, these should compile into as-close-to-noops as possible.

On Thu, 17 Jul 2008, stefan kersten wrote:
On 17.07.2008, at 17:18, Henning Thielemann wrote:
i've attached an example program which seems to indicate that the magnitude function from Data.Complex is very slow compared to a more naive implementation (for Complex Float). on my machine (intel core2 duo, osx 10.4) the CPU time using the library function is about 6-7 times as much as when using the other function. any ideas what might be going on? any flaws in my measurement code?
Complex.magnitude must prevent overflows, that is, if you just square 1e200::Double you get an overflow, although the end result may be also around 1e200. I guess, that to this end Complex.magnitude will separate mantissa and exponent, but this is done via Integers, I'm afraid.
very enlightening, thanks! it might be possible to (almost) get the best of two worlds (ported from dejagnu's libm):
    c_magnitude4 :: Complex Float -> Float
    c_magnitude4 (x:+y) = if x' < y' then mag y' x' else mag x' y'
        where
            x' = abs x
            y' = abs y
            sqr x = x * x
            mag a 0 = a
            mag a b = a * sqrt (1 + sqr (b/a))
is fast and doesn't overflow intermediate results but accuracy isn't so great ...
Yes, that's also a possible work-around. But division is quite slow.
participants (7):
- Don Stewart
- Henning Thielemann
- Ian Lynagh
- Jan-Willem Maessen
- Lennart Augustsson
- Richard A. O'Keefe
- stefan kersten