The cost of generality, or how expensive is realToFrac?

First, to anyone who recognizes me by name, thanks to the help I've been getting here I've managed to put together a fairly complex set of code files that all compile together nicely, run, and do exactly what I wanted them to do. Success! The trouble is that my implementation is dog slow Fortunately, this isn't the first time I've been in over my head and I started by putting up some simpler scaffolding- which runs much more quickly. Working backwards, it looks like the real bottle neck is in the data types I've created, the type variables I've introduced, and the conversion code I needed to insert to make it all happy. I'm not sure it helps, but I've attached a trimmed down version of the relevant code. What should be happening is my pair is being converted to the canonical form for Coord2D which is Cartesian2D and then converted again to Vertex2. There shouldn't be any change made to the values, they're only being handed from one container to another in this case (Polar coordinates would require computation, but I've stripped that out for the time being). However, those handoffs require calls to realToFrac to make the type system happy, and that has to be what is eating up all my CPU. I think there are probably 4 calls to realToFrac. If I walk through the code, the result, given the pair p, should be: Vertex2 (realToFrac (realToFrac (fst p))) (realToFrac (realToFrac (snd p))) I'd like to maintain type independence if possible, but I expect most uses of this code to feed Doubles in for processing and probably feed GLclampf (Floats, I believe) to the OpenGL layer. If there's a way to do so, I wouldn't mind optimizing for that particular set of types. I've tried GLdouble, and it doesn't really improve things with the current code. Is there a way to short circuit those realToFrac calls if we know the input and output are the same type? Is there a way merge the nested calls? Any other thoughts on what I can do here? The slow down between the two implementations is at least 20x, which seems like a steep penalty to pay. And while I'm at it, is turning on FlexibleInstances the only way to create an instance for (a,a)? Thanks-- Greg --I want to scatter plot a list of pairs of Doubles data ScatterPlot = ScatterPlot {scatterPoints :: [(Double,Double)] } --this way is plenty fast GL.renderPrimitive GL.Points $ mapM_ GL.vertex (map pair2vertex $ scatterPoints plot) --using this function in the map pair2vertex :: (a,a) -> GL.Vertex2 a pair2vertex (x,y) = GL.Vertex2 x y --then I started to get fancy and build custom types for coordinates, hoping to start developing a library useful for --cartesian, polar, 3D, etc. I needed to create this trivial function to resolve a type ambiguity coordToVertex2 :: Coord2D a => a -> (GL.Vertex2 GL.GLclampf) coordToVertex2 = coordToCoord2D --which I call instead of pair2vertex GL.renderPrimitive GL.Points $ mapM_ GL.vertex (map coordToVertex2 $ scatterPoints plot) --Coord2D is a typeclass I created to hold 2D data data Cartesian2D a = Cartesian2D a a deriving (Show, Eq, Read) class Coord2D a where xComponent :: (RealFloat b) => a -> b yComponent :: (RealFloat b) => a -> b toCartesian2D :: (RealFloat b) => a -> Cartesian2D b toCartesian2D p = Cartesian2D (xComponent p) (yComponent p) fromCartesian2D :: (RealFloat b) => Cartesian2D b -> a --and this function allows conversion between coordinate representations coordToCoord2D :: (Coord2D a, Coord2D b) => a -> b coordToCoord2D = fromCartesian2D . toCartesian2D --I think the only other interesting bit of code is the instance definitions: {- Pair instances -} instance (RealFloat a, RealFloat b) => Coord2D (a,b) where xComponent = realToFrac . fst yComponent = realToFrac . snd fromCartesian2D p = ((xComponent p),(yComponent p)) {- Cartesian 2D instances -} instance (RealFloat a) => Coord2D (Cartesian2D a) where xComponent (Cartesian2D x _) = realToFrac x yComponent (Cartesian2D _ y) = realToFrac y fromCartesian2D p = Cartesian2D (xComponent p) (yComponent p) {- Vertex2 instance -} instance (RealFloat a) => Coord2D (Vertex2 a) where xComponent (Vertex2 x _) = realToFrac x yComponent (Vertex2 _ y) = realToFrac y fromCartesian2D p = Vertex2 (xComponent p) (yComponent p)

On Wednesday 15 September 2010 02:51:01, Greg wrote:
First, to anyone who recognizes me by name, thanks to the help I've been getting here I've managed to put together a fairly complex set of code files that all compile together nicely, run, and do exactly what I wanted them to do. Success!
The trouble is that my implementation is dog slow
Fortunately, this isn't the first time I've been in over my head and I started by putting up some simpler scaffolding- which runs much more quickly. Working backwards, it looks like the real bottle neck is in the data types I've created, the type variables I've introduced, and the conversion code I needed to insert to make it all happy.
I'm not sure it helps, but I've attached a trimmed down version of the relevant code. What should be happening is my pair is being converted to the canonical form for Coord2D which is Cartesian2D and then converted again to Vertex2. There shouldn't be any change made to the values, they're only being handed from one container to another in this case (Polar coordinates would require computation, but I've stripped that out for the time being). However, those handoffs require calls to realToFrac to make the type system happy, and that has to be what is eating up all my CPU.
Not all, but probably a big chunk of it. The problem is that the default implementation of realToFrac is realToFrac = fromRational . toRational a) with that implementation, realToFrac :: Double -> Double is not the identity (doesn't respect NaNs) b) it's slow, there are no special operations to convert Double, Float etc. from/to Rational. For a lot of types, GHC provides rewrite rules (you need to compile with optimisations to have them fire) which give faster versions (with somewhat different behaviour, e.g. realToFrac :: Double -> Double is rewritten to id, realToFrac between Float and Double uses primitive widening/narrowing ops, for several newtype wrappers around Float/Double there are rules too).
I think there are probably 4 calls to realToFrac. If I walk through the code, the result, given the pair p, should be: Vertex2 (realToFrac (realToFrac (fst p))) (realToFrac (realToFrac (snd p)))
I'd like to maintain type independence if possible, but I expect most uses of this code to feed Doubles in for processing and probably feed GLclampf (Floats, I believe)
newtype wrapper around CFloat, which is a newtype wrapper around Float Unfortunately, there are no rewrite rules in the module where it is defined, apparently neither any other module that has access to the constructor. And the constructor is not accessible from any of the exposed modules, so as far as I know, you can't provide your own rewrite rules.
to the OpenGL layer. If there's a way to do so, I wouldn't mind optimizing for that particular set of types. I've tried GLdouble, and it doesn't really improve things with the current code.
Is there a way to short circuit those realToFrac calls if we know the input and output are the same type? Is there a way merge the nested calls?
You can try rewrite rules {-# RULES "realToFrac2/realToFrac" realToFrac . realToFrac = realToFrac "realToFrac/id" realToFrac = id #-} but I'm afraid the second won't work at all, then you'd have to specify all interesting cases yourself (there are rules for the cases Double -> Double and Float -> Float in GHC.Float, rules for converting from/to CFloat and CDouble in Foreign.C.Types, so those should be fine too) "realToFrac/GLclampf->GLclampf" realToFrac = id :: GLclampf -> GLclampf and what ese you need. Whether the first one will help (or even work), I don't know either, you have to try.
Any other thoughts on what I can do here? The slow down between the two implementations is at least 20x, which seems like a steep penalty to pay.
In case of emergency, put the needed rewrite rules into the source of OpenGLRaw yourself.
And while I'm at it, is turning on FlexibleInstances the only way to create an instance for (a,a)?
Yes. Haskell98 doesn't allow such instance declarations, so you need the extension.

Hey, thanks, Daniel. I hadn't come across rewrite rules yet. They definitely look like something worth learning, though I'm not sure I'm prepared to start making custom versions of OpenGL.Raw... It looks like I managed to put that battle off for another day, however. I did look at how realToFrac is implemented and (as you mention) it does the fromRational . toRational transform pair suggested in a number of sources, including Real World Haskell. Looking at what toRational is doing, creating a ratio of integers out of a float it seems like a crazy amount of effort to go through just to convert floating point numbers. Looking at the RealFloat class rather that Real and Fractional, it seems like this is a much more efficient way to go: floatToFloat :: (RealFloat a, RealFloat b) => a -> b floatToFloat = (uncurry encodeFloat) . decodeFloat I substituted this in for realToFrac and I'm back to close to my original performance. Playing with a few test cases in ghci, it looks numerically equivalent to realToFrac. This begs the question though-- am I doing something dangerous here? Why isn't this the standard approach? If I understand what's happening, decodeFloat and encodeFloat are breaking the floating point numbers up into their constituent parts-- presumably by bit masking the raw binary. That would explain the performance improvement. I suppose there is some implementation dependence here, but as long as the encode and decode are implemented as a matched set then I think I'm good. Cheers-- Greg On Sep 15, 2010, at 1:56 AM, Daniel Fischer wrote:
On Wednesday 15 September 2010 02:51:01, Greg wrote:
First, to anyone who recognizes me by name, thanks to the help I've been getting here I've managed to put together a fairly complex set of code files that all compile together nicely, run, and do exactly what I wanted them to do. Success!
The trouble is that my implementation is dog slow
Fortunately, this isn't the first time I've been in over my head and I started by putting up some simpler scaffolding- which runs much more quickly. Working backwards, it looks like the real bottle neck is in the data types I've created, the type variables I've introduced, and the conversion code I needed to insert to make it all happy.
I'm not sure it helps, but I've attached a trimmed down version of the relevant code. What should be happening is my pair is being converted to the canonical form for Coord2D which is Cartesian2D and then converted again to Vertex2. There shouldn't be any change made to the values, they're only being handed from one container to another in this case (Polar coordinates would require computation, but I've stripped that out for the time being). However, those handoffs require calls to realToFrac to make the type system happy, and that has to be what is eating up all my CPU.
Not all, but probably a big chunk of it. The problem is that the default implementation of realToFrac is
realToFrac = fromRational . toRational
a) with that implementation, realToFrac :: Double -> Double is not the identity (doesn't respect NaNs) b) it's slow, there are no special operations to convert Double, Float etc. from/to Rational.
For a lot of types, GHC provides rewrite rules (you need to compile with optimisations to have them fire) which give faster versions (with somewhat different behaviour, e.g. realToFrac :: Double -> Double is rewritten to id, realToFrac between Float and Double uses primitive widening/narrowing ops, for several newtype wrappers around Float/Double there are rules too).
I think there are probably 4 calls to realToFrac. If I walk through the code, the result, given the pair p, should be: Vertex2 (realToFrac (realToFrac (fst p))) (realToFrac (realToFrac (snd p)))
I'd like to maintain type independence if possible, but I expect most uses of this code to feed Doubles in for processing and probably feed GLclampf (Floats, I believe)
newtype wrapper around CFloat, which is a newtype wrapper around Float
Unfortunately, there are no rewrite rules in the module where it is defined, apparently neither any other module that has access to the constructor. And the constructor is not accessible from any of the exposed modules, so as far as I know, you can't provide your own rewrite rules.
to the OpenGL layer. If there's a way to do so, I wouldn't mind optimizing for that particular set of types. I've tried GLdouble, and it doesn't really improve things with the current code.
Is there a way to short circuit those realToFrac calls if we know the input and output are the same type? Is there a way merge the nested calls?
You can try rewrite rules
{-# RULES "realToFrac2/realToFrac" realToFrac . realToFrac = realToFrac "realToFrac/id" realToFrac = id #-}
but I'm afraid the second won't work at all, then you'd have to specify all interesting cases yourself (there are rules for the cases Double -> Double and Float -> Float in GHC.Float, rules for converting from/to CFloat and CDouble in Foreign.C.Types, so those should be fine too) "realToFrac/GLclampf->GLclampf" realToFrac = id :: GLclampf -> GLclampf and what ese you need. Whether the first one will help (or even work), I don't know either, you have to try.
Any other thoughts on what I can do here? The slow down between the two implementations is at least 20x, which seems like a steep penalty to pay.
In case of emergency, put the needed rewrite rules into the source of OpenGLRaw yourself.
And while I'm at it, is turning on FlexibleInstances the only way to create an instance for (a,a)?
Yes. Haskell98 doesn't allow such instance declarations, so you need the extension.

On Wednesday 15 September 2010 20:50:13, Greg wrote:
I hadn't come across rewrite rules yet. They definitely look like something worth learning,
Absolutely. GHC's optimiser is good, but there are a lot of cases where you need to push it via rewrite rules if you write polymorphic code or if you want to eliminate intermediate data structures (e.g. list fusion).
though I'm not sure I'm prepared to start making custom versions of OpenGL.Raw...
Yes, if you can work around the issues without that, it's better to leave it in peace :) Though you might ask the maintainer for rewrite rules.
It looks like I managed to put that battle off for another day, however. I did look at how realToFrac is implemented and (as you mention) it does the fromRational . toRational transform pair suggested in a number of sources, including Real World Haskell. Looking at what toRational is doing, creating a ratio of integers out of a float it seems like a crazy amount of effort to go through just to convert floating point numbers.
I just did some benchmarking. I benchmarked foldl' (+) 0 [convert (1 / intToDoub k) | k <- [1 .. 100000]] where intToDoub :: Int -> Double intToDoub = fromIntegral for several functions convert :: Double -> Float (actually, the type has been (RealFloat a, RealFloat b) => a -> b, but it was used at a = Double, b = Float). Everything was compiled with -O2, so the rewrite rules fired, in particular intToDoub was replaced by a primop (int2Double#), so that one's ultra cheap. For convert = realToFrac (hence by the rewrite rules the primop double2Float# was used), I got pretty good times, mean was 6.76 ms. For convert = floatToFloat from below, the times were not too bad, with a mean of 26.3 ms. A factor of roughly four for this benchmark (the factor for the conversion itself will be higher, but not exorbitantly) means it's usable in many situations, but not in performance critical situations where the conversion takes a significant amount of the running time. If you're converting to draw stuff with OpenGL (or some other graphics library), the conversion will take only a relatively small part of the time, so it's fine. For convert = fromRational . toRational (so no rewrite rules), the times were rather appalling: mean was 3.34 seconds. A factor of nearly 500 versus double2Float#. toRational is bad. Looking at instance Real Double where toRational x = (m%1)*(b%1)^^n where (m,n) = decodeFloat x b = floatRadix x (same for Float), and the implementations of (^^) and (^), {-# SPECIALISE (^) :: Integer -> Integer -> Integer, Integer -> Int -> Integer, Int -> Int -> Int #-} (^) :: (Num a, Integral b) => a -> b -> a x0 ^ y0 | y0 < 0 = error "Negative exponent" | y0 == 0 = 1 | otherwise = f x0 y0 where -- f : x0 ^ y0 = x ^ y f x y | even y = f (x * x) (y `quot` 2) | y == 1 = x | otherwise = g (x * x) ((y - 1) `quot` 2) x -- g : x0 ^ y0 = (x ^ y) * z g x y z | even y = g (x * x) (y `quot` 2) z | y == 1 = x * z | otherwise = g (x * x) ((y - 1) `quot` 2) (x * z) -- | raise a number to an integral power {-# SPECIALISE (^^) :: Rational -> Int -> Rational #-} (^^) :: (Fractional a, Integral b) => a -> b -> a x ^^ n = if n >= 0 then x^n else recip (x^(negate n)) together with the multiplication and recip for Rationals, I have to say ouch! There's no special implementation and rewrite rule for powers of Rationals, so on each multiplication in (^), the gcd of numerator and denominator is calculated, *although as powers of the original numerator and denominator they are guaranteed to be coprime*. Considering how slow a division of Integers is, awwwwwww noooooo. So let's look at a better implementation of toRational: toRat :: RealFloat a => a -> Rational toRat x = case decodeFloat x of (m,e) -> case floatRadix x of b -> if e < 0 then (m % (b^(negate e))) else (m * b^e) :% 1 (inlined a better implementation of powers for Rationals). Benchmarking convert = fromRational . toRat show a significant improvement, the mean dropped to 2.75 seconds. Still appalling, but it's a nice improvement and I don't see any quick opportunities to improve that conversion. So let's come to the last, fromRational. That's a compicated function, and unfortunately it has to be and I've no good idea to improve it. fromRational is really evil (in terms of clock cycles). Replacing fromRational with a dummy that just forces the evaluation of its argument and returns NaN, ±Infinity, or 0 for all real Rational values, dummy . toRational had a mean of 623.5 ms and dummy . toRat had a mean of 200.7 ms. So toRat is a jolly good improvement over toRational, but it's still awfully slow. And since fromRational takes much much longer anyway, it's a not too impressive gain for realToFrac.
Looking at the RealFloat class rather that Real and Fractional, it seems like this is a much more efficient way to go:
floatToFloat :: (RealFloat a, RealFloat b) => a -> b floatToFloat = (uncurry encodeFloat) . decodeFloat
Yes, that's much more efficient, as witnessed by the benchmark results. But.
I substituted this in for realToFrac and I'm back to close to my original performance. Playing with a few test cases in ghci, it looks numerically equivalent to realToFrac.
This begs the question though--
No. Sorry, but I can't bear that misuse: http://en.wikipedia.org/wiki/Begging_the_question It raises/demands/invites the question, but it doesn't beg it.
am I doing something dangerous here?
Yes and no.
Why isn't this the standard approach?
Because it will wreak unspeakable havoc when someone creates a RealFloat instance with a floatRadix > 2. A floatRadix of 10 or some power of 2 (16, 256?) could be even reasonable. But for conversions between RealFloat types with the same floatRadix, it's sort of okay, only it clobbers NaNs and (for some conversions) Infinities. However, realToFrac does that too.
If I understand what's happening, decodeFloat and encodeFloat are breaking the floating point numbers up into their constituent parts-- presumably by bit masking the raw binary.
Probably.
That would explain the performance improvement. I suppose there is some implementation dependence here, but as long as the encode and decode are implemented as a matched set then I think I'm good.
Not entirely matched, for Float -> Float and Double -> Double, NaN -> -Infinity, maybe denormalized values break too. ±Infinity is mapped to a finite value at Float -> Double But since toRational uses decodeFloat and fromRational uses encodeFloat, floatToFloat is in that respect no worse than realToFrac without rewrite rules.
Cheers-- Greg

Wow, thanks for all the analysis on this, Daniel! So, I think the summary of your evaluation is that realToFrac does just fine in almost all cases due to careful optimization and outperforms my floatToFloat trick (which answers the question I flagrantly begged... ;~) except that I can't access those optimizations because the OpenGL types are hidden behind newtypes and buried in a library. I've confirmed your results to make sure I could. I went a step further and typed convert for GLclampf expecting to see unoptimized performance, but it ran just as fast as Double->Float. convert :: Double -> GL.GLclampf convert = realToFrac And still ran faster than floatToFloat. However there's no denying that floatToFloat runs *much* faster than realToFrac in the larger application. Profiling shows floatToFloat taking about 50% of my CPU time, but the frame rate is close to (though not quite) 30 fps. Using realToFrac takes seconds just to display the first frame. As it stands, floatToFloat is responsible for 50% of my CPU time which I think is mainly a tribute the OpenGL implementation, but still makes it the obvious target for optimization. Having gotten all excited about the new tool in my box, I wrote the following rule: {-# RULES "floatToFloat/id" floatToFloat=id "floatToFloat x2" floatToFloat . floatToFloat = floatToFloat #-} Neither of which seems to fires in this application, but I did get the first one to fire by importing the same file into my benchmark. The next obvious step is to optimize a level up in the call hierarchy, and rewrite coordToCoord2D since I know my original pair2vertex function was faster. So, I added this rule in the file where Vertex2 is made an instance of Coord2D: {-# RULES "coordToCoord2D/p2v2" coordToCoord2D = pair2vertex #-} which refers to two functions, each in different files (I don't think that matters, but mention it just in case) pair2vertex :: (Num a) => (a,a) -> GL.Vertex2 a pair2vertex (x,y) = GL.Vertex2 x y coordToCoord2D :: (Coord2D a, Coord2D b) => a -> b coordToCoord2D = fromCartesian2D . toCartesian2D directly after my coordToCoord2D definition I have this rule as well: {-# RULES "coordToCoord2D/id" coordToCoord2D = id "coordToCoord2D x2" coordToCoord2D . coordToCoord2D = coordToCoord2D #-} I get a compile time error that I can't make sense of. It's asking me to put a context on my rule, but I can't find any references on how to do that... ----------- Could not deduce (Num a) from the context (Coord2D (Vertex2 a), Coord2D (a, a)) arising from a use of `pair2vertex' at GCB/OpenGL/Geometry.hs:32:40-50 Possible fix: add (Num a) to the context of the RULE "coordToCoord2D/p2v2" In the expression: pair2vertex When checking the transformation rule "coordToCoord2D/p2v2" ----------- The file:line:column is the "pair2vertex" token in the rule I list above. Cheers-- Greg On Sep 15, 2010, at 3:36 PM, Daniel Fischer wrote:
On Wednesday 15 September 2010 20:50:13, Greg wrote:
I hadn't come across rewrite rules yet. They definitely look like something worth learning,
Absolutely. GHC's optimiser is good, but there are a lot of cases where you need to push it via rewrite rules if you write polymorphic code or if you want to eliminate intermediate data structures (e.g. list fusion).
though I'm not sure I'm prepared to start making custom versions of OpenGL.Raw...
Yes, if you can work around the issues without that, it's better to leave it in peace :) Though you might ask the maintainer for rewrite rules.
It looks like I managed to put that battle off for another day, however. I did look at how realToFrac is implemented and (as you mention) it does the fromRational . toRational transform pair suggested in a number of sources, including Real World Haskell. Looking at what toRational is doing, creating a ratio of integers out of a float it seems like a crazy amount of effort to go through just to convert floating point numbers.
I just did some benchmarking.
I benchmarked
foldl' (+) 0 [convert (1 / intToDoub k) | k <- [1 .. 100000]]
where
intToDoub :: Int -> Double intToDoub = fromIntegral
for several functions
convert :: Double -> Float (actually, the type has been (RealFloat a, RealFloat b) => a -> b, but it was used at a = Double, b = Float). Everything was compiled with -O2, so the rewrite rules fired, in particular intToDoub was replaced by a primop (int2Double#), so that one's ultra cheap.
For convert = realToFrac (hence by the rewrite rules the primop double2Float# was used), I got pretty good times, mean was 6.76 ms.
For convert = floatToFloat from below, the times were not too bad, with a mean of 26.3 ms. A factor of roughly four for this benchmark (the factor for the conversion itself will be higher, but not exorbitantly) means it's usable in many situations, but not in performance critical situations where the conversion takes a significant amount of the running time. If you're converting to draw stuff with OpenGL (or some other graphics library), the conversion will take only a relatively small part of the time, so it's fine.
For convert = fromRational . toRational (so no rewrite rules), the times were rather appalling: mean was 3.34 seconds. A factor of nearly 500 versus double2Float#.
toRational is bad. Looking at
instance Real Double where toRational x = (m%1)*(b%1)^^n where (m,n) = decodeFloat x b = floatRadix x
(same for Float), and the implementations of (^^) and (^),
{-# SPECIALISE (^) :: Integer -> Integer -> Integer, Integer -> Int -> Integer, Int -> Int -> Int #-} (^) :: (Num a, Integral b) => a -> b -> a x0 ^ y0 | y0 < 0 = error "Negative exponent" | y0 == 0 = 1 | otherwise = f x0 y0 where -- f : x0 ^ y0 = x ^ y f x y | even y = f (x * x) (y `quot` 2) | y == 1 = x | otherwise = g (x * x) ((y - 1) `quot` 2) x -- g : x0 ^ y0 = (x ^ y) * z g x y z | even y = g (x * x) (y `quot` 2) z | y == 1 = x * z | otherwise = g (x * x) ((y - 1) `quot` 2) (x * z)
-- | raise a number to an integral power {-# SPECIALISE (^^) :: Rational -> Int -> Rational #-} (^^) :: (Fractional a, Integral b) => a -> b -> a x ^^ n = if n >= 0 then x^n else recip (x^(negate n))
together with the multiplication and recip for Rationals, I have to say ouch! There's no special implementation and rewrite rule for powers of Rationals, so on each multiplication in (^), the gcd of numerator and denominator is calculated, *although as powers of the original numerator and denominator they are guaranteed to be coprime*. Considering how slow a division of Integers is, awwwwwww noooooo.
So let's look at a better implementation of toRational:
toRat :: RealFloat a => a -> Rational toRat x = case decodeFloat x of (m,e) -> case floatRadix x of b -> if e < 0 then (m % (b^(negate e))) else (m * b^e) :% 1
(inlined a better implementation of powers for Rationals).
Benchmarking convert = fromRational . toRat show a significant improvement, the mean dropped to 2.75 seconds. Still appalling, but it's a nice improvement and I don't see any quick opportunities to improve that conversion.
So let's come to the last, fromRational. That's a compicated function, and unfortunately it has to be and I've no good idea to improve it. fromRational is really evil (in terms of clock cycles). Replacing fromRational with a dummy that just forces the evaluation of its argument and returns NaN, ±Infinity, or 0 for all real Rational values,
dummy . toRational had a mean of 623.5 ms and dummy . toRat had a mean of 200.7 ms.
So toRat is a jolly good improvement over toRational, but it's still awfully slow. And since fromRational takes much much longer anyway, it's a not too impressive gain for realToFrac.
Looking at the RealFloat class rather that Real and Fractional, it seems like this is a much more efficient way to go:
floatToFloat :: (RealFloat a, RealFloat b) => a -> b floatToFloat = (uncurry encodeFloat) . decodeFloat
Yes, that's much more efficient, as witnessed by the benchmark results. But.
I substituted this in for realToFrac and I'm back to close to my original performance. Playing with a few test cases in ghci, it looks numerically equivalent to realToFrac.
This begs the question though--
No. Sorry, but I can't bear that misuse: http://en.wikipedia.org/wiki/Begging_the_question
It raises/demands/invites the question, but it doesn't beg it.
am I doing something dangerous here?
Yes and no.
Why isn't this the standard approach?
Because it will wreak unspeakable havoc when someone creates a RealFloat instance with a floatRadix > 2. A floatRadix of 10 or some power of 2 (16, 256?) could be even reasonable.
But for conversions between RealFloat types with the same floatRadix, it's sort of okay, only it clobbers NaNs and (for some conversions) Infinities. However, realToFrac does that too.
If I understand what's happening, decodeFloat and encodeFloat are breaking the floating point numbers up into their constituent parts-- presumably by bit masking the raw binary.
Probably.
That would explain the performance improvement. I suppose there is some implementation dependence here, but as long as the encode and decode are implemented as a matched set then I think I'm good.
Not entirely matched, for Float -> Float and Double -> Double, NaN -> -Infinity, maybe denormalized values break too.
±Infinity is mapped to a finite value at Float -> Double
But since toRational uses decodeFloat and fromRational uses encodeFloat, floatToFloat is in that respect no worse than realToFrac without rewrite rules.
Cheers-- Greg

On Thursday 16 September 2010 21:47:43, Greg wrote:
Wow, thanks for all the analysis on this, Daniel!
So, I think the summary of your evaluation is that realToFrac does just fine in almost all cases
I wouldn't say that. realToFrac does fine in those cases where a rewrite rule provides a fast conversion (or one of the types bewteen which you want to convert is Rational, when you get fromRational or toRational - the latter can be improved in several cases).
due to careful optimization and outperforms my floatToFloat trick (which answers the question I flagrantly begged... ;~) except that I can't access those optimizations because the OpenGL types are hidden behind newtypes and buried in a library.
I've confirmed your results to make sure I could.
I went a step further and typed convert for GLclampf expecting to see unoptimized performance, but it ran just as fast as Double->Float.
convert :: Double -> GL.GLclampf convert = realToFrac
It would be interesting to see what core GHC produces for that (you can get the core with the `-ddump-simpl' command line flag [redirect stdout to a file] or with the ghc-core tool [available on hackage]). If it runs as fast as realToFrac :: Double -> Float (with optimisations), GHC must have rewritten realToFrac to double2Float# and it should only do that if there are rewrite rules for GLclampf. In that case, the problem is probably that GHC doesn't see the realToFrac applications because they're too deeply wrapped in your coordToCoord2D calls. If that is the problem, it might help to use {-# INLINE #-} pragmas on coordToCoord2D, fromCartesian2D and toCartesian2D. Can you try with realToFrac and the {-# INLINE #-} pragmas?
And still ran faster than floatToFloat. However there's no denying that floatToFloat runs *much* faster than realToFrac in the larger application. Profiling shows floatToFloat taking about 50% of my CPU
That's too much for my liking, a simple conversion from Double to Float shouldn't take long, even if the Float is wrapped in newtypes (after all, the newtypes don't exist at runtime).
time, but the frame rate is close to (though not quite) 30 fps. Using realToFrac takes seconds just to display the first frame.
As it stands, floatToFloat is responsible for 50% of my CPU time which I think is mainly a tribute the OpenGL implementation, but still makes it the obvious target for optimization. Having gotten all excited about the new tool in my box, I wrote the following rule:
{-# RULES "floatToFloat/id" floatToFloat=id "floatToFloat x2" floatToFloat . floatToFloat = floatToFloat #-}
Neither of which seems to fires in this application,
GHC reports fired rules with -ddump-simpl-stats. Getting rules to fire is a little brittle, GHC does not try too hard to match expressions with rules, and if several rules match, it chooses one arbitrarily, so your rules may have been missed because the actual code looked different (perhaps because other rewrite rules fired first).
but I did get the first one to fire by importing the same file into my benchmark.
The next obvious step is to optimize a level up in the call hierarchy, and rewrite coordToCoord2D since I know my original pair2vertex function was faster. So, I added this rule in the file where Vertex2 is made an instance of Coord2D:
{-# RULES "coordToCoord2D/p2v2" coordToCoord2D = pair2vertex #-}
which refers to two functions, each in different files (I don't think that matters, but mention it just in case)
pair2vertex :: (Num a) => (a,a) -> GL.Vertex2 a pair2vertex (x,y) = GL.Vertex2 x y
coordToCoord2D :: (Coord2D a, Coord2D b) => a -> b coordToCoord2D = fromCartesian2D . toCartesian2D
directly after my coordToCoord2D definition I have this rule as well:
{-# RULES "coordToCoord2D/id" coordToCoord2D = id "coordToCoord2D x2" coordToCoord2D . coordToCoord2D = coordToCoord2D #-}
I get a compile time error that I can't make sense of. It's asking me to put a context on my rule, but I can't find any references on how to do that...
-----------
Could not deduce (Num a) from the context (Coord2D (Vertex2 a), Coord2D (a, a)) arising from a use of `pair2vertex' at GCB/OpenGL/Geometry.hs:32:40-50 Possible fix: add (Num a) to the context of the RULE "coordToCoord2D/p2v2" In the expression: pair2vertex When checking the transformation rule "coordToCoord2D/p2v2"
-----------
The file:line:column is the "pair2vertex" token in the rule I list above.
Yes, there's a Num constraint on pair2vertex, but not on coordToCoord2D, so it's not type correct. You could try removing the Num constraint from pair2vertex or add the constraint to the rule, {-# RULES "coordToCoord2D/p2v2" forall a. Num a => (coordToCoord2D :: (a,a) -> GL.Vertex2 a) = pair2vertex #-} (well, I don't know whether that's the correct way, but you can try).
Cheers-- Greg

It would be interesting to see what core GHC produces for that (you can get the core with the `-ddump-simpl' command line flag [redirect stdout to a file] or with the ghc-core tool [available on hackage]). If it runs as fast as realToFrac :: Double -> Float (with optimisations), GHC must have rewritten realToFrac to double2Float# and it should only do that if there are rewrite rules for GLclampf.
I'm not sure if you literally meant you wanted to see the output or not, but I've attached a zip of the dump files and my simple source file. The dump file naming is cryptic, but the first letters refer to the definition of 'convert' where: fTF: use the floatToFloat function in the source file rTF: use the standard realToFrac fRtR: use (fromRational . toRational) The next three characters indicate the type signature of convert: d2f: Double -> Float d2g: Double -> GL.GLclampf I'd summarize the results, but apparently I took the blue pill and can't make heads or tails of what I'm seeing in the dump format...
In that case, the problem is probably that GHC doesn't see the realToFrac applications because they're too deeply wrapped in your coordToCoord2D calls.
If that is the problem, it might help to use {-# INLINE #-} pragmas on coordToCoord2D, fromCartesian2D and toCartesian2D. Can you try with realToFrac and the {-# INLINE #-} pragmas?
I tried inlining the functions you suggest with little effect. The realToFrac version (in this case I just set floatToFloat=realToFrac to save the search and replace effort) is just too heavily loaded to see any difference at all (98+% of CPU is spent in realToFrac). The same inlining using my definition of floatToFloat gave me a 10% improvement from 50% -> 46% of the CPU spent in floatToFloat and an inverse change in allocation to match. Best I can tell, the inlining is being recognized, but just not changing much.
And still ran faster than floatToFloat. However there's no denying that floatToFloat runs *much* faster than realToFrac in the larger application. Profiling shows floatToFloat taking about 50% of my CPU
That's too much for my liking, a simple conversion from Double to Float shouldn't take long, even if the Float is wrapped in newtypes (after all, the newtypes don't exist at runtime).
Agreed. The rest of the application right now isn't doing a lot of work yet though-- I'm generating (pre-calculating, if Haskell is doing it's job) a list of 360*180 points on a sphere and dumping that to OpenGL which should be doing most of the dirty work in hardware. I'm not entirely sure why floatToFloat recalculates every iteration and isn't just cached, but I'm guessing it's because the floatToFloat is being done in an OpenGL callback within the IO monad. Eventually I'll be providing time-varying data anyway, so the conversions will have to be continuously recalculated in the end. That comes out to 65000 conversions every 30ms, or about 2 million conversions a second. I'd probably just leave it at that except, as you've demonstrated, there is at least a factor of 3 or 4 to be gained somehow-- realToFrac can provide it under the right conditions.
{-# RULES "floatToFloat/id" floatToFloat=id "floatToFloat x2" floatToFloat . floatToFloat = floatToFloat #-}
Neither of which seems to fires in this application,
GHC reports fired rules with -ddump-simpl-stats. Getting rules to fire is a little brittle, GHC does not try too hard to match expressions with rules, and if several rules match, it chooses one arbitrarily, so your rules may have been missed because the actual code looked different (perhaps because other rewrite rules fired first).
Yeah, I've been looking at the -ddump-simp-stats output. If I'm reading the documentation right, rules are enabled simply by invoking ghc with -O or -O2, right? I'm now not convinced any of my rewrite rules are firing-- or at least I can't seem to get them to again.
I get a compile time error that I can't make sense of. It's asking me to put a context on my rule, but I can't find any references on how to do that...
-----------
Could not deduce (Num a) from the context (Coord2D (Vertex2 a), Coord2D (a, a)) arising from a use of `pair2vertex' at GCB/OpenGL/Geometry.hs:32:40-50 Possible fix: add (Num a) to the context of the RULE "coordToCoord2D/p2v2" In the expression: pair2vertex When checking the transformation rule "coordToCoord2D/p2v2"
Yes, there's a Num constraint on pair2vertex, but not on coordToCoord2D, so it's not type correct. You could try removing the Num constraint from pair2vertex or add the constraint to the rule,
{-# RULES "coordToCoord2D/p2v2" forall a. Num a => (coordToCoord2D :: (a,a) -> GL.Vertex2 a) = pair2vertex #-}
(well, I don't know whether that's the correct way, but you can try).
No, that doesn't do it. I tried a few variations on that and it always chokes on the => symbol or whatever other syntax I try to use. The Num constraint was added because it was needed on related functions (3 element vertices where the z was stuffed with 0, for example), so I got rid of those and the Num constraint. Doesn't matter, the rule still doesn't fire... =( Cheers-- Greg

On Sunday 19 September 2010 02:41:36, Greg wrote:
It would be interesting to see what core GHC produces for that (you can get the core with the `-ddump-simpl' command line flag [redirect stdout to a file] or with the ghc-core tool [available on hackage]). If it runs as fast as realToFrac :: Double -> Float (with optimisations), GHC must have rewritten realToFrac to double2Float# and it should only do that if there are rewrite rules for GLclampf.
I'm not sure if you literally meant you wanted to see the output or not,
Yes, but only if you were willing to take the trouble of producing it. I actually was more interested in the core for the real app, but the core for the toy benchmark is already interesting (see below).
but I've attached a zip of the dump files and my simple source file. The dump file naming is cryptic, but the first letters refer to the definition of 'convert' where:
fTF: use the floatToFloat function in the source file rTF: use the standard realToFrac fRtR: use (fromRational . toRational)
The next three characters indicate the type signature of convert:
d2f: Double -> Float d2g: Double -> GL.GLclampf
I'd summarize the results, but apparently I took the blue pill and can't make heads or tails of what I'm seeing in the dump format...
Okay, for the results for the Double -> Float conversion, fromRational . toRational took ~3.35 seconds floatToFloat took 32 ms realToFrac took 8 ms (always compiled with -O2; the times are slightly higher than the criterion benchmarking results from Wednesday/Thursday, that's probably because those ran pre-warmed while today's run-once started up cold [and included a call to getCPUTime]). Now to GLclampf. I remembered that I had installed OpenGL with one of my old GHCs (turned out to be 6.10.3), so I could also run the tests for Double -> GLclampf. Unsurprisingly, fromRational . toRational and floatToFloat had the same performance as for Double -> Float. Equally unsurprisingly, were it not for your results and the core you sent, realToFrac had the same performance as fromRational . toRational. In the core you sent for realToFrac :: Double -> GLclampf, we find the loop for summing a list of GLclampf: Rec { $wlgo_r1wv :: GHC.Prim.Float# -> [Graphics.Rendering.OpenGL.GL.BasicTypes.GLclampf] -> GHC.Prim.Float# GblId [Arity 2 NoCafRefs Str: DmdType LS] $wlgo_r1wv = \ (ww_s1vV :: GHC.Prim.Float#) (w_s1vX :: [Graphics.Rendering.OpenGL.GL.BasicTypes.GLclampf]) -> case w_s1vX of _ { [] -> ww_s1vV; : x_aVE xs_aVF -> case x_aVE of _ { GHC.Types.F# y_a13F -> $wlgo_r1wv (GHC.Prim.plusFloat# ww_s1vV y_a13F) xs_aVF } } end Rec } Wow, did you remove the casting annotations or does it really match a GLclampf against the Float constructor F# without any ado? If the latter, which compiler version have you? Just for the record, 6.10.3 produced the same code, but with several levels of casting from Float to GLclampf. More interesting is the generation of the list: Rec { go_r1wx :: GHC.Prim.Int# -> [Graphics.Rendering.OpenGL.GL.BasicTypes.GLclampf] GblId [Arity 1 NoCafRefs Str: DmdType L] go_r1wx = \ (x_a13o :: GHC.Prim.Int#) -> GHC.Types.: @ Graphics.Rendering.OpenGL.GL.BasicTypes.GLclampf (case GHC.Prim./## 1.0 (GHC.Prim.int2Double# x_a13o) of wild2_a14i { __DEFAULT -> GHC.Types.F# (GHC.Prim.double2Float# wild2_a14i) }) (case x_a13o of wild_B1 { __DEFAULT -> go_r1wx (GHC.Prim.+# wild_B1 1); 100000 -> GHC.Types.[] @ Graphics.Rendering.OpenGL.GL.BasicTypes.GLclampf }) end Rec } Wowwowwow, it conses a Float to a list of GLclampf without even mentioning a cast. Since it feels free to do that, no wonder that it uses double2Float#. Hrm, okay, perhaps a new version of OpenGL[Raw]? Nope, 2.4.0.1 and 1.1.0.1, what I have with 6.10.3. So, perhaps it's 6.12 vs. 6.10? Install OpenGL for 6.12.3, try, nope, same as 6.10.3, the summing is identical except for the casting annotations, but the generation goes through fromRational and toRational [expected, because there are no rewrite rules in OpenGLRaw]. What compiler are you using? HEAD? The core doesn't look like HEAD's core to me, but that might be because nothing except main is exported. Okay, so I threw a couple of rewrite rules into OpenGLRaw, reinstalled and reran, now realToFrac gets properly rewritten to double2Float# (with casts).
In that case, the problem is probably that GHC doesn't see the realToFrac applications because they're too deeply wrapped in your coordToCoord2D calls.
Okay, your compiler *does* rewrite realToFrac :: Double -> GLclampf to double2Float#, at least when the situation is simple enough, although there are no rewrite rules in the package for that. Looks like a fortuitous bug. But it doesn't do the rewriting in the real app, so it's probably indeed too deeply wrapped there.
If that is the problem, it might help to use {-# INLINE #-} pragmas on coordToCoord2D, fromCartesian2D and toCartesian2D. Can you try with realToFrac and the {-# INLINE #-} pragmas?
I tried inlining the functions you suggest with little effect. The realToFrac version (in this case I just set floatToFloat=realToFrac to save the search and replace effort) is just too heavily loaded to see any difference at all (98+% of CPU is spent in realToFrac). The same inlining using my definition of floatToFloat gave me a 10% improvement from 50% -> 46% of the CPU spent in floatToFloat and an inverse change in allocation to match.
Best I can tell, the inlining is being recognized, but just not changing much.
Looking at the Coord stuff more closely, you'd probably need much more inlining to get a good effect. And you probably need a bit more strictness too. ============================================================ --Coord2D is a typeclass I created to hold 2D data data Cartesian2D a = Cartesian2D a a deriving (Show, Eq, Read) -- Needs testing, but I suspect {- data Cartesian2D a = Cartesian2D !a !a deriving (...) or even data Cartesian2D a = Cartesian2D {-# UNPACK #-} !a {-# UNPACK #-} !a deriving (...) -} -- would have a beneficial effect. {- Pair instances -} instance (RealFloat a, RealFloat b) => Coord2D (a,b) where xComponent = realToFrac . fst yComponent = realToFrac . snd fromCartesian2D p = ((xComponent p),(yComponent p)) -- That might be too lazy, perhaps {- xComponent (x,_) = realToFrac x yComponent (_,y) = realToFrac y fromCartesian2D (Cartesian2D x y) = (x,y) -} -- will be better -- anyhow, maybe you need to inline all methods of Coord2D to get the rules to fire: class Coord2D a where {-# INLINE xComponent #-} xComponent :: (RealFloat b) => a -> b {-# INLINE yComponent #-} yComponent :: (RealFloat b) => a -> b {-# INLINE toCartesian2D #-} toCartesian2D :: (RealFloat b) => a -> Cartesian2D b toCartesian2D p = Cartesian2D (xComponent p) (yComponent p) {-# INLINE fromCartesian2D #-} fromCartesian2D :: (RealFloat b) => Cartesian2D b -> a -- I'm rather convinced inlining the component functions will be good, but -- there's a good chance that they're small enough to be inlined anyway. -- The inlining of the to/fromCratesian2D functions is doubtful, because --and this function allows conversion between coordinate representations coordToCoord2D :: (Coord2D a, Coord2D b) => a -> b coordToCoord2D = fromCartesian2D . toCartesian2D -- cries loudly for {-# RULES "toCart/fromCart" forall p. toCartesian2D (fromCartesian2D p) = p #-} -- whenever that's possible -- so, perhaps first try to rewrite, whenever that's possible, afterwards inline, hence -- {-# INLINE [2] toCartesian2D #-} -- {-# INLINE [2] fromCartesian2D #-} -- {-# RULES -- "toCart/fromCart" [~2] forall p. toCartesian (fromCartesian p) = p -- #-} -- dunno whether that works, but -ddump-simpl-stats should tell ============================================================ Finally, there's one other thing to try, with or without rules/inlining: coordToVertex2 :: Coord2D a => a -> (GL.Vertex2 GL.GLclampf) coordToVertex2 = coordToCoord2D GLclampf is a newtype wrapper around a newtype wrapper around Float. Coercing between newtype and original is supposed to be safe, so import Unsafe.Coerce floatToGLclampf :: Float -> GL.GLclampf floatToGLclampf = unsafeCoerce coordToVertex2 c = case coordToCoord2D c of (x,y) -> GL.Vertex2 (floatToGLclampf x) (floatToGLclampf y) That way, we circumvent a potentially expensive call to realToFrac :: a -> GLclampf for a = Double or a = Float and split it into a no-op (unsafeCoerce) and a hopefully cheap conversion to Float.
And still ran faster than floatToFloat. However there's no denying that floatToFloat runs *much* faster than realToFrac in the larger application. Profiling shows floatToFloat taking about 50% of my CPU
That's too much for my liking, a simple conversion from Double to Float shouldn't take long, even if the Float is wrapped in newtypes (after all, the newtypes don't exist at runtime).
Agreed. The rest of the application right now isn't doing a lot of work yet though-- I'm generating (pre-calculating, if Haskell is doing it's job) a list of 360*180 points on a sphere and dumping that to OpenGL which should be doing most of the dirty work in hardware. I'm not entirely sure why floatToFloat recalculates every iteration and isn't just cached,
Code? Maybe you have to give a name for it to be cached.
but I'm guessing it's because the floatToFloat is being done in an OpenGL callback within the IO monad. Eventually I'll be providing time-varying data anyway, so the conversions will have to be continuously recalculated in the end.
That comes out to 65000 conversions every 30ms, or about 2 million conversions a second. I'd probably just leave it at that except, as you've demonstrated, there is at least a factor of 3 or 4 to be gained somehow-- realToFrac can provide it under the right conditions.
{-# RULES "floatToFloat/id" floatToFloat=id "floatToFloat x2" floatToFloat . floatToFloat = floatToFloat #-}
I'm not sure how the rule-spotting works with compositions, whether it matches `foo . bar' with `foo (bar x)' [one in the code, the other in the rule], it might be necessary to give the rule in both forms.
Neither of which seems to fires in this application,
GHC reports fired rules with -ddump-simpl-stats. Getting rules to fire is a little brittle, GHC does not try too hard to match expressions with rules, and if several rules match, it chooses one arbitrarily, so your rules may have been missed because the actual code looked different (perhaps because other rewrite rules fired first).
Yeah, I've been looking at the -ddump-simp-stats output. If I'm reading the documentation right, rules are enabled simply by invoking ghc with -O or -O2, right?
Right, -O implies -fenable-rewrite-rules (and hence -O2 too). On the other hand, you can't have rewrite-rules without -O [that is, you can pass -fenable-rwerite-rules on the command line without -O, it will just have no effect]. Presumably the flag exists for its negation, so you can invoke GHC with -O -fno-enable-rewrite-rules to have the rules not firing.
I'm now not convinced any of my rewrite rules are firing-- or at least I can't seem to get them to again.
If they fire, -ddump-simpl-stats tells you, there's a piece like 9 RuleFired 1 ==#->case 1 ># 1 eftInt 1 fold/build 1 fromIntegral/Int->Double 1 int2Float# 1 realToFrac/Double->Float 1 unpack 1 unpack-list in the dump, if it contains the name of one of your rules, it fired n times, otherwise it didn't fire.
No, that doesn't do it. I tried a few variations on that and it always chokes on the => symbol or whatever other syntax I try to use. The Num constraint was added because it was needed on related functions (3 element vertices where the z was stuffed with 0, for example), so I got rid of those and the Num constraint. Doesn't matter, the rule still doesn't fire... =(
Might have been inlined before the rule got a chance to fire.
Cheers-- Greg
participants (2)
-
Daniel Fischer
-
Greg