
On Sunday 19 September 2010 02:41:36, Greg wrote:
It would be interesting to see what core GHC produces for that (you can get the core with the `-ddump-simpl' command line flag [redirect stdout to a file] or with the ghc-core tool [available on hackage]). If it runs as fast as realToFrac :: Double -> Float (with optimisations), GHC must have rewritten realToFrac to double2Float# and it should only do that if there are rewrite rules for GLclampf.
I'm not sure if you literally meant you wanted to see the output or not,
Yes, but only if you were willing to take the trouble of producing it. I actually was more interested in the core for the real app, but the core for the toy benchmark is already interesting (see below).
but I've attached a zip of the dump files and my simple source file. The dump file naming is cryptic, but the first letters refer to the definition of 'convert' where:
fTF:  use the floatToFloat function in the source file
rTF:  use the standard realToFrac
fRtR: use (fromRational . toRational)
The next three characters indicate the type signature of convert:
d2f: Double -> Float
d2g: Double -> GL.GLclampf
I'd summarize the results, but apparently I took the blue pill and can't make heads or tails of what I'm seeing in the dump format...
Okay, for the results for the Double -> Float conversion,

    fromRational . toRational took ~3.35 seconds
    floatToFloat took 32 ms
    realToFrac took 8 ms

(always compiled with -O2; the times are slightly higher than the criterion benchmarking results from Wednesday/Thursday, that's probably because those ran pre-warmed while today's run-once started up cold [and included a call to getCPUTime]).

Now to GLclampf. I remembered that I had installed OpenGL with one of my old GHCs (turned out to be 6.10.3), so I could also run the tests for Double -> GLclampf. Unsurprisingly, fromRational . toRational and floatToFloat had the same performance as for Double -> Float. Equally unsurprisingly, were it not for your results and the core you sent, realToFrac had the same performance as fromRational . toRational.

In the core you sent for realToFrac :: Double -> GLclampf, we find the loop for summing a list of GLclampf:

    Rec {
    $wlgo_r1wv :: GHC.Prim.Float#
                  -> [Graphics.Rendering.OpenGL.GL.BasicTypes.GLclampf]
                  -> GHC.Prim.Float#
    GblId
    [Arity 2
     NoCafRefs
     Str: DmdType LS]
    $wlgo_r1wv =
      \ (ww_s1vV :: GHC.Prim.Float#)
        (w_s1vX :: [Graphics.Rendering.OpenGL.GL.BasicTypes.GLclampf]) ->
        case w_s1vX of _ {
          [] -> ww_s1vV;
          : x_aVE xs_aVF ->
            case x_aVE of _ { GHC.Types.F# y_a13F ->
            $wlgo_r1wv (GHC.Prim.plusFloat# ww_s1vV y_a13F) xs_aVF
            }
        }
    end Rec }

Wow, did you remove the casting annotations or does it really match a GLclampf against the Float constructor F# without any ado? If the latter, which compiler version have you? Just for the record, 6.10.3 produced the same code, but with several levels of casting from Float to GLclampf.
More interesting is the generation of the list:

    Rec {
    go_r1wx :: GHC.Prim.Int#
               -> [Graphics.Rendering.OpenGL.GL.BasicTypes.GLclampf]
    GblId
    [Arity 1
     NoCafRefs
     Str: DmdType L]
    go_r1wx =
      \ (x_a13o :: GHC.Prim.Int#) ->
        GHC.Types.:
          @ Graphics.Rendering.OpenGL.GL.BasicTypes.GLclampf
          (case GHC.Prim./## 1.0 (GHC.Prim.int2Double# x_a13o)
           of wild2_a14i { __DEFAULT ->
           GHC.Types.F# (GHC.Prim.double2Float# wild2_a14i)
           })
          (case x_a13o of wild_B1 {
             __DEFAULT -> go_r1wx (GHC.Prim.+# wild_B1 1);
             100000 ->
               GHC.Types.[] @ Graphics.Rendering.OpenGL.GL.BasicTypes.GLclampf
           })
    end Rec }

Wowwowwow, it conses a Float to a list of GLclampf without even mentioning a cast. Since it feels free to do that, no wonder that it uses double2Float#.

Hrm, okay, perhaps a new version of OpenGL[Raw]? Nope, 2.4.0.1 and 1.1.0.1, what I have with 6.10.3. So, perhaps it's 6.12 vs. 6.10? Install OpenGL for 6.12.3, try, nope, same as 6.10.3, the summing is identical except for the casting annotations, but the generation goes through fromRational and toRational [expected, because there are no rewrite rules in OpenGLRaw].

What compiler are you using? HEAD? The core doesn't look like HEAD's core to me, but that might be because nothing except main is exported.

Okay, so I threw a couple of rewrite rules into OpenGLRaw, reinstalled and reran, now realToFrac gets properly rewritten to double2Float# (with casts).
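For readers without the attached zip, here is a minimal sketch of what a toy benchmark of this shape might look like. It is reconstructed from the Core above (a list of 1/x values running up to 100000, converted and then summed), not taken from Greg's source file. The names viaRealToFrac, viaRational and sumVia are made up for illustration, the starting index of 1 is an assumption, and Float stands in for GLclampf so that no OpenGL package is needed. floatToFloat is left out because its definition never appears in the thread.

```haskell
module Main (main) where

import Data.List (foldl')

-- Conversion via the class-polymorphic realToFrac. With -O, GHC may
-- rewrite this to a primitive conversion if a matching rule is in scope.
viaRealToFrac :: Double -> Float
viaRealToFrac = realToFrac

-- Conversion forced through Rational, i.e. the slow path from the thread.
viaRational :: Double -> Float
viaRational = fromRational . toRational

-- Hypothetical helper: sum the converted values 1/1, 1/2, ..., 1/n.
-- The starting index 1 is an assumption; the Core only shows the end.
sumVia :: (Double -> Float) -> Int -> Float
sumVia convert n =
  foldl' (+) 0 [convert (1 / fromIntegral x) | x <- [1 .. n]]

main :: IO ()
main = do
  print (sumVia viaRealToFrac 100000)
  print (sumVia viaRational 100000)
```

Compiling this with -O2 -ddump-simpl and searching the output for double2Float# should show whether the realToFrac variant was rewritten; timing the two calls separately would be the obvious next step.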
In that case, the problem is probably that GHC doesn't see the realToFrac applications because they're too deeply wrapped in your coordToCoord2D calls.
Okay, your compiler *does* rewrite realToFrac :: Double -> GLclampf to double2Float#, at least when the situation is simple enough, although there are no rewrite rules in the package for that. Looks like a fortuitous bug. But it doesn't do the rewriting in the real app, so it's probably indeed too deeply wrapped there.
If that is the problem, it might help to use {-# INLINE #-} pragmas on coordToCoord2D, fromCartesian2D and toCartesian2D. Can you try with realToFrac and the {-# INLINE #-} pragmas?
I tried inlining the functions you suggest with little effect. The realToFrac version (in this case I just set floatToFloat=realToFrac to save the search and replace effort) is just too heavily loaded to see any difference at all (98+% of CPU is spent in realToFrac). The same inlining using my definition of floatToFloat gave me a 10% improvement from 50% -> 46% of the CPU spent in floatToFloat and an inverse change in allocation to match.
Best I can tell, the inlining is being recognized, but just not changing much.
Looking at the Coord stuff more closely, you'd probably need much more inlining to get a good effect. And you probably need a bit more strictness too.

============================================================
--Coord2D is a typeclass I created to hold 2D data

data Cartesian2D a = Cartesian2D a a deriving (Show, Eq, Read)

-- Needs testing, but I suspect
{-
data Cartesian2D a = Cartesian2D !a !a deriving (...)

or even

data Cartesian2D a = Cartesian2D {-# UNPACK #-} !a {-# UNPACK #-} !a
    deriving (...)
-}
-- would have a beneficial effect.

{- Pair instances -}
instance (RealFloat a, RealFloat b) => Coord2D (a,b) where
  xComponent = realToFrac . fst
  yComponent = realToFrac . snd
  fromCartesian2D p = ((xComponent p),(yComponent p))
-- That might be too lazy, perhaps
{-
  xComponent (x,_) = realToFrac x
  yComponent (_,y) = realToFrac y
  fromCartesian2D (Cartesian2D x y) = (x,y)
-}
-- will be better

-- anyhow, maybe you need to inline all methods of Coord2D to get the
-- rules to fire:

class Coord2D a where
  {-# INLINE xComponent #-}
  xComponent :: (RealFloat b) => a -> b
  {-# INLINE yComponent #-}
  yComponent :: (RealFloat b) => a -> b
  {-# INLINE toCartesian2D #-}
  toCartesian2D :: (RealFloat b) => a -> Cartesian2D b
  toCartesian2D p = Cartesian2D (xComponent p) (yComponent p)
  {-# INLINE fromCartesian2D #-}
  fromCartesian2D :: (RealFloat b) => Cartesian2D b -> a

-- I'm rather convinced inlining the component functions will be good, but
-- there's a good chance that they're small enough to be inlined anyway.
-- The inlining of the to/fromCartesian2D functions is doubtful, because

--and this function allows conversion between coordinate representations
coordToCoord2D :: (Coord2D a, Coord2D b) => a -> b
coordToCoord2D = fromCartesian2D . toCartesian2D

-- cries loudly for
{-# RULES
"toCart/fromCart"  forall p. toCartesian2D (fromCartesian2D p) = p
  #-}
-- whenever that's possible
-- so, perhaps first try to rewrite, whenever that's possible, afterwards
-- inline, hence
-- {-# INLINE [2] toCartesian2D #-}
-- {-# INLINE [2] fromCartesian2D #-}
-- {-# RULES
-- "toCart/fromCart" [~2] forall p. toCartesian2D (fromCartesian2D p) = p
--   #-}
-- dunno whether that works, but -ddump-simpl-stats should tell
============================================================

Finally, there's one other thing to try, with or without rules/inlining:

coordToVertex2 :: Coord2D a => a -> (GL.Vertex2 GL.GLclampf)
coordToVertex2 = coordToCoord2D

GLclampf is a newtype wrapper around a newtype wrapper around Float. Coercing between newtype and original is supposed to be safe, so

import Unsafe.Coerce

floatToGLclampf :: Float -> GL.GLclampf
floatToGLclampf = unsafeCoerce

coordToVertex2 c = case coordToCoord2D c of
    (x,y) -> GL.Vertex2 (floatToGLclampf x) (floatToGLclampf y)

That way, we circumvent a potentially expensive call to realToFrac :: a -> GLclampf for a = Double or a = Float and split it into a no-op (unsafeCoerce) and a hopefully cheap conversion to Float.
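The coercion trick can be tried out without the OpenGL package. The sketch below uses two made-up newtypes, InnerFloat and Clampf, as stand-ins for the two newtype layers described above; they are not the real OpenGL definitions, and doubleToClampf is a hypothetical helper name.

```haskell
module Main (main) where

import Unsafe.Coerce (unsafeCoerce)

-- Hypothetical stand-ins for "a newtype around a newtype around Float".
-- These are NOT the definitions from the OpenGL packages.
newtype InnerFloat = InnerFloat Float deriving (Eq, Show)
newtype Clampf = Clampf InnerFloat deriving (Eq, Show)

-- No-op at runtime: a newtype shares its representation with the type
-- it wraps, so only the compile-time type changes here.
floatToClampf :: Float -> Clampf
floatToClampf = unsafeCoerce

-- The split from the thread: a hopefully cheap Double -> Float
-- conversion first, then the free coercion.
doubleToClampf :: Double -> Clampf
doubleToClampf = floatToClampf . realToFrac

main :: IO ()
main = print (doubleToClampf 0.25 == Clampf (InnerFloat 0.25))
```

On GHC 7.8 and later, coerce from Data.Coerce offers a type-checked alternative to unsafeCoerce for this kind of newtype unwrapping, provided the newtype constructors are in scope.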
And still ran faster than floatToFloat. However there's no denying that floatToFloat runs *much* faster than realToFrac in the larger application. Profiling shows floatToFloat taking about 50% of my CPU
That's too much for my liking, a simple conversion from Double to Float shouldn't take long, even if the Float is wrapped in newtypes (after all, the newtypes don't exist at runtime).
Agreed. The rest of the application right now isn't doing a lot of work yet though-- I'm generating (pre-calculating, if Haskell is doing its job) a list of 360*180 points on a sphere and dumping that to OpenGL which should be doing most of the dirty work in hardware. I'm not entirely sure why floatToFloat recalculates every iteration and isn't just cached,
Code? Maybe you have to give a name for it to be cached.
but I'm guessing it's because the floatToFloat is being done in an OpenGL callback within the IO monad. Eventually I'll be providing time-varying data anyway, so the conversions will have to be continuously recalculated in the end.
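To illustrate the "give it a name" suggestion: a value bound at top level with no arguments (a CAF) is evaluated at most once and then shared, whereas an expression rebuilt inside a callback is generally recomputed on each call. The sketch below is an assumption about what the point generation might look like; Point3, spherePoints and the one-degree grid are invented for the example, since the real code is not shown in the thread.

```haskell
module Main (main) where

-- Hypothetical point type; the application's real types are not shown.
type Point3 = (Float, Float, Float)

-- A top-level constant (CAF): built lazily on first use, then shared by
-- every later use, so the Double -> Float conversions happen only once.
spherePoints :: [Point3]
spherePoints =
  [ ( realToFrac (sin t * cos p)
    , realToFrac (sin t * sin p)
    , realToFrac (cos t) )
  | lat <- [0 .. 179 :: Int]
  , lon <- [0 .. 359 :: Int]
  , let t = fromIntegral lat * pi / 180 :: Double
        p = fromIntegral lon * pi / 180 :: Double
  ]

main :: IO ()
main = do
  print (length spherePoints)   -- forces the list spine once
  print (take 1 spherePoints)   -- reuses the already-built list
```

The price is that the whole list stays in memory for as long as the name is reachable, and this only helps while the data really is constant; with time-varying data the conversions have to be redone anyway.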
That comes out to 65000 conversions every 30ms, or about 2 million conversions a second. I'd probably just leave it at that except, as you've demonstrated, there is at least a factor of 3 or 4 to be gained somehow-- realToFrac can provide it under the right conditions.
{-# RULES "floatToFloat/id" floatToFloat=id "floatToFloat x2" floatToFloat . floatToFloat = floatToFloat #-}
I'm not sure how the rule-spotting works with compositions, whether it matches `foo . bar' with `foo (bar x)' [one in the code, the other in the rule], it might be necessary to give the rule in both forms.
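A small self-contained sketch of giving one rule in both forms. clamp01 is a hypothetical idempotent function standing in for floatToFloat, chosen so that both rules are monomorphic and obviously semantics-preserving. The NOINLINE [1] pragma is there to keep the function from being inlined before the rules have had a chance to match.

```haskell
module Main (main) where

-- Hypothetical idempotent function, standing in for floatToFloat.
clamp01 :: Float -> Float
clamp01 = max 0 . min 1
{-# NOINLINE [1] clamp01 #-}

-- The same simplification written twice: once for the applied form
-- and once for the point-free composition.
{-# RULES
"clamp01/applied"  forall x. clamp01 (clamp01 x) = clamp01 x
"clamp01/composed" clamp01 . clamp01 = clamp01
  #-}

main :: IO ()
main = do
  print (clamp01 (clamp01 1.5))     -- applied form
  print ((clamp01 . clamp01) (-2))  -- composed form
```

The program prints 1.0 and 0.0 whether or not the rules fire; compiling with -O -ddump-simpl-stats shows which of them, if any, actually did. GHC may warn that the composed rule might never fire because (.) can be inlined first, which is exactly the brittleness discussed here.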
Neither of which seems to fire in this application,
GHC reports fired rules with -ddump-simpl-stats. Getting rules to fire is a little brittle, GHC does not try too hard to match expressions with rules, and if several rules match, it chooses one arbitrarily, so your rules may have been missed because the actual code looked different (perhaps because other rewrite rules fired first).
Yeah, I've been looking at the -ddump-simpl-stats output. If I'm reading the documentation right, rules are enabled simply by invoking ghc with -O or -O2, right?
Right, -O implies -fenable-rewrite-rules (and hence -O2 too). On the other hand, you can't have rewrite-rules without -O [that is, you can pass -fenable-rewrite-rules on the command line without -O, it will just have no effect]. Presumably the flag exists for its negation, so you can invoke GHC with -O -fno-enable-rewrite-rules to have the rules not firing.
I'm now not convinced any of my rewrite rules are firing-- or at least I can't seem to get them to again.
If they fire, -ddump-simpl-stats tells you, there's a piece like

    9 RuleFired
        1 ==#->case
        1 >#
        1 eftInt
        1 fold/build
        1 fromIntegral/Int->Double
        1 int2Float#
        1 realToFrac/Double->Float
        1 unpack
        1 unpack-list

in the dump, if it contains the name of one of your rules, it fired n times, otherwise it didn't fire.
No, that doesn't do it. I tried a few variations on that and it always chokes on the => symbol or whatever other syntax I try to use. The Num constraint was added because it was needed on related functions (3 element vertices where the z was stuffed with 0, for example), so I got rid of those and the Num constraint. Doesn't matter, the rule still doesn't fire... =(
Might have been inlined before the rule got a chance to fire.
Cheers-- Greg