
This is a general problem when working with RULES-based optimisations. Here is an example of what happens: suppose we have
foo :: Vector Int -> Vector Int
foo xs = map (+1) xs
Now, GHC will generate a nice tight loop for this, but if, in a different module, we have something like this:
bar xs = foo (foo xs)
then this won't fuse because (a) foo won't be inlined and (b) even if GHC did inline it here, it would inline the nice tight loop, which can't possibly fuse, instead of the original map, which can. By slapping an INLINE pragma on foo, you're telling GHC to (almost) always inline the function and to use the original definition for inlining, thus giving it a chance to fuse.
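For reference, here is a minimal sketch with the pragma in place (assuming the vector package's Data.Vector, whose map participates in the fusion rules):

    import Data.Vector (Vector)
    import qualified Data.Vector as V

    foo :: Vector Int -> Vector Int
    foo xs = V.map (+1) xs
    {-# INLINE foo #-}

    -- in the other module:
    bar :: Vector Int -> Vector Int
    bar xs = foo (foo xs)  -- both maps can now fuse into a single loop

With the pragma, GHC records foo's original right-hand side and uses that when inlining at call sites, so the rewrite rules can still fire inside bar.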
thanks for the insight, roman!
the downside after adding the INLINE pragmas is that now some of my modules take _really_ long to compile (up to a couple of minutes); any ideas where i can start looking to bring the compilation times down again?
Alas, stream fusion (and fusion in general, I guess) requires what I would call whole loop compilation - you need to inline everything into loops. That tends to be slow. I don't know what your code looks like but you could try to control inlining a bit more. For instance, if you have something like this:
foo ... = ... map f xs ...
  where
    f x = ...
you could tell GHC not to inline f until fairly late in the game by adding
{-# INLINE [0] f #-}
to the where clause. This helps sometimes.
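Spelled out, that looks something like this (a sketch; the body of f is made up for illustration):

    import qualified Data.Vector as V

    foo :: V.Vector Int -> V.Vector Int
    foo xs = V.map f xs
      where
        f x = x * x + 1
        {-# INLINE [0] f #-}  -- only consider inlining f from simplifier phase 0 onwards

Simplifier phases count down (by default 2, 1, 0), so [0] keeps f opaque during the early phases; the intermediate code stays smaller for most of the pipeline, which can help compile times.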
thanks, i'll check it out.
I'm surprised -Odph doesn't produce faster code than -O2. In any case, you could try turning these flags on individually (esp. -fno-method-sharing and the spec-constr flags) to see how they affect performance and compilation times.
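For instance, an invocation along these lines (module name hypothetical; -fspec-constr is the switch for the SpecConstr pass, which -O2 already implies):

    ghc -O2 -fno-method-sharing -fspec-constr Conv.hs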
in the end it turned out that i had forgotten another INLINE pragma, and in my crude benchmarks -O2 and -Odph give basically the same results, -O2 being a little faster. i hope i'll have time next week to do proper benchmarks, and i also want to try ghc HEAD with the llvm patches.

            conv_1  conv_2  conv_3
    -Odph    1.004   2.715   1.096
    -O2      1.000   2.710   1.097

i'm still curious, though, why my three versions of direct convolution perform so differently (see attached file). in particular, i somehow expected conv_3 to be the slowest and conv_2 to perform similarly to conv_1. any ideas? i haven't had a look at the core yet, mainly because i'm lacking the expertise ...

<sk>