
This is a general problem when working with RULES-based optimisations. Here is an example of what happens: suppose we have
foo :: Vector Int -> Vector Int
foo xs = map (+1) xs
Now, GHC will generate a nice tight loop for this, but if, in a different module, we have something like this:
bar xs = foo (foo xs)
then this won't fuse because (a) foo won't be inlined and (b) even if GHC did inline it here, it would inline the nice tight loop, which can't possibly fuse, instead of the original map, which can. By slapping an INLINE pragma on foo, you're telling GHC to (almost) always inline the function and to use the original definition for inlining, thus giving it a chance to fuse.
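For reference, here is a minimal sketch with the pragma in place (assuming the vector package's Data.Vector, whose map participates in the fusion rules):

    import Data.Vector (Vector)
    import qualified Data.Vector as V

    foo :: Vector Int -> Vector Int
    foo xs = V.map (+1) xs
    {-# INLINE foo #-}

    -- in the other module:
    bar :: Vector Int -> Vector Int
    bar xs = foo (foo xs)  -- both maps can now fuse into a single loop

With the pragma, GHC records foo's original right-hand side and uses that when inlining at call sites, so the rewrite rules can still fire inside bar.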
thanks for the insight, roman!
the downside after adding the INLINE pragmas is that now some of my modules take _really_ long to compile (up to a couple of minutes); any ideas where i can start looking to bring the compilation times down again?
Alas, stream fusion (and fusion in general, I guess) requires what I would call whole loop compilation - you need to inline everything into loops. That tends to be slow. I don't know what your code looks like but you could try to control inlining a bit more. For instance, if you have something like this:
foo ... = ... map f xs ...
  where
    f x = ...
you could tell GHC not to inline f until fairly late in the game by adding
{-# INLINE [0] f #-}
to the where clause. This helps sometimes.
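Spelled out, that looks something like this (a sketch; the body of f is made up for illustration):

    import qualified Data.Vector as V

    foo :: V.Vector Int -> V.Vector Int
    foo xs = V.map f xs
      where
        f x = x * x + 1
        {-# INLINE [0] f #-}  -- only consider inlining f from simplifier phase 0 onwards

Simplifier phases count down (by default 2, 1, 0), so [0] keeps f opaque during the early phases; the intermediate code stays smaller for most of the pipeline, which can help compile times.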
thanks, i'll check it out.
I'm surprised -Odph doesn't produce faster code than -O2. In any case, you could try turning these flags on individually (esp. -fno-method-sharing and the spec-constr flags) to see how they affect performance and compilation times.
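For instance, an invocation along these lines (module name hypothetical; -fspec-constr is the switch for the SpecConstr pass, which -O2 already implies):

    ghc -O2 -fno-method-sharing -fspec-constr Conv.hs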
in the end it turned out that i had forgotten another INLINE pragma, and in my crude benchmarks -O2 and -Odph give basically the same results, -O2 being a little faster. i hope i'll have time next week to do proper benchmarks, and i also want to try ghc HEAD with the llvm patches.

            conv_1  conv_2  conv_3
    -Odph    1.004   2.715   1.096
    -O2      1.000   2.710   1.097

i'm still curious, though, why my three versions of direct convolution perform so differently (see attached file). in particular, i somehow expected conv_3 to be the slowest and conv_2 to perform similarly to conv_1. any ideas? i haven't had a look at the core yet, mainly because i'm lacking the expertise ...

<sk>