Hello all,

First of all, I'm sorry that this email is so absurdly long. But it's not easy to explain the problem at hand, so I took a step-by-step approach. The executive summary is: GHC can do a great job with inlining, but often it doesn't, and I don't understand why. So I have some questions, which are highlighted in the text below. In general, any insights regarding inlining or improving the performance of generics are welcome. My final goal is to be able to state that generic functions (in particular using GHC.Generics) will have no runtime overhead whatsoever when compared to a handwritten type-specific version.


The setting

Generic programming is based on representing datatypes in a uniform way using a small set of representation types. Functions defined on those representation types can then be applied to all datatypes, because we can convert between datatypes and their representations.
However, generic functions tend to be slower than their specialised counterparts, because they have to deal with the conversions. But clever inlining (together with other compiler optimisations) can completely remove this overhead. The problem I'm tackling is how to tell GHC exactly what it should in the particular case of optimisation of generic code.


Simplified example

I'll focus on the problem of optimising a non-trivial function for generic enumeration of terms. My experience shows that GHC does quite good at optimising simple functions, especially consumers (like generic equality). But producers are trickier.

First, we'll need some auxiliary functions:

-- | Interleave elements from two lists. Similar to (++), but swap left and
-- right arguments on every recursive application.
--
-- From Mark Jones' talk at AFP2008
{-# NOINLINE (|||) #-}
(|||) :: [a] -> [a] -> [a]
[]     ||| ys = ys
(x:xs) ||| ys = x : ys ||| xs


-- | Diagonalization of nested lists. Ensure that some elements from every
-- sublist will be included. Handles infinite sublists.
--
-- >From Mark Jones' talk at AFP2008
{-# NOINLINE diag #-}
diag :: [[a]] -> [a]
diag = concat . foldr skew [] . map (map (\x -> [x]))

skew :: [[a]] -> [[a]] -> [[a]]
skew []     ys = ys
skew (x:xs) ys = x : combine (++) xs ys

combine :: (a -> a -> a) -> [a] -> [a] -> [a]
combine _ xs     []     = xs
combine _ []     ys     = ys
combine f (x:xs) (y:ys) = f x y : combine f xs ys

The particular implementation of these functions doesn't really matter. What's important is that we have a way to interleave lists (|||) and a way to diagonalise a matrix into a list (diag). We mark these functions as NOINLINE because inlining them will only make the core code more complicated (and may prevent rules from firing).

Suppose we have a type of Peano natural numbers:

data Nat = Ze | Su Nat deriving Eq

Implementing enumeration on this type is simple:

enumNat :: [Nat]
enumNat = [Ze] ||| map Su enumNat

Now, a generic representation of Nat in terms of sums and products could look something like this:

type RepNat = Either () Nat

That is, either a singleton (for the Ze case) or a Nat (for the Su case). Note that I am building a shallow representation, since at the leaves we have Nat, and not RepNat. This mimics the situation with current generic programming libraries (in particular GHC.Generics).

We'll need a way to convert between RepNat and Nat:

toNat :: RepNat -> Nat
toNat (Left ()) = Ze
toNat (Right n) = Su n

fromNat :: Nat -> RepNat
fromNat Ze = Left ()
fromNat (Su n) = Right n

(In fact, since we're only dealing with a generic producer we won't need the fromNat function.)

To get an enumeration for RepNat, we first need to know how to enumerate units and sums:

enumU :: [()]
enumU = [()]

enumEither :: [a] -> [b] -> [Either a b]
enumEither ea eb = map Left ea ||| map Right eb

Now we can define an enumeration for RepNat:

enumRepNat :: [RepNat]
enumRepNat = enumEither enumU enumNatFromRep

With the conversion function toNat, we can use enumRepNat to get an enumeration for Nat directly:

enumNatFromRep :: [Nat]
enumNatFromRep = map toNat enumRepNat

First, convince yourself that enumNatFromRep and enumNat are equivalent functions:

take 100 enumNat == take 100 enumNatFromRep

Now, what I want is that enumNatFromRep generates the same core code as enumNat. That should be possible; here are the necessary steps:

  map toNat enumRepNat

== { inline enumRepNat }

  map toNat (enumEither enumU enumNatFromRep)

== { inline enumEither }

  map toNat (map Left enumU ||| map Right enumNatFromRep)

== { inline enumU }

  map toNat (map Left [()] ||| map Right enumNatFromRep)

== { inline map }

  map toNat ([Left ()] ||| map Right enumNatFromRep)

== { free theorem (|||): forall f a b. map f (a ||| b) = map f a ||| map f b }

  map toNat [Left ()] ||| map toNat (map Right enumNatFromRep)

== { inline map }

  [toNat (Left ())] ||| map toNat (map Right enumNatFromRep)

== { definition of toNat (or inline toNat + case of constant) }

  [Ze] ||| map toNat (map Right enumNatFromRep)

== { functor composition law: forall f g l. map f (map g l) = map (f . g) l }

  [Ze] ||| map (toNat . Right) enumNatFromRep

== { definition of toNat (or inline toNat + case of constant) }

  [Ze] ||| map Su enumNatFromRep

Now let's see what the compiler generates. I'm using GHC-7.4.1. Let's compile with -O1 and use -ddump-simpl to see the final simplifier output (core code) for enumNatFromRep:

EnumAlone.enumNatFromRep :: [EnumAlone.Nat]
[GblId,
 Str=DmdType,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=0, Value=False,
         ConLike=False, Cheap=False, Expandable=False,
         Guidance=IF_ARGS [] 30 0}]
EnumAlone.enumNatFromRep =
  GHC.Base.map
    @ EnumAlone.RepNat
    @ EnumAlone.Nat
    EnumAlone.toNat
    EnumAlone.enumRepNat

EnumAlone.enumRepNat [Occ=LoopBreaker] :: [EnumAlone.RepNat]
[GblId, Str=DmdType]
EnumAlone.enumRepNat =
  EnumAlone.|||
    @ (Data.Either.Either () EnumAlone.Nat) lvl4_rvV lvl5_rvW

Ah, it didn't even inline enumRepNat because it made it a loop breaker. We certainly want to inline it, so let's add a pragma:

{-# INLINE enumRepNat #-}

Recompiling, we get:

EnumAlone.enumRepNat [InlPrag=INLINE (sat-args=0)]
  :: [EnumAlone.RepNat]
[GblId,
 Str=DmdType,
 Unf=Unf{Src=InlineStable, TopLvl=True, Arity=0, Value=False,
         ConLike=False, Cheap=False, Expandable=False,
         Guidance=ALWAYS_IF(unsat_ok=False,boring_ok=False)
         Tmpl= EnumAlone.enumEither
                 @ () @ EnumAlone.Nat EnumAlone.enumU EnumAlone.enumNatFromRep}]
EnumAlone.enumRepNat =
  EnumAlone.|||
    @ (Data.Either.Either () EnumAlone.Nat) lvl4_rvV lvl5_rvW

EnumAlone.enumNatFromRep [Occ=LoopBreaker] :: [EnumAlone.Nat]
[GblId, Str=DmdType]
EnumAlone.enumNatFromRep =
  GHC.Base.map
    @ EnumAlone.RepNat
    @ EnumAlone.Nat
    EnumAlone.toNat
    EnumAlone.enumRepNat

So no real difference in the generated code, other than the reassignment of loop breakers. For some reason enumRepNat still doesn't get inlined.

Question: why won't GHC inline enumRepNat, even when I tell it to do so with an INLINE pragma?

Well, let's inline it ourselves, then. We redefine enumNatFromRep to:

enumNatFromRep = map toNat (enumEither enumU enumNatFromRep)

This however doesn't help much. We get the following core:

lvl6_rw5 :: [Data.Either.Either () EnumAlone.Nat]
[GblId]
lvl6_rw5 =
  EnumAlone.|||
    @ (Data.Either.Either () EnumAlone.Nat) lvl4_rw3 lvl5_rw4

EnumAlone.enumNatFromRep [Occ=LoopBreaker] :: [EnumAlone.Nat]
[GblId, Str=DmdType]
EnumAlone.enumNatFromRep =
  GHC.Base.map
    @ EnumAlone.RepNat @ EnumAlone.Nat EnumAlone.toNat lvl6_rw5

GHC is really keen on floating that (|||) out. Let's be very explicit about inlining:

{-# INLINE toNat #-}
{-# INLINE enumU #-}
{-# INLINE enumEither #-}

Also, maybe it's floating it out because it doesn't have anything else to do to it. Let's add the free theorem of (|||) as a rule:

{-# RULES "ft |||" forall f a b. map f (a ||| b) = map f a ||| map f b #-}

We needed this in our manual derivation, so GHC should need it too. Recompiling, we see we've made some progress:

lvl5_ryv :: [Data.Either.Either () EnumAlone.Nat]
[GblId]
lvl5_ryv =
  GHC.Base.map
    @ EnumAlone.Nat
    @ (Data.Either.Either () EnumAlone.Nat)
    (Data.Either.Right @ () @ EnumAlone.Nat)
    EnumAlone.enumNatFromRep

lvl6_ryw :: [EnumAlone.Nat]
[GblId]
lvl6_ryw =
  GHC.Base.map
    @ (Data.Either.Either () EnumAlone.Nat)
    @ EnumAlone.Nat
    EnumAlone.toNat
    lvl5_ryv

EnumAlone.enumNatFromRep [Occ=LoopBreaker] :: [EnumAlone.Nat]
[GblId, Str=DmdType]
EnumAlone.enumNatFromRep =
  EnumAlone.||| @ EnumAlone.Nat lvl4_ryu lvl6_ryw

enumNatFromRep finally starts with (|||) directly. But its second argument, lvl6_ryw, is a map of lvl5_ryv, which is itself a map! At this stage I expected GHC to be aware of the fusion law for map, but it seems that it isn't.

Question: why is map fusion not happening automatically?

Let's add it as a rule:

{-# RULES "map/map1" forall f g l. map f (map g l) = map (f . g) l #-}

And now we're in a much better situation:

lvl3_ryD :: [Data.Either.Either () EnumAlone.Nat]
[GblId, Caf=NoCafRefs]
lvl3_ryD =
  GHC.Types.:
    @ (Data.Either.Either () EnumAlone.Nat)
    EnumAlone.fromNat1
    (GHC.Types.[] @ (Data.Either.Either () EnumAlone.Nat))

lvl4_ryE :: [EnumAlone.Nat]
[GblId]
lvl4_ryE =
  GHC.Base.map
    @ (Data.Either.Either () EnumAlone.Nat)
    @ EnumAlone.Nat
    EnumAlone.toNat
    lvl3_ryD

lvl5_ryF :: [EnumAlone.Nat]
[GblId]
lvl5_ryF =
  GHC.Base.map
    @ EnumAlone.Nat
    @ EnumAlone.Nat
    EnumAlone.Su
    EnumAlone.enumNatFromRep

EnumAlone.enumNatFromRep [Occ=LoopBreaker] :: [EnumAlone.Nat]
[GblId, Str=DmdType]
EnumAlone.enumNatFromRep =
  EnumAlone.||| @ EnumAlone.Nat lvl4_ryE lvl5_ryF

Note how toNat is entirely gone from the second part of the enumeration (lvl5_ryF). Strangely enough, the enumerator for Ze (lvl4_ryE) is still very complicated: map toNat ([Left ()]). Why doesn't GHC simplify this to just [Ze]? Apparently because GHC doesn't simplify map over a single element list.

Question: why doesn't GHC optimise map f [x] to [f x]?

Let's tell it to do so:

{-# RULES "map/map2" forall f x. map f (x:[]) = (f x):[] #-}

Now we're finally where we wanted:

lvl_ryA :: [EnumAlone.Nat]
[GblId, Caf=NoCafRefs]
lvl_ryA =
  GHC.Types.:
    @ EnumAlone.Nat EnumAlone.Ze (GHC.Types.[] @ EnumAlone.Nat)

lvl3_ryD :: [EnumAlone.Nat]
[GblId]
lvl3_ryD =
  GHC.Base.map
    @ EnumAlone.Nat
    @ EnumAlone.Nat
    EnumAlone.Su
    EnumAlone.enumNatFromRep

EnumAlone.enumNatFromRep [Occ=LoopBreaker] :: [EnumAlone.Nat]
[GblId, Str=DmdType]
EnumAlone.enumNatFromRep =
  EnumAlone.||| @ EnumAlone.Nat lvl_ryA lvl3_ryD

This is what I wanted: no more representation types (Either or ()), and the code looks exactly like what is generated for the handwritten enumNat.


More realistic generic programming

Now let's see if we can transport this to the setting of a generic programming library. I'll use a bare-bones version of GHC.Generics:

infixr 5 :+:
infixr 6 :*:

data U          = U              deriving (Show, Read)
data a :+: b    = L a | R b      deriving (Show, Read)
data a :*: b    = a :*: b        deriving (Show, Read)
newtype Var a   = Var a          deriving (Show, Read)
newtype Rec a   = Rec a          deriving (Show, Read)

class Representable a where
  type Rep a
  to   :: Rep a -> a
  from :: a -> Rep a

Let's represent Nat in this library:

instance Representable Nat where
  type Rep Nat = U :+: (Rec Nat)
  from Ze            = L U
  from (Su n)        = R (Rec n)
  to (L U)       = Ze
  to (R (Rec n)) = Su n

(Note, in particular, that we do not need INLINE pragmas on the from/to methods. This might just be because GHC thinks these are small and inlines them anyway, but in general we want to make sure they are inlined, so we typically use pragmas there.)

Now we need to implement enumeration generically. We do this by giving an instance for each representation type:

class GEnum' a where
  genum' :: [a]

instance GEnum' U where
  {-# INLINE genum' #-}
  genum' = [U]

instance (GEnum a) => GEnum' (Rec a) where
  {-# INLINE genum' #-}
  genum' = map Rec genum

instance (GEnum a) => GEnum' (Var a) where
  {-# INLINE genum' #-}
  genum' = map Var genum

instance (GEnum' f, GEnum' g) => GEnum' (f :+: g) where
  {-# INLINE genum' #-}
  genum' = map L genum' ||| map R genum'

instance (GEnum' f, GEnum' g) => GEnum' (f :*: g) where
  {-# INLINE genum' #-}
  --genum' = diag [ [ x :*: y | y <- genum' ] | x <- genum' ]
  genum' = diag (map (\x -> map (\y -> x :*: y) genum') genum')

We explicitly tell GHC to inline each case, as before. Note that for products I'm not using the more natural list comprehension syntax because I don't quite understand how that gets translated into core.

In the cases for Var and Rec we use genum from the GEnum class:

class GEnum a where
  genum :: [a]
  {-# INLINE genum #-}
  default genum :: (Representable a, GEnum' (Rep a)) => [a]
  genum = map to genum'

GEnum' is the class used for instantiating the generic representation types, and GEnum is used for user types. We use a default signature to provide a default method that can be used when we have a Representable instance for the type in question. This makes instantiating Nat very easy:

instance GEnum Nat

Unfortunately, the core code generated in this situation (with the same RULES as before) is not nice at all:

Main.$fGEnumNat_$cgenum [Occ=LoopBreaker] :: [Base.Nat]
[GblId, Str=DmdType]
Main.$fGEnumNat_$cgenum =
  GHC.Base.map
    @ (Base.Rep Base.Nat)
    @ Base.Nat
    Base.$fRepresentableNat_$cto
    (lvl37_r79y
     `cast` (Sym (GEnum.NTCo:GEnum') <Base.U Base.:+: (Base.Rec Base.Nat)> ;
                                       (GEnum.GEnum' (Sym (Base.TFCo:R:RepNat)) ;
                                         GEnum.NTCo:GEnum' <Base.Rep Base.Nat>)
             :: [Base.C Base.Nat_Ze_ Base.U
                 Base.:+: Base.C Base.Nat_Su_ (Base.Rec Base.Nat)]
                  ~#
                [Base.Rep Base.Nat]))

We see a map of the `to` function, which is definitely not what we want. Oddly enough, if we give an explicit definition of genum for Nat, with the inlined default...

instance GEnum Nat where genum = map to genum'

... then we get the optimised code we want:

lvl34_r79p :: [Base.Nat]
[GblId, Caf=NoCafRefs]
lvl34_r79p =
  GHC.Types.: @ Base.Nat Base.Ze (GHC.Types.[] @ Base.Nat)

lvl35_r79q :: [Base.Nat]
[GblId]
lvl35_r79q =
  GHC.Base.map @ Base.Nat @ Base.Nat Base.Su Main.$fGEnumNat_$cgenum

Main.$fGEnumNat_$cgenum [Occ=LoopBreaker] :: [Base.Nat]
[GblId, Str=DmdType]
Main.$fGEnumNat_$cgenum =
  GEnum.||| @ Base.Nat lvl34_r79p lvl35_r79q

Again, no representation types, no `to`, just the same code that is generated for enumNat. Perfect. But we had to avoid using the default definition, which is a pity.

Question: why won't GHC inline the default method of a class, even when I have a pragma telling it to do so?

Let's look at one more datatype, because Nat does not use products. So let's consider trees:

data Tree a = Leaf | Bin a (Tree a) (Tree a)

instance Representable (Tree a) where
  type Rep (Tree a) = U :+: (Var a :*: Rec (Tree a) :*: Rec (Tree a))
  from (Bin x l r) = R (Var x :*: Rec l :*: Rec r)
  from Leaf        = L U
  to (R (Var x :*: (Rec l) :*: (Rec r))) = Bin x l r
  to (L U)                               = Leaf

We give a GEnum instance using the same trick as before:

instance GEnum (Tree Int) where genum = map to genum'

(For simplicity only for trees of integers.) The generated code for trees is unfortunately not as nice:

a2_r79M
  :: [Base.Rec (Base.Tree GHC.Types.Int)
      Base.:*: Base.Rec (Base.Tree GHC.Types.Int)]
[GblId, Str=DmdType]
a2_r79M =
  GEnum.diag
    @ (Base.Rec (Base.Tree GHC.Types.Int)
       Base.:*: Base.Rec (Base.Tree GHC.Types.Int))
    lvl8_r79L

lvl9_r79N :: [Base.Tree GHC.Types.Int]
[GblId]
lvl9_r79N =
  GHC.Base.map
    @ (Base.Rec (Base.Tree GHC.Types.Int)
       Base.:*: Base.Rec (Base.Tree GHC.Types.Int))
    @ (Base.Tree GHC.Types.Int)
    lvl5_r79H
    a2_r79M

Main.$fGEnumTree_$cgenum [Occ=LoopBreaker]
  :: [Base.Tree GHC.Types.Int]
[GblId, Str=DmdType]
Main.$fGEnumTree_$cgenum =
  GEnum.||| @ (Base.Tree GHC.Types.Int) lvl4_r79G lvl9_r79N

Note how lvl9_r79N is a map over a2_r79M, and a2_r79M is a `diag`. Ok, we need the free theorem of `diag` to tell GHC how to commute the `diag` with map:

{-# RULES "ft/diag" forall f l. map f (diag l) = diag (map (map f) l) #-}

Unfortunately this doesn't change the generated core code. With some more debugging looking at the generated code at each simplifier iteration, I believe that this is because a2_r79M got lifted out too soon, prevent the rule from applying. With some imagination I decided to try the -fno-full-laziness flag to prevent let-floating. I'm not sure this is a good idea in general, but in this particular case it gives much better results:

a1_r72i :: [Base.Rec (Base.Tree GHC.Types.Int)]
[GblId, Str=DmdType]
a1_r72i =
  GHC.Base.map
    @ (Base.Tree GHC.Types.Int)
    @ (Base.Rec (Base.Tree GHC.Types.Int))
    ((\ (tpl_B1 :: Base.Tree GHC.Types.Int) -> tpl_B1)
     `cast` (<Base.Tree GHC.Types.Int>
             -> Sym (Base.NTCo:Rec <Base.Tree GHC.Types.Int>)
             :: (Base.Tree GHC.Types.Int -> Base.Tree GHC.Types.Int)
                  ~#
                (Base.Tree GHC.Types.Int -> Base.Rec (Base.Tree GHC.Types.Int))))
    Main.$fGEnumTree_$cgenum

Main.$fGEnumTree_$cgenum [Occ=LoopBreaker]
  :: [Base.Tree GHC.Types.Int]
[GblId, Str=DmdType]
Main.$fGEnumTree_$cgenum =
  GEnum.|||
    @ (Base.Tree GHC.Types.Int)
    (GHC.Types.:
       @ (Base.Tree GHC.Types.Int)
       (Base.Leaf @ GHC.Types.Int)
       (GHC.Types.[] @ (Base.Tree GHC.Types.Int)))
    (GEnum.diag
       @ (Base.Tree GHC.Types.Int)
       (GHC.Base.map
          @ (Base.Rec (Base.Tree GHC.Types.Int))
          @ [Base.Tree GHC.Types.Int]
          (\ (x_a1yQ :: Base.Rec (Base.Tree GHC.Types.Int)) ->
             GHC.Base.map
               @ (Base.Rec (Base.Tree GHC.Types.Int))
               @ (Base.Tree GHC.Types.Int)
               (\ (x1_X1zB :: Base.Rec (Base.Tree GHC.Types.Int)) ->
                  Base.Bin
                    @ GHC.Types.Int
                    a_r72h
                    (x_a1yQ
                     `cast` (Base.NTCo:Rec <Base.Tree GHC.Types.Int>
                             :: Base.Rec (Base.Tree GHC.Types.Int) ~# Base.Tree GHC.Types.Int))
                    (x1_X1zB
                     `cast` (Base.NTCo:Rec <Base.Tree GHC.Types.Int>
                             :: Base.Rec (Base.Tree GHC.Types.Int) ~# Base.Tree GHC.Types.Int)))
               a1_r72i)
          a1_r72i))

Note how our enum is now of the shape `[Leaf] ||| diag y`, which is good. The only catch is that there are some `Rec`s still laying around, with their associated newtype coercions, and a function a1_r72i that basically wraps the recursive enumeration in a Rec, only to be unwrapped in the body of `$fGEnumTree_$cgenum`. I don't know how to get GHC to simplify this code any further.

Question: why do I need -fno-full-laziness for the ft/diag rule to apply?

Question: why is GHC not getting rid of the Rec newtype in this case?

I have also played with -O2, in particular because of the SpecConstr optimisation, but found that it does not affect these particular examples (perhaps it only becomes important with larger datatypes). I have also experimented with phase control in the rewrite rules and the inline pragmas, but didn't find it necessary for this example. In general, anyway, my experience with the inliner is that it is extremely fragile, especially across different GHC versions, and it's hard to get any guarantees of optimisation. I have also played with the -funfolding-* options before, with mixed results. [1] It's also a pity that certain flags are not explained in detail in the user's manual [2,3], like -fliberate-case, and -fspec-constr-count and threshold, for instance.

Thank you for reading this. Any insights are welcome. In particular, I'm wondering if I might be missing some details regarding strictness.


Cheers,
Pedro

[1] http://dreixel.net/research/pdf/ogie.pdf
[2] http://www.haskell.org/ghc/docs/latest/html/users_guide/flag-reference.html
[3] http://www.haskell.org/ghc/docs/latest/html/users_guide/options-optimise.html#options-f