
Hi,

Johannes Waldmann wrote:
    s2 :: Int -> Int
    s2 n = sum $ do
        x <- [ 0 .. n-1 ]
        y <- [ 0 .. n-1 ]
        return $ gcd x y
This code shows some interesting behaviour: its runtime depends heavily on the allocation area size.

For comparison, with main = print $ s1 10000 I see the following GC statistics:

    # ./a.out +RTS -A512k -s
    15,201,891,976 bytes allocated in the heap
         4,272,144 bytes copied during GC
    Total time    9.97s  ( 10.00s elapsed)
    %GC time       1.1%  (1.1% elapsed)

For s2, using main = print $ s2 10000, I get

    # ./s2 +RTS -A512k -s
    20,801,251,976 bytes allocated in the heap
     3,438,870,504 bytes copied during GC
    Total time   15.90s  ( 15.95s elapsed)
    %GC time      34.3%  (34.4% elapsed)

    # ./s2 +RTS -A1m -s
    20,801,251,976 bytes allocated in the heap
     2,724,903,032 bytes copied during GC
    Total time   14.74s  ( 14.80s elapsed)
    %GC time      29.2%  (29.3% elapsed)

    # ./s2 +RTS -A2m -s
    20,801,251,976 bytes allocated in the heap
        20,311,952 bytes copied during GC
    Total time   10.74s  ( 10.78s elapsed)
    %GC time       1.2%  (1.2% elapsed)

    # ./a.out +RTS -A2052k -s
    20,801,251,976 bytes allocated in the heap
         1,812,776 bytes copied during GC
    Total time   10.35s  ( 10.38s elapsed)
    %GC time       0.8%  (0.8% elapsed)

Note that the number of bytes copied during GC drops by more than two orders of magnitude when the allocation area size grows from 1 MB to 2 MB, and by another order of magnitude when a further 4 kB is added on top of that.

Why does this happen? The reason is a failure of the generational GC heuristics. If we look at the Core, we find that the inner loop (which generates a list that is consumed by sum) translates to something like

    nums = [0..9999]

    go xs0 = case xs0 of
        []     -> []
        x : xs ->
            let go' ys0 = case ys0 of
                    []     -> []
                    y : ys -> GHC.Real.gcdInt x y : go' ys
            in case go' nums of
                   []       -> go xs
                   (z : zs) -> z : zs ++ go xs

At first glance this looks fine - there is one big chunk of fixed data (the nums list) that will end up in the old generation, and the rest of the data is consumed as it is created. However, this is not quite true: the thunk for the second argument of (++) (representing 'go xs') has already been allocated on the heap by the time the first element of the result of (++), i.e. the first element of zs, is consumed. While zs is processed, this thunk is never touched. If it survives a whole GC-mutator cycle, the next garbage collection will consider the thunk mature and put it into the old generation. But when the code starts evaluating this 'go xs' call, it produces a long list, all of which is kept alive by the (now updated) thunk in generation 1, and as a result is copied during GC until the next major GC.

So the observed behaviour hinges on whether the 'go xs' thunk survives the processing of the zs list. The memory allocated during this time is constant: besides the list, some memory is allocated for thunks in go' and for intermediate Integers [*] in gcdInt. If the allocation area size exceeds this constant, the program runs as expected. Note that every time a 'go xs' thunk survives, a lot of extra work is caused for the GC -- this explains the sharp increase in bytes copied observed above.

Bertram

[*] gcdInt really ought to do only a single allocation for the result, with an inlining opportunity to get rid of that as well. It is defined in GHC.Real as

    gcdInt :: Int -> Int -> Int
    gcdInt a b = fromIntegral (gcdInteger (fromIntegral a) (fromIntegral b))

which used to optimize to a single low-level GHC.Integer.Type.gcdInt call in GHC 7.2. With 7.4 and HEAD, integerToInt and smallInteger are no longer inlined, resulting in worse code. See also http://hackage.haskell.org/trac/ghc/ticket/7041
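
PS: two small asides that may help make the above concrete.

The promotion story relies on (++) leaving its second argument untouched until the first list is exhausted, which is exactly what the usual Prelude definition does:

    (++) :: [a] -> [a] -> [a]
    []       ++ ys = ys
    (x : xs) ++ ys = x : (xs ++ ys)

so the 'go xs' thunk just sits untouched on the heap while all of zs is walked.

And regarding the footnote: what one would hope to see after inlining is a plain Int loop with no detour through Integer, along the lines of

    -- a sketch only; gcdInt' is not an existing GHC.Real function,
    -- and it ignores the "gcd 0 0 is undefined" corner case
    gcdInt' :: Int -> Int -> Int
    gcdInt' a b = go (abs a) (abs b)
      where
        go x 0 = x
        go x y = go y (x `rem` y)

which allocates nothing beyond the boxed Int result, and even that can go away once the call is inlined into a strict context such as the sum above.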