
Hi,

Johannes Waldmann wrote:
    s2 :: Int -> Int
    s2 n = sum $ do
        x <- [ 0 .. n-1 ]
        y <- [ 0 .. n-1 ]
        return $ gcd x y
This code shows some interesting behaviour: its runtime depends heavily on the allocation area size.

For comparison, with main = print $ s1 10000 I see the following GC statistics:

    # ./a.out +RTS -A512k -s
    15,201,891,976 bytes allocated in the heap
         4,272,144 bytes copied during GC
    Total time    9.97s  ( 10.00s elapsed)
    %GC time       1.1%  (1.1% elapsed)

For s2, using main = print $ s2 10000, I get

    # ./s2 +RTS -A512k -s
    20,801,251,976 bytes allocated in the heap
     3,438,870,504 bytes copied during GC
    Total time   15.90s  ( 15.95s elapsed)
    %GC time      34.3%  (34.4% elapsed)

    # ./s2 +RTS -A1m -s
    20,801,251,976 bytes allocated in the heap
     2,724,903,032 bytes copied during GC
    Total time   14.74s  ( 14.80s elapsed)
    %GC time      29.2%  (29.3% elapsed)

    # ./s2 +RTS -A2m -s
    20,801,251,976 bytes allocated in the heap
        20,311,952 bytes copied during GC
    Total time   10.74s  ( 10.78s elapsed)
    %GC time       1.2%  (1.2% elapsed)

    # ./a.out +RTS -A2052k -s
    20,801,251,976 bytes allocated in the heap
         1,812,776 bytes copied during GC
    Total time   10.35s  ( 10.38s elapsed)
    %GC time       0.8%  (0.8% elapsed)

Note that the number of bytes copied during GC drops by more than two orders of magnitude when the allocation area size grows from 1 MB to 2 MB, and by another order of magnitude when a further 4 kB is added on top of that.

Why does this happen? The reason is a failure of the generational GC heuristics. If we look at the Core, we find that the inner loop (which generates a list that is consumed by sum) translates to something like

    nums = [0..9999]

    go xs0 = case xs0 of
        []     -> []
        x : xs ->
            let go' ys0 = case ys0 of
                    []     -> []
                    y : ys -> GHC.Real.gcdInt x y : go' ys
            in case go' nums of
                   []       -> go xs
                   (z : zs) -> z : zs ++ go xs

At first glance this looks fine - there is one big chunk of fixed data (the nums list) that will end up in the old generation, and the rest of the data is consumed as it is created. However, this is not quite true: the thunk for the second argument of (++) (representing 'go xs') has already been allocated on the heap by the time the first element of the result of (++), i.e. the first element of zs, is consumed. While zs is processed, this thunk is never touched. If it survives a whole GC-mutator cycle, the next garbage collection will consider the thunk mature and put it into the old generation. But when the code starts evaluating this 'go xs' call, it produces a long list, all of which is kept alive by the (now updated) thunk in generation 1, and as a result is copied during GC until the next major GC.

So the observed behaviour hinges on whether the 'go xs' thunk survives the processing of the zs list. The memory allocated during this time is constant: besides the list, some memory is allocated for thunks in go' and for intermediate Integers [*] in gcdInt. If the allocation area size exceeds this constant, the program runs as expected. Note that every time a 'go xs' thunk survives, a lot of extra work is caused for the GC -- this explains the sharp increase in bytes copied observed above.

Bertram

[*] gcdInt really ought to do only a single allocation for the result, with an inlining opportunity to get rid of that as well. It is defined in GHC.Real as

    gcdInt :: Int -> Int -> Int
    gcdInt a b = fromIntegral (gcdInteger (fromIntegral a) (fromIntegral b))

which used to optimize to a single low-level GHC.Integer.Type.gcdInt call in GHC 7.2. With 7.4 and HEAD, integerToInt and smallInteger are no longer inlined, resulting in worse code. See also http://hackage.haskell.org/trac/ghc/ticket/7041
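
PS: two small asides that may help make the above concrete.

The promotion story relies on (++) leaving its second argument untouched until the first list is exhausted, which is exactly what the usual Prelude definition does:

    (++) :: [a] -> [a] -> [a]
    []       ++ ys = ys
    (x : xs) ++ ys = x : (xs ++ ys)

so the 'go xs' thunk just sits untouched on the heap while all of zs is walked.

And regarding the footnote: what one would hope to see after inlining is a plain Int loop with no detour through Integer, along the lines of

    -- a sketch only; gcdInt' is not an existing GHC.Real function,
    -- and it ignores the "gcd 0 0 is undefined" corner case
    gcdInt' :: Int -> Int -> Int
    gcdInt' a b = go (abs a) (abs b)
      where
        go x 0 = x
        go x y = go y (x `rem` y)

which allocates nothing beyond the boxed Int result, and even that can go away once the call is inlined into a strict context such as the sum above.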