
Dear all,

while doing some benchmarking (*) I noticed that function s1 is considerably faster than s2 (but I wanted s2 because it looks more natural). For n = 10000, s1 takes 13 s and s2 takes 20 s; compiled by ghc-7.4.2 -O2.

    s1 :: Int -> Int
    s1 n = sum $ do
        x <- [ 0 .. n-1 ]
        return $ sum $ do
            y <- [ 0 .. n-1 ]
            return $ gcd x y

    s2 :: Int -> Int
    s2 n = sum $ do
        x <- [ 0 .. n-1 ]
        y <- [ 0 .. n-1 ]
        return $ gcd x y

I was expecting that in both programs all lists would be fused away (are they?), so that the code generator can essentially produce straightforward assembly code (no allocations, no closures, etc.).

For reference, I also wrote the equivalent imperative program (two nested loops, one accumulator for the sum, with the straightforward recursive gcd). The runtimes, for the same input as above, are:

    C/gcc: 7.3 s, Java: 7.7 s, C#/Mono: 8.7 s

So they sort of agree with each other, but disagree with ghc. Where does the factor 2 come from? Lists? Laziness? Does ghc turn the tail recursion (in gcd) into a loop? (gcc does.) (I am looking at -ddump-asm but can't quite see through it.)

(*) The benchmarking is meant to show that today's compilers are clever enough that the choice of paradigm/language does not really matter for this kind of low-level programming.
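The "straightforward recursive gcd" mentioned above might look like the sketch below (`myGcd` is a hypothetical name; the benchmark presumably used the Prelude's gcd, which behaves the same for these inputs). A self-recursive tail call like this is compiled by GHC to a jump, i.e. effectively a loop.

```haskell
-- Hypothetical stand-in for the "straightforward recursive gcd"
-- referred to in the post; Euclid's algorithm via `mod`.
myGcd :: Int -> Int -> Int
myGcd a 0 = a
myGcd a b = myGcd b (a `mod` b)
```
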

s1 ~ sum $ map (sum . flip map [0..n] . gcd) [0..n]
s2 ~ sum $ concatMap (flip map [0..n] . gcd) [0..n]
There are some posts from Joachim Breitner investigating fusion for
concatMap:
http://www.haskell.org/pipermail/haskell-cafe/2011-December/thread.html#9722...
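Dmitry's rewrites above can be checked directly against the original definitions. A self-contained sketch (I use [0 .. n-1] throughout, as in the original post, where Dmitry's shorthand had [0..n]):

```haskell
-- Original definitions from the thread, plus Dmitry's map/concatMap forms.
s1, s1', s2, s2' :: Int -> Int

s1 n = sum $ do
  x <- [0 .. n-1]
  return $ sum $ do
    y <- [0 .. n-1]
    return $ gcd x y

-- s1 as a sum of per-row sums
s1' n = sum $ map (sum . flip map [0 .. n-1] . gcd) [0 .. n-1]

s2 n = sum $ do
  x <- [0 .. n-1]
  y <- [0 .. n-1]
  return $ gcd x y

-- s2 as a sum over one concatenated list
s2' n = sum $ concatMap (flip map [0 .. n-1] . gcd) [0 .. n-1]
```
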
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

I wonder why this performs really badly, though (I would expect it to be the same as s2):

    s3 :: Int -> Int
    s3 n = sum [ gcd x y | x <- [ 0 .. n-1 ], y <- [ 0 .. n-1 ] ]
From the links posted by Dmitry, it might be that the generated code consists of two recursive calls; in fact, what I observe is a "stack space overflow" error at runtime...
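If the stack overflow comes from sum building a long chain of lazy (+) thunks, which happens when its foldl is not turned into a strict accumulator loop (e.g. when compiling without -O2), a strict left fold sidesteps it. A sketch, with `s3'` as a hypothetical name:

```haskell
import Data.List (foldl')

-- Like s3, but summing with a strict left fold so the accumulator
-- is forced at each step instead of building a thunk chain.
s3' :: Int -> Int
s3' n = foldl' (+) 0 [ gcd x y | x <- [0 .. n-1], y <- [0 .. n-1] ]
```
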
L.

In my test it works ~20% faster than s2 and ~20% slower than s1.
Did you use the -O2 flag?

You are right; probably I didn't, because I cannot reproduce it now.
Sorry for the noise.
(Anyway, I am still surprised that list-comprehension gives a different
result from do-notation in the list monad.)
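The two notations really do take different routes through the compiler: do-notation goes through (>>=), which for lists is concatMap, while with -O GHC desugars a comprehension directly into foldr (and build, for fusion). Roughly, and with illustrative names only:

```haskell
-- do-notation desugaring: via (>>=), i.e. concatMap for the list monad
viaDo :: Int -> [Int]
viaDo n = [0 .. n-1] >>= \x -> [0 .. n-1] >>= \y -> return (gcd x y)

-- approximate shape of the -O comprehension desugaring: nested foldr,
-- which foldr/build fusion can consume directly
viaComp :: Int -> [Int]
viaComp n =
  foldr (\x r -> foldr (\y r' -> gcd x y : r') r [0 .. n-1]) [] [0 .. n-1]
```
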
L.

On 25 June 2012 02:04, Johannes Waldmann
> I was expecting that in both programs, all lists will be fused away (are they?)
Almost certainly not.
> so the code generator essentially can produce straightforward assembly code (no allocations, no closures, etc.)
Unless it changed recently, sum does not fuse (as it is currently defined, using the current implementation of foldr/build fusion). Also, lists built using do notation do not (I think) translate into instances of foldr and build; only list comprehension syntax does.

On the first point: sum is a foldl, and the current implementation of foldr/build fusion does not cope well with foldl. While foldl can be defined in terms of foldr, the result is lots of closure allocations. This could in principle be fixed with an arity raising transformation, and GHC now does have a simple implementation of this transformation, so it may be possible to get sum as a foldl to fuse. I'm not sure that anyone has yet tried changing the sum implementation to get it to fuse. It would be an interesting experiment.

On the second point: ghc has a special desugaring for list comprehensions (in -O mode) where it turns them into uses of foldr and build. On the other hand, do notation desugars into bind and return. I'm not sure how well the result fuses; it uses: foldr ((++) . k) []. You can find out: just look at the core. If all goes well, you should see a single list being built and then consumed by sum.

Duncan
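Duncan's point about foldl-as-foldr can be made concrete. Defining foldl via foldr makes foldr build a chain of function closures, one per list element, which is exactly the allocation problem that the current foldr/build fusion handles badly without an arity raising transformation. A sketch:

```haskell
-- foldl expressed via foldr: each step builds a closure that waits
-- for the accumulator, so an n-element list allocates n closures.
foldlViaFoldr :: (b -> a -> b) -> b -> [a] -> b
foldlViaFoldr f z xs = foldr (\x k -> \acc -> k (f acc x)) id xs z
```
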

It's described in Andy Gill's PhD thesis (which covers foldr/build fusion):
http://ittc.ku.edu/~andygill/paper.php?label=GillPhD96
Section 4.4 describes the basic ideas. There aren't any further details, though.
Max's Strict Core paper also describes it a bit (Section 6):
http://www.cl.cam.ac.uk/~mb566/papers/tacc-hs09.pdf
On 27 June 2012 08:58, Dominic Steinitz wrote:
Duncan Coutts writes:
> This could in principle be fixed with an arity raising transformation,

Do you have a reference to arity raising transformations?

Thanks, Dominic.
-- Push the envelope. Watch it bend.

Hi,

Johannes Waldmann wrote:
> s2 :: Int -> Int
> s2 n = sum $ do
>     x <- [ 0 .. n-1 ]
>     y <- [ 0 .. n-1 ]
>     return $ gcd x y
This code shows some interesting behaviour: its runtime depends heavily on the allocation area size.

For comparison, with main = print $ s1 10000 I see the following GC statistics.

    # ./a.out +RTS -A512k -s
      15,201,891,976 bytes allocated in the heap
           4,272,144 bytes copied during GC
      Total time  9.97s ( 10.00s elapsed)
      %GC time    1.1%  ( 1.1% elapsed)

For s2, using main = print $ s2 10000, I get

    # ./s2 +RTS -A512k -s
      20,801,251,976 bytes allocated in the heap
       3,438,870,504 bytes copied during GC
      Total time  15.90s ( 15.95s elapsed)
      %GC time   34.3%  (34.4% elapsed)

    # ./s2 +RTS -A1m -s
      20,801,251,976 bytes allocated in the heap
       2,724,903,032 bytes copied during GC
      Total time  14.74s ( 14.80s elapsed)
      %GC time   29.2%  (29.3% elapsed)

    # ./s2 +RTS -A2m -s
      20,801,251,976 bytes allocated in the heap
          20,311,952 bytes copied during GC
      Total time  10.74s ( 10.78s elapsed)
      %GC time    1.2%  ( 1.2% elapsed)

    # ./a.out +RTS -A2052k -s
      20,801,251,976 bytes allocated in the heap
           1,812,776 bytes copied during GC
      Total time  10.35s ( 10.38s elapsed)
      %GC time    0.8%  ( 0.8% elapsed)

Note that the number of bytes copied during GC drops by three orders of magnitude when we increase the allocation area size from 1 MB to 2 MB, and by another order of magnitude when adding an additional 4 kB to that.

Why does this happen? The reason is a failure of the generational GC heuristics. If we look at the core, we find that the inner loop (which generates a list that is consumed by sum) translates to something like

    nums = [0..9999]

    go xs0 = case xs0 of
        []     -> []
        x : xs ->
            let go' ys0 = case ys0 of
                    []     -> []
                    y : ys -> GHC.Real.gcdInt x y : go' ys
            in  case go' nums of
                    []       -> go xs
                    (z : zs) -> z : zs ++ go xs

At a first glance, this looks fine: there's one big chunk of fixed data (the nums list) that will end up in the old generation, and the rest of the data is consumed as it is created.

However, this is not quite true: the thunk for the second argument to (++) (representing 'go xs') is already allocated on the heap when the first element of the result of (++), i.e., the first element of zs, is consumed. While zs is processed, this thunk is never touched. If it survives a whole GC-mutator cycle, then the next garbage collection will consider the thunk mature and put it into the old generation. But when the code starts evaluating this 'go xs' call, it produces a long list, all of which is kept alive by the (now updated) thunk in generation 1, and as a result will be copied during GC until the next major GC.

So the observed behaviour hinges on the question whether the 'go xs' thunk can survive processing the zs list. The memory allocated during this time is constant: besides the list, some memory is allocated for thunks in go' and for intermediate Integers[*] in gcdInt. If the allocation area size exceeds this constant, then the program will run as expected. Note that every time a 'go xs' thunk survives, a lot of extra work is caused for the GC; this explains the sharp increase in bytes copied observed above.

Bertram

[*] gcdInt really ought to do only a single allocation for the result, with an inlining opportunity to get rid of that as well. It is defined in GHC.Real as

    gcdInt :: Int -> Int -> Int
    gcdInt a b = fromIntegral (gcdInteger (fromIntegral a) (fromIntegral b))

which used to optimize to a single low-level GHC.Integer.Type.gcdInt call in ghc 7.2. With 7.4 and HEAD, integerToInt and smallInteger are no longer inlined, resulting in worse code. See also http://hackage.haskell.org/trac/ghc/ticket/7041
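One way to confirm Bertram's diagnosis is to remove the intermediate list entirely, and with it the 'go xs' thunk behind (++), by interleaving the two loops over a strict accumulator. A sketch (`s2loop` is a hypothetical name, not code from the thread; BangPatterns assumed):

```haskell
{-# LANGUAGE BangPatterns #-}

-- Nested counting loops with a strict accumulator: no list cells,
-- no (++) thunk, so nothing for the generational GC to mis-tenure.
s2loop :: Int -> Int
s2loop n = outer 0 0
  where
    outer x !acc
      | x >= n    = acc
      | otherwise = inner x 0 acc
    inner x y !acc
      | y >= n    = outer (x + 1) acc
      | otherwise = inner x (y + 1) (acc + gcd x y)
```
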
participants (7):
- Bertram Felgenhauer
- Dmitry Olshansky
- Dominic Steinitz
- Duncan Coutts
- Johannes Waldmann
- Lorenzo Bolla
- Thomas Schilling