Call Arity, oneShot or both

Hi, some months ago I tried to make foldl a good consumer in the common case. The starting point is always to write foldl k a xs = foldr (\v f a -> f (v `k` a)) id xs a and then somehow make GHC produce good code with this. I came up with two solutions: A more sophisticated compiler analysis (Call Arity), or an explicit annotation in the form of foldlB k a xs = foldr (\v f -> oneShot (\a -> f (v `k` a))) id xs a where oneShot :: (a -> b) -> (a -> b) is a built-in function, semantically the identity, but telling the compiler that it can assume that the (\a -> ...) is called at most once. Back then, we decided to use Call Arity, on the grounds that it might improve other code as well, despite not having a lot of evindence that this may happen. Then recently David Feuer built on my work by making more functions fuse, including functions like scanl that are not built on foldl, but benefit from the same analysis. This supports the usefulness of Call Arity. But he also found cases where Call Arity is not powerful enough and GHC would produce bad code. So I wanted to properly compare Call Arity with the oneShot approach. Based on todays master (0855b249), I disabled Call Arity and changed the definitions of foldl, foldl', scanl and scanl' to use oneShot, and ran nofib. The result are mixed. With the current code, the oneShot machinery does not always work as expected: Program Size Allocs Runtime Elapsed TotalMem Min -0.1% -1.5% -2.8% -2.8% -5.8% Max +0.4% +4.7% +5.8% +5.6% +5.4% Geometric Mean -0.0% +0.1% +0.3% +0.3% +0.1% The biggest loser is calendar, which uses scanl. I am not fully sure what went wrong here: Either the one-shot annotation on the lambda’s variable got lost somewhere in the pipeline, or despite it being there, the normal arity analysis did not use it. But there is also a winner, fft2, with -1.5% allocations. Here Call Arity was not good enough, but oneShot did the jobs. There is also the option of combining both. Then we do not get the regression, but still the improvement for fft2: Min -0.1% -1.5% -3.9% -3.8% -5.8% Max +0.2% +0.1% +6.4% +6.3% +13.1% Geometric Mean -0.0% -0.0% +0.0% +0.0% +0.1% The oneShot code is on the branch wip/oneShot. The changes are clearly not ready to be merged. In particular, there is the question of how to best keep the oneShot annotation in the unfoldings: The isOneShotLambda flag is currently not stored in the interface. I work around this by making sure that the oneShot function is never inlined in unfoldings, but maybe it would be better to serialize the isOneShotLambda flag in interfaces, which might have other good effects? If we want as much performance as possible, we should simply include both approaches. But there might be other things to consider... so not sure what the best thing to do is. Greetings, Joachim -- Joachim “nomeata” Breitner mail@joachim-breitner.de • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org

Thanks for the interesting analysis Joachim. I wouldn't trust nofib's
runtime numbers (but the allocation ones should be good). The tests don't
run long enough and the way we don't handle typical noise issues (e.g.
clock resolution, benchmark length, etc) well enough. I'd love to see some
numbers using Criterion instead.
It might also be worth to check for which programs the Core changed.
On Sat, Oct 25, 2014 at 9:24 AM, Joachim Breitner
Hi,
some months ago I tried to make foldl a good consumer in the common case. The starting point is always to write
foldl k a xs = foldr (\v f a -> f (v `k` a)) id xs a
and then somehow make GHC produce good code with this. I came up with two solutions: A more sophisticated compiler analysis (Call Arity), or an explicit annotation in the form of
foldlB k a xs = foldr (\v f -> oneShot (\a -> f (v `k` a))) id xs a
where oneShot :: (a -> b) -> (a -> b) is a built-in function, semantically the identity, but telling the compiler that it can assume that the (\a -> ...) is called at most once.
Back then, we decided to use Call Arity, on the grounds that it might improve other code as well, despite not having a lot of evindence that this may happen.
Then recently David Feuer built on my work by making more functions fuse, including functions like scanl that are not built on foldl, but benefit from the same analysis. This supports the usefulness of Call Arity.
But he also found cases where Call Arity is not powerful enough and GHC would produce bad code. So I wanted to properly compare Call Arity with the oneShot approach.
Based on todays master (0855b249), I disabled Call Arity and changed the definitions of foldl, foldl', scanl and scanl' to use oneShot, and ran nofib.
The result are mixed. With the current code, the oneShot machinery does not always work as expected:
Program Size Allocs Runtime Elapsed TotalMem Min -0.1% -1.5% -2.8% -2.8% -5.8% Max +0.4% +4.7% +5.8% +5.6% +5.4% Geometric Mean -0.0% +0.1% +0.3% +0.3% +0.1%
The biggest loser is calendar, which uses scanl. I am not fully sure what went wrong here: Either the one-shot annotation on the lambda’s variable got lost somewhere in the pipeline, or despite it being there, the normal arity analysis did not use it.
But there is also a winner, fft2, with -1.5% allocations. Here Call Arity was not good enough, but oneShot did the jobs.
There is also the option of combining both. Then we do not get the regression, but still the improvement for fft2:
Min -0.1% -1.5% -3.9% -3.8% -5.8% Max +0.2% +0.1% +6.4% +6.3% +13.1% Geometric Mean -0.0% -0.0% +0.0% +0.0% +0.1%
The oneShot code is on the branch wip/oneShot. The changes are clearly not ready to be merged. In particular, there is the question of how to best keep the oneShot annotation in the unfoldings: The isOneShotLambda flag is currently not stored in the interface. I work around this by making sure that the oneShot function is never inlined in unfoldings, but maybe it would be better to serialize the isOneShotLambda flag in interfaces, which might have other good effects?
If we want as much performance as possible, we should simply include both approaches. But there might be other things to consider... so not sure what the best thing to do is.
Greetings, Joachim
-- Joachim “nomeata” Breitner mail@joachim-breitner.de • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org
_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://www.haskell.org/mailman/listinfo/ghc-devs

| The biggest loser is calendar, which uses scanl. I am not fully sure | what went wrong here: Either the one-shot annotation on the lambda’s | variable got lost somewhere in the pipeline, or despite it being there, | the normal arity analysis did not use it. | | But there is also a winner, fft2, with -1.5% allocations. Here Call | Arity was not good enough, but oneShot did the jobs. It's always useful to discover what happens. Why didn't Call Arity do the job? How did the one-shot-ness get lost? I know it takes work to find out these things; but sometimes they show up something that is easy to fix, and which gives improvements across the board. | best keep the oneShot annotation in the unfoldings: The isOneShotLambda | flag is currently not stored in the interface. I work around this by | making sure that the oneShot function is never inlined in unfoldings, | but maybe it would be better to serialize the isOneShotLambda flag in | interfaces, which might have other good effects? Serialising the one-shot lambda info sounds like a good plan to me. | If we want as much performance as possible, we should simply include | both approaches. But there might be other things to consider... so not | sure what the best thing to do is. I'd be inclined to do both. Call-arity hits programs where the programmer had no idea that there was something to do; or where the lambda isn't *statically* one-shot... it only becomes so when inlined into a particular context. One-shot-lambdas may help in situations where the one-shot-ness is manifestly too hard to spot. Good work! Keep a wiki page to describe the choices, point to the tickets, etc Simon

Hi, Am Montag, den 27.10.2014, 22:36 +0000 schrieb Simon Peyton Jones:
| The biggest loser is calendar, which uses scanl. I am not fully sure | what went wrong here: Either the one-shot annotation on the lambda’s | variable got lost somewhere in the pipeline, or despite it being there, | the normal arity analysis did not use it. | | But there is also a winner, fft2, with -1.5% allocations. Here Call | Arity was not good enough, but oneShot did the jobs.
It's always useful to discover what happens. Why didn't Call Arity do the job?
I’ll see what I can do. But we already know that Call Arity is not complete, chance are high that it is among the known unsolvable cases (e.g. a recursive call in an argument position).
Good work! Keep a wiki page to describe the choices, point to the tickets, etc
I added a proposal page https://ghc.haskell.org/trac/ghc/wiki/OneShot here. I can do the implementation, including the interface changes, but I am not sure that I’ll be able to investigate each core2core transformation for where oneShot flags might be lost. But the good thing is that we can still deploy the whole thing without regressions and make then iteratively make the compiler better in preserving this bit.
| best keep the oneShot annotation in the unfoldings: The isOneShotLambda | flag is currently not stored in the interface. I work around this by | making sure that the oneShot function is never inlined in unfoldings, | but maybe it would be better to serialize the isOneShotLambda flag in | interfaces, which might have other good effects?
Serialising the one-shot lambda info sounds like a good plan to me.
Ok, thanks for guidance. Is https://ghc.haskell.org/trac/ghc/wiki/OneShot#PreservationofsetOneShotLambda... a sensible design? Greetings, Joachim -- Joachim “nomeata” Breitner mail@joachim-breitner.de • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org

| > Serialising the one-shot lambda info sounds like a good plan to me. | | Ok, thanks for guidance. Is | https://ghc.haskell.org/trac/ghc/wiki/OneShot#PreservationofsetOneShotLam | bdaacrossmoduleboundaries | a sensible design? Generally yes, but I'd define a new data IfaceLamBndr, rather like IfaceLetBndr, rather than clutter up IfaceLam itself. Simon

HI, Am Dienstag, den 28.10.2014, 14:42 +0000 schrieb Simon Peyton Jones:
| > Serialising the one-shot lambda info sounds like a good plan to me. | | Ok, thanks for guidance. Is | https://ghc.haskell.org/trac/ghc/wiki/OneShot#PreservationofsetOneShotLam | bdaacrossmoduleboundaries | a sensible design?
Generally yes, but I'd define a new data IfaceLamBndr, rather like IfaceLetBndr, rather than clutter up IfaceLam itself.
heh, that’s what I ended up doing :-) It seems to work quite well, I’m heating my room right now with a few nofib runs of various combinations. Also your suggestion to investigate fft2 was good: Turns out iterateFB would never inline, which defeats the purpose. With that fixed (just pushed to master) I expect Call Arity to handle the case in fft2 as well; well’ll see when the benchmarks finish. Greetings, Joachim -- Joachim “nomeata” Breitner mail@joachim-breitner.de • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org

Hi, Am Dienstag, den 28.10.2014, 15:45 +0100 schrieb Joachim Breitner:
Also your suggestion to investigate fft2 was good: Turns out iterateFB would never inline, which defeats the purpose. With that fixed (just pushed to master) I expect Call Arity to handle the case in fft2 as well; well’ll see when the benchmarks finish.
indeed. So nofib gives no hard evidence that oneShot does any good (nor does it do any harm).¹ But since it is plausible that there are cases out there where it might help, even if just a little, we could go forward –unless the implementation becomes ugly. It also seems that the OS=Once flag survives most transformations just fine; I had to add it to the list of IdInfo flags that make it through TidyCore, though. Serializing and reading it from the interface was also quite smooth. I think I can prepare a differential revision soon. After writing some Notes :-) Greetings, Joachim ¹ We need more benchmarks. -- Joachim “nomeata” Breitner mail@joachim-breitner.de • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org

| But since it is plausible that there are cases out there where it might | help, even if just a little, we could go forward –unless the | implementation becomes ugly. Yes, I'm content with that, provided it is well documented. In general, I think that when a programmer wants to be sure something is going to happen, it's better to allow them provide a pragma or annotation to say "this is what I want" rather than to rely on a complex and (who knows?) perhaps fragile analysis. So I'm all for keeping 'oneShot' in the definition of foldl or whatever, even if the analysis spots it. But with a Note please! Simon

Hi, Am Dienstag, den 28.10.2014, 22:02 +0000 schrieb Simon Peyton Jones:
So I'm all for keeping 'oneShot' in the definition of foldl or whatever, even if the analysis spots it. But with a Note please!
Three Notes actually! :-) I created three mouth-sized commits, each of which can be reviewed independently: https://phabricator.haskell.org/D391 https://phabricator.haskell.org/D392 https://phabricator.haskell.org/D393 You can also review the wip/oneShot branch if you prefer that or want to play around with it. The Wiki page will see some updates (e.g. moving „Challenges“ to „Implementation“) when I get to merge these. Greetings, Joachim -- Joachim “nomeata” Breitner mail@joachim-breitner.de • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org
participants (3)
-
Joachim Breitner
-
Johan Tibell
-
Simon Peyton Jones