Re: performance regressions

Joachim Breitner
Hi,
Am Montag, den 15.12.2014, 10:58 -0500 schrieb Ben Gamari:
- Travis has not picked up on these errors.
unfortunately, travis is slighly less useful since a few weeks due to T5681 failing (possibly due to the use of LLVM-3.4), but I’m still waiting for an reply on that issue.
You aren't looking for a response from me on this, are you? I just checked and I don't seem to have any outstanding messages from you but it's entirely possible I overlooked something.
this is independent of our arm issues, and I think a tad older; I did not direct it to anyone specific.
But I guess you are likely a person that can tell what’s wrong here:
Am Sonntag, den 30.11.2014, 20:01 +0100 schrieb Joachim Breitner:
Compile failed (status 256) errors were: /tmp/ghc16123_0/ghc16123_5.s: Assembler messages:
/tmp/ghc16123_0/ghc16123_5.s:26:0: Error: can't resolve `.rodata' {.rodata section} - `Main_zdwwork_info$def' {.text section}
/tmp/ghc16123_0/ghc16123_5.s:46:0: Error: can't resolve `.rodata' {.rodata section} - `Main_work_info$def' {.text section}
/tmp/ghc16123_0/ghc16123_5.s:66:0: Error: can't resolve `.rodata' {.rodata section} - `Main_main1_info$def' {.text section}
/tmp/ghc16123_0/ghc16123_5.s:86:0: Error: can't resolve `.rodata' {.rodata section} - `Main_main_info$def' {.text section}
/tmp/ghc16123_0/ghc16123_5.s:106:0: Error: can't resolve `.rodata' {.rodata section} - `Main_main2_info$def' {.text section}
/tmp/ghc16123_0/ghc16123_5.s:126:0: Error: can't resolve `.rodata' {.rodata section} - `ZCMain_main_info$def' {.text section}
*** unexpected failure for T5681(optllvm)
https://s3.amazonaws.com/archive.travis-ci.org/jobs/42557559/log.txt
Any ideas?
Is it possible that this is due the llvm version used? Do we support 3.4 in GHC HEAD?
Using LLVM tools llc : /usr/local/clang-3.4/bin/llc opt : /usr/local/clang-3.4/bin/opt
(http://smart-cactus.org/~ben/posts/2014-11-28-state-of-llvm-backend.html does not talk about GHC HEAD explicitly. Should I look at the 7.10 row? Does that mean that 3.4 is not supported? Shouldn’t the build system, or at least the compiler, fail harder and more helpfully in this case?)
LLVM 3.4 appears to have an unfortunate behavior whereby it will lose track of which section symbols with Internal linkage belong. I haven't had a chance to delve into this too deeply, however given that both 3.3 and 3.5 behave as expected I'm pretty sure this a bug. There are a few options here, a. Mark the `$def` symbols as ExternallyVisible, working around the issue at the expense of exposing internal symbols to the outside world. b. Mark LLVM 3.4 as unsupported At the moment I'm leaning towards (b) since I haven't had a chance to think through the implications of (a); if nothing else I suspect this wouldn't help the DLL symbol table size issues on Windows. Giving up on LLVM 3.4 might be unfortunate for a good number of users, however. Ultimately this underlines the need to package LLVM with GHC. Cheers, - Ben

I've made progress, but still need some help. It turns out that a monadic combinator (that I wrote) is mostly responsible:
zipWithAndUnzipM :: Monad m => (a -> b -> m (c, d)) -> [a] -> [b] -> m ([c], [d]) zipWithAndUnzipM f (x:xs) (y:ys) = do { (c, d) <- f x y ; (cs, ds) <- zipWithAndUnzipM f xs ys ; return (c:cs, d:ds) } zipWithAndUnzipM _ _ _ = return ([], [])
Using this combinator instead of writing the algorithm directly cost me 30% allocation overhead!
Can anyone tell me: why? Have I made some horrible mistake in the implementation?
And, relatedly: how can I fix this? I want to learn from this experience how to avoid this problem next time...
Unfortunately, my commit causes 50% overhead, not 30%, so I'm not out of the woods yet. Hopefully, another 20% of good news tomorrow.
Thanks!
Richard
On Dec 15, 2014, at 11:33 AM, Ben Gamari
Joachim Breitner
writes: Hi,
Am Montag, den 15.12.2014, 10:58 -0500 schrieb Ben Gamari:
- Travis has not picked up on these errors.
unfortunately, travis is slighly less useful since a few weeks due to T5681 failing (possibly due to the use of LLVM-3.4), but I’m still waiting for an reply on that issue.
You aren't looking for a response from me on this, are you? I just checked and I don't seem to have any outstanding messages from you but it's entirely possible I overlooked something.
this is independent of our arm issues, and I think a tad older; I did not direct it to anyone specific.
But I guess you are likely a person that can tell what’s wrong here:
Am Sonntag, den 30.11.2014, 20:01 +0100 schrieb Joachim Breitner:
Compile failed (status 256) errors were: /tmp/ghc16123_0/ghc16123_5.s: Assembler messages:
/tmp/ghc16123_0/ghc16123_5.s:26:0: Error: can't resolve `.rodata' {.rodata section} - `Main_zdwwork_info$def' {.text section}
/tmp/ghc16123_0/ghc16123_5.s:46:0: Error: can't resolve `.rodata' {.rodata section} - `Main_work_info$def' {.text section}
/tmp/ghc16123_0/ghc16123_5.s:66:0: Error: can't resolve `.rodata' {.rodata section} - `Main_main1_info$def' {.text section}
/tmp/ghc16123_0/ghc16123_5.s:86:0: Error: can't resolve `.rodata' {.rodata section} - `Main_main_info$def' {.text section}
/tmp/ghc16123_0/ghc16123_5.s:106:0: Error: can't resolve `.rodata' {.rodata section} - `Main_main2_info$def' {.text section}
/tmp/ghc16123_0/ghc16123_5.s:126:0: Error: can't resolve `.rodata' {.rodata section} - `ZCMain_main_info$def' {.text section}
*** unexpected failure for T5681(optllvm)
https://s3.amazonaws.com/archive.travis-ci.org/jobs/42557559/log.txt
Any ideas?
Is it possible that this is due the llvm version used? Do we support 3.4 in GHC HEAD?
Using LLVM tools llc : /usr/local/clang-3.4/bin/llc opt : /usr/local/clang-3.4/bin/opt
(http://smart-cactus.org/~ben/posts/2014-11-28-state-of-llvm-backend.html does not talk about GHC HEAD explicitly. Should I look at the 7.10 row? Does that mean that 3.4 is not supported? Shouldn’t the build system, or at least the compiler, fail harder and more helpfully in this case?)
LLVM 3.4 appears to have an unfortunate behavior whereby it will lose track of which section symbols with Internal linkage belong. I haven't had a chance to delve into this too deeply, however given that both 3.3 and 3.5 behave as expected I'm pretty sure this a bug. There are a few options here,
a. Mark the `$def` symbols as ExternallyVisible, working around the issue at the expense of exposing internal symbols to the outside world.
b. Mark LLVM 3.4 as unsupported
At the moment I'm leaning towards (b) since I haven't had a chance to think through the implications of (a); if nothing else I suspect this wouldn't help the DLL symbol table size issues on Windows. Giving up on LLVM 3.4 might be unfortunate for a good number of users, however.
Ultimately this underlines the need to package LLVM with GHC.
Cheers,
- Ben
_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://www.haskell.org/mailman/listinfo/ghc-devs

Using this combinator instead of writing the algorithm directly cost me 30% allocation overhead! What does your algorithm look like when you write it directly? Something like this:
flatten_many fmode roles tys = unzip `liftM` mapM go (zip roles tys) where go (Nominal,ty) = flatten_one (fmode { fe_eq_rel = NomEq }) ty go (Representational,ty) = flatten_one (fmode { fe_eq_rel = ReprEq }) ty go (Phantom, ty) = -- See Note [Phantoms in the flattener] return (ty, mkTcPhantomCo ty ty) ? Maybe this has something to do with `zipWithAndUnzipM` not being tail-recursive vs. direct version being able to fuse intermediate lists? Janek

On Dec 16, 2014, at 2:59 AM, Jan Stolarek
What does your algorithm look like when you write it directly? Something like this:
flatten_many fmode roles tys = unzip `liftM` mapM go (zip roles tys) where go (Nominal,ty) = flatten_one (fmode { fe_eq_rel = NomEq }) ty go (Representational,ty) = flatten_one (fmode { fe_eq_rel = ReprEq }) ty go (Phantom, ty) = -- See Note [Phantoms in the flattener] return (ty, mkTcPhantomCo ty ty)
?
Maybe this has something to do with `zipWithAndUnzipM` not being tail-recursive vs. direct version being able to fuse intermediate lists?
My direct version is even uglier:
flatten_many fmode roles tys = go roles tys where go (Nominal:rs) (ty:tys) = do { (xi, co) <- flatten_one (setFEEqRel fmode NomEq) ty ; (xis, cos) <- go rs tys ; return (xi:xis, co:cos) } go (Representational:rs) (ty:tys) = do { (xi, co) <- flatten_one (setFEEqRel fmode ReprEq) ty ; (xis, cos) <- go rs tys ; return (xi:xis, co:cos) } go (Phantom:rs) (ty:tys) = do { (xis, cos) <- go rs tys ; -- See Note [Phantoms in the flattener] return (ty:xis, mkTcPhantomCo ty ty:cos) } go _ _ = return ([], [])
I could refactor to make it better, but I would be worried that the version you wrote would suffer from the same problems as zipWithAndUnzipM. Will check to see, though.
On Dec 16, 2014, at 4:01 AM, Joachim Breitner
another guess (without looking at the code, sorry): Are they in the same module? I.e., can GHC specialize the code to your particular Monad?
No, they're not in the same module. I could also try moving the zipWithAndUnzipM function to the same module, and even specializing it by hand to the right monad. Could that be preventing the fusing?
On Dec 16, 2014, at 8:49 AM, Simon Peyton Jones
That seems surprising. I'd build a profiled compiler (GhcProfiled=YES) and see what it says.
If it increases allocation by 30% overall, there must be a LOT of calls to this function. Should there be so many?
I've been working from a profiled compiler. That's how I found that this function was the culprit -- it certainly wasn't my first guess! Yes, there are A LOT of calls: 7,106,808 to be exact. (The test case is perf/compiler/T9872a; the function is flatten_many). The number of calls doesn't vary from before my commit, though, so the raw number isn't the problem -- it's the allocation. I'll try turning some of these knobs to see where the difference is. Thanks, Richard

Hi, Am Dienstag, den 16.12.2014, 09:59 -0500 schrieb Richard Eisenberg:
On Dec 16, 2014, at 4:01 AM, Joachim Breitner
wrote: another guess (without looking at the code, sorry): Are they in the same module? I.e., can GHC specialize the code to your particular Monad?
No, they're not in the same module. I could also try moving the zipWithAndUnzipM function to the same module, and even specializing it by hand to the right monad.
I did mean zipWithAndUnzipM, so maybe yes: Try that. (I find it hard to believe that any polymorphic monadic code should perform well, with those many calls to an unknown (>>=) with a function parameter, but maybe I’m too pessimistic here.)
Could that be preventing the fusing?
There is not going to be any fusing here, at least not list fusion; that would require your code to be written in terms of functions with fusion rules. Greetings, Joachim -- Joachim “nomeata” Breitner mail@joachim-breitner.de • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org

I've learned several very interesting things in this analysis.
- Inlining polymorphic methods is very important. Here are some data points to back up that claim:
* Original implementation using zipWithAndUnzipM: 8,472,613,440 bytes allocated in the heap
* Adding {-# INLINE #-} to the definition thereof: 6,639,253,488 bytes allocated in the heap
* Using `inline` at call site to force inlining: 6,281,539,792 bytes allocated in the heap
The middle step above allowed GHC to specialize zipWithAndUnzipM to my particular monad, but GHC didn't see fit to actually inline the function. Using `inline` forced it, to good effect. (I did not collect data on code sizes, but it wouldn't be hard to.)
By comparison:
* Hand-written recursion: 6,587,809,112 bytes allocated in the heap
Interestingly, this is *not* the best result!
Conclusion: We should probably add INLINE pragmas to Util and MonadUtils.
- I then looked at rejiggering the algorithm to keep the common case fast. This had a side effect of changing the zipWithAndUnzipM to mapAndUnzipM, from Control.Monad. To my surprise, this brought disaster!
* Using `inline` and mapAndUnzipM: 7,463,047,432 bytes allocated in the heap
* Hand-written recursion: 5,848,602,848 bytes allocated in the heap
That last number is better than the numbers above because of the algorithm streamlining. But, the inadequacy of mapAndUnzipM surprised me -- it already has an INLINE pragma in Control.Monad of course. Looking at -ddump-simpl, it seems that mapAndUnzipM was indeed getting inlined, but a call to `map` remained, perhaps causing extra allocation.
Conclusion: We should examine the implementation of mapAndUnzipM (and similar functions) in Control.Monad. Is it as fast as possible?
In the end, I was unable to bring the allocation numbers down to where they were before my work. This is because the flattener now deals in roles. Most of its behavior is the same between nominal and representational roles, so it seems silly (though very possible) to specialize the code to nominal to keep that path fast. Instead, I identified one key spot and made that go fast.
Thus, there is a 7% bump to memory usage on very-type-family-heavy code, compared to before my commit on Friday. (On more ordinary code, there is no noticeable change.)
Validating my patch locally now; will push when that's done.
Thanks,
Richard
On Dec 16, 2014, at 10:41 AM, Joachim Breitner
Hi,
Am Dienstag, den 16.12.2014, 09:59 -0500 schrieb Richard Eisenberg:
On Dec 16, 2014, at 4:01 AM, Joachim Breitner
wrote: another guess (without looking at the code, sorry): Are they in the same module? I.e., can GHC specialize the code to your particular Monad?
No, they're not in the same module. I could also try moving the zipWithAndUnzipM function to the same module, and even specializing it by hand to the right monad.
I did mean zipWithAndUnzipM, so maybe yes: Try that.
(I find it hard to believe that any polymorphic monadic code should perform well, with those many calls to an unknown (>>=) with a function parameter, but maybe I’m too pessimistic here.)
Could that be preventing the fusing?
There is not going to be any fusing here, at least not list fusion; that would require your code to be written in terms of functions with fusion rules.
Greetings, Joachim
-- Joachim “nomeata” Breitner mail@joachim-breitner.de • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org
_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://www.haskell.org/mailman/listinfo/ghc-devs

Would specialize pragmas also have these performance improvements, or is
inline needed?
On Tue, Dec 16, 2014 at 4:45 PM, Richard Eisenberg
I've learned several very interesting things in this analysis.
- Inlining polymorphic methods is very important. Here are some data points to back up that claim: * Original implementation using zipWithAndUnzipM: 8,472,613,440 bytes allocated in the heap * Adding {-# INLINE #-} to the definition thereof: 6,639,253,488 bytes allocated in the heap * Using `inline` at call site to force inlining: 6,281,539,792 bytes allocated in the heap
The middle step above allowed GHC to specialize zipWithAndUnzipM to my particular monad, but GHC didn't see fit to actually inline the function. Using `inline` forced it, to good effect. (I did not collect data on code sizes, but it wouldn't be hard to.)
By comparison: * Hand-written recursion: 6,587,809,112 bytes allocated in the heap Interestingly, this is *not* the best result!
Conclusion: We should probably add INLINE pragmas to Util and MonadUtils.
- I then looked at rejiggering the algorithm to keep the common case fast. This had a side effect of changing the zipWithAndUnzipM to mapAndUnzipM, from Control.Monad. To my surprise, this brought disaster! * Using `inline` and mapAndUnzipM: 7,463,047,432 bytes allocated in the heap * Hand-written recursion: 5,848,602,848 bytes allocated in the heap
That last number is better than the numbers above because of the algorithm streamlining. But, the inadequacy of mapAndUnzipM surprised me -- it already has an INLINE pragma in Control.Monad of course. Looking at -ddump-simpl, it seems that mapAndUnzipM was indeed getting inlined, but a call to `map` remained, perhaps causing extra allocation.
Conclusion: We should examine the implementation of mapAndUnzipM (and similar functions) in Control.Monad. Is it as fast as possible?
In the end, I was unable to bring the allocation numbers down to where they were before my work. This is because the flattener now deals in roles. Most of its behavior is the same between nominal and representational roles, so it seems silly (though very possible) to specialize the code to nominal to keep that path fast. Instead, I identified one key spot and made that go fast.
Thus, there is a 7% bump to memory usage on very-type-family-heavy code, compared to before my commit on Friday. (On more ordinary code, there is no noticeable change.)
Validating my patch locally now; will push when that's done.
Thanks, Richard
On Dec 16, 2014, at 10:41 AM, Joachim Breitner
wrote: Hi,
Am Dienstag, den 16.12.2014, 09:59 -0500 schrieb Richard Eisenberg:
On Dec 16, 2014, at 4:01 AM, Joachim Breitner
wrote: another guess (without looking at the code, sorry): Are they in the same module? I.e., can GHC specialize the code to your particular Monad?
No, they're not in the same module. I could also try moving the zipWithAndUnzipM function to the same module, and even specializing it by hand to the right monad.
I did mean zipWithAndUnzipM, so maybe yes: Try that.
(I find it hard to believe that any polymorphic monadic code should perform well, with those many calls to an unknown (>>=) with a function parameter, but maybe I’m too pessimistic here.)
Could that be preventing the fusing?
There is not going to be any fusing here, at least not list fusion; that would require your code to be written in terms of functions with fusion rules.
Greetings, Joachim
-- Joachim “nomeata” Breitner mail@joachim-breitner.de • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org
_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://www.haskell.org/mailman/listinfo/ghc-devs
_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://www.haskell.org/mailman/listinfo/ghc-devs

If you use INLINEABLE, that should make the function specialisable to a particular monad, even if it's in a different module. You shouldn't need INLINE for that.
I don't understand the difference between cases (2) and (3).
I am still suspicious of why there are so many calls to this one function that it, alone, is allocating a significant proportion of compilation of the entire run of GHC. Are you sure there isn't an algorithmic improvement to be had, to simply reduce the number of calls?
Simon
| -----Original Message-----
| From: ghc-devs [mailto:ghc-devs-bounces@haskell.org] On Behalf Of
| Richard Eisenberg
| Sent: 16 December 2014 21:46
| To: Joachim Breitner
| Cc: ghc-devs@haskell.org
| Subject: Re: performance regressions
|
| I've learned several very interesting things in this analysis.
|
| - Inlining polymorphic methods is very important. Here are some data
| points to back up that claim:
| * Original implementation using zipWithAndUnzipM: 8,472,613,440
| bytes allocated in the heap
| * Adding {-# INLINE #-} to the definition thereof: 6,639,253,488
| bytes allocated in the heap
| * Using `inline` at call site to force inlining: 6,281,539,792
| bytes allocated in the heap
|
| The middle step above allowed GHC to specialize zipWithAndUnzipM to my
| particular monad, but GHC didn't see fit to actually inline the
| function. Using `inline` forced it, to good effect. (I did not collect
| data on code sizes, but it wouldn't be hard to.)
|
| By comparison:
| * Hand-written recursion: 6,587,809,112 bytes allocated in the
| heap
| Interestingly, this is *not* the best result!
|
| Conclusion: We should probably add INLINE pragmas to Util and
| MonadUtils.
|
|
| - I then looked at rejiggering the algorithm to keep the common case
| fast. This had a side effect of changing the zipWithAndUnzipM to
| mapAndUnzipM, from Control.Monad. To my surprise, this brought
| disaster!
| * Using `inline` and mapAndUnzipM: 7,463,047,432 bytes
| allocated in the heap
| * Hand-written recursion: 5,848,602,848 bytes
| allocated in the heap
|
| That last number is better than the numbers above because of the
| algorithm streamlining. But, the inadequacy of mapAndUnzipM surprised
| me -- it already has an INLINE pragma in Control.Monad of course.
| Looking at -ddump-simpl, it seems that mapAndUnzipM was indeed getting
| inlined, but a call to `map` remained, perhaps causing extra
| allocation.
|
| Conclusion: We should examine the implementation of mapAndUnzipM (and
| similar functions) in Control.Monad. Is it as fast as possible?
|
|
|
| In the end, I was unable to bring the allocation numbers down to where
| they were before my work. This is because the flattener now deals in
| roles. Most of its behavior is the same between nominal and
| representational roles, so it seems silly (though very possible) to
| specialize the code to nominal to keep that path fast. Instead, I
| identified one key spot and made that go fast.
|
| Thus, there is a 7% bump to memory usage on very-type-family-heavy
| code, compared to before my commit on Friday. (On more ordinary code,
| there is no noticeable change.)
|
| Validating my patch locally now; will push when that's done.
|
| Thanks,
| Richard
|
| On Dec 16, 2014, at 10:41 AM, Joachim Breitner

By unsubstantiated guess is that INLINEABLE would have the same effect as INLINE here, as GHC doesn't see fit to actually inline the function, even with INLINE -- the big improvement seen between (1) and (2) is actually specialization, not inlining. The jump from (2) to (3) is actual inlining. Thus, it seems that GHC's heuristics for inlining aren't working out for the best here.
I've pushed my changes, though I agree with Simon that more research may uncover even more improvements here. I didn't focus on the number of calls because that number didn't regress. Will look into this soon.
Richard
On Dec 17, 2014, at 4:15 AM, Simon Peyton Jones
If you use INLINEABLE, that should make the function specialisable to a particular monad, even if it's in a different module. You shouldn't need INLINE for that.
I don't understand the difference between cases (2) and (3).
I am still suspicious of why there are so many calls to this one function that it, alone, is allocating a significant proportion of compilation of the entire run of GHC. Are you sure there isn't an algorithmic improvement to be had, to simply reduce the number of calls?
Simon
| -----Original Message----- | From: ghc-devs [mailto:ghc-devs-bounces@haskell.org] On Behalf Of | Richard Eisenberg | Sent: 16 December 2014 21:46 | To: Joachim Breitner | Cc: ghc-devs@haskell.org | Subject: Re: performance regressions | | I've learned several very interesting things in this analysis. | | - Inlining polymorphic methods is very important. Here are some data | points to back up that claim: | * Original implementation using zipWithAndUnzipM: 8,472,613,440 | bytes allocated in the heap | * Adding {-# INLINE #-} to the definition thereof: 6,639,253,488 | bytes allocated in the heap | * Using `inline` at call site to force inlining: 6,281,539,792 | bytes allocated in the heap | | The middle step above allowed GHC to specialize zipWithAndUnzipM to my | particular monad, but GHC didn't see fit to actually inline the | function. Using `inline` forced it, to good effect. (I did not collect | data on code sizes, but it wouldn't be hard to.) | | By comparison: | * Hand-written recursion: 6,587,809,112 bytes allocated in the | heap | Interestingly, this is *not* the best result! | | Conclusion: We should probably add INLINE pragmas to Util and | MonadUtils. | | | - I then looked at rejiggering the algorithm to keep the common case | fast. This had a side effect of changing the zipWithAndUnzipM to | mapAndUnzipM, from Control.Monad. To my surprise, this brought | disaster! | * Using `inline` and mapAndUnzipM: 7,463,047,432 bytes | allocated in the heap | * Hand-written recursion: 5,848,602,848 bytes | allocated in the heap | | That last number is better than the numbers above because of the | algorithm streamlining. But, the inadequacy of mapAndUnzipM surprised | me -- it already has an INLINE pragma in Control.Monad of course. | Looking at -ddump-simpl, it seems that mapAndUnzipM was indeed getting | inlined, but a call to `map` remained, perhaps causing extra | allocation. | | Conclusion: We should examine the implementation of mapAndUnzipM (and | similar functions) in Control.Monad. Is it as fast as possible? | | | | In the end, I was unable to bring the allocation numbers down to where | they were before my work. This is because the flattener now deals in | roles. Most of its behavior is the same between nominal and | representational roles, so it seems silly (though very possible) to | specialize the code to nominal to keep that path fast. Instead, I | identified one key spot and made that go fast. | | Thus, there is a 7% bump to memory usage on very-type-family-heavy | code, compared to before my commit on Friday. (On more ordinary code, | there is no noticeable change.) | | Validating my patch locally now; will push when that's done. | | Thanks, | Richard | | On Dec 16, 2014, at 10:41 AM, Joachim Breitner
wrote: | | > Hi, | > | > | > Am Dienstag, den 16.12.2014, 09:59 -0500 schrieb Richard Eisenberg: | >> On Dec 16, 2014, at 4:01 AM, Joachim Breitner wrote: | >> | >>> another guess (without looking at the code, sorry): Are they in | the | >>> same module? I.e., can GHC specialize the code to your particular | Monad? | > | >> No, they're not in the same module. I could also try moving the | >> zipWithAndUnzipM function to the same module, and even specializing | >> it by hand to the right monad. | > | > I did mean zipWithAndUnzipM, so maybe yes: Try that. | > | > (I find it hard to believe that any polymorphic monadic code should | > perform well, with those many calls to an unknown (>>=) with a | > function parameter, but maybe I'm too pessimistic here.) | > | > | >> Could that be preventing the fusing? | > | > There is not going to be any fusing here, at least not list fusion; | > that would require your code to be written in terms of functions | with | > fusion rules. | > | > Greetings, | > Joachim | > | > -- | > Joachim "nomeata" Breitner | > mail@joachim-breitner.de * http://www.joachim-breitner.de/ | > Jabber: nomeata@joachim-breitner.de * GPG-Key: 0xF0FBF51F Debian | > Developer: nomeata@debian.org | > | > _______________________________________________ | > ghc-devs mailing list | > ghc-devs@haskell.org | > http://www.haskell.org/mailman/listinfo/ghc-devs | | _______________________________________________ | ghc-devs mailing list | ghc-devs@haskell.org | http://www.haskell.org/mailman/listinfo/ghc-devs

I still would like to understand why INLINE does not make it inline. That's weird.
Eg way to reproduce.
Simion
| -----Original Message-----
| From: Richard Eisenberg [mailto:eir@cis.upenn.edu]
| Sent: 17 December 2014 15:56
| To: Simon Peyton Jones
| Cc: Joachim Breitner; ghc-devs@haskell.org
| Subject: Re: performance regressions
|
| By unsubstantiated guess is that INLINEABLE would have the same effect
| as INLINE here, as GHC doesn't see fit to actually inline the
| function, even with INLINE -- the big improvement seen between (1) and
| (2) is actually specialization, not inlining. The jump from (2) to (3)
| is actual inlining. Thus, it seems that GHC's heuristics for inlining
| aren't working out for the best here.
|
| I've pushed my changes, though I agree with Simon that more research
| may uncover even more improvements here. I didn't focus on the number
| of calls because that number didn't regress. Will look into this soon.
|
| Richard
|
| On Dec 17, 2014, at 4:15 AM, Simon Peyton Jones
|

Is it possible INLINE didn't inline the function because it's recursive? If
it were my function, I'd probably try a manual worker /wrapper.
On 07:59, Wed, Dec 17, 2014 Simon Peyton Jones
I still would like to understand why INLINE does not make it inline. That's weird.
Eg way to reproduce.
Simion
| -----Original Message----- | From: Richard Eisenberg [mailto:eir@cis.upenn.edu] | Sent: 17 December 2014 15:56 | To: Simon Peyton Jones | Cc: Joachim Breitner; ghc-devs@haskell.org | Subject: Re: performance regressions | | By unsubstantiated guess is that INLINEABLE would have the same effect | as INLINE here, as GHC doesn't see fit to actually inline the | function, even with INLINE -- the big improvement seen between (1) and | (2) is actually specialization, not inlining. The jump from (2) to (3) | is actual inlining. Thus, it seems that GHC's heuristics for inlining | aren't working out for the best here. | | I've pushed my changes, though I agree with Simon that more research | may uncover even more improvements here. I didn't focus on the number | of calls because that number didn't regress. Will look into this soon. | | Richard | | On Dec 17, 2014, at 4:15 AM, Simon Peyton Jones |
wrote: | | > If you use INLINEABLE, that should make the function specialisable | to a particular monad, even if it's in a different module. You | shouldn't need INLINE for that. | > | > I don't understand the difference between cases (2) and (3). | > | > I am still suspicious of why there are so many calls to this one | function that it, alone, is allocating a significant proportion of | compilation of the entire run of GHC. Are you sure there isn't an | algorithmic improvement to be had, to simply reduce the number of | calls? | > | > Simon | > | > | -----Original Message----- | > | From: ghc-devs [mailto:ghc-devs-bounces@haskell.org] On Behalf Of | > | Richard Eisenberg | > | Sent: 16 December 2014 21:46 | > | To: Joachim Breitner | > | Cc: ghc-devs@haskell.org | > | Subject: Re: performance regressions | > | | > | I've learned several very interesting things in this analysis. | > | | > | - Inlining polymorphic methods is very important. Here are some | > | data points to back up that claim: | > | * Original implementation using zipWithAndUnzipM: | 8,472,613,440 | > | bytes allocated in the heap | > | * Adding {-# INLINE #-} to the definition thereof: | 6,639,253,488 | > | bytes allocated in the heap | > | * Using `inline` at call site to force inlining: | 6,281,539,792 | > | bytes allocated in the heap | > | | > | The middle step above allowed GHC to specialize zipWithAndUnzipM | to | > | my particular monad, but GHC didn't see fit to actually inline | the | > | function. Using `inline` forced it, to good effect. (I did not | > | collect data on code sizes, but it wouldn't be hard to.) | > | | > | By comparison: | > | * Hand-written recursion: 6,587,809,112 bytes allocated in | the | > | heap | > | Interestingly, this is *not* the best result! | > | | > | Conclusion: We should probably add INLINE pragmas to Util and | > | MonadUtils. | > | | > | | > | - I then looked at rejiggering the algorithm to keep the common | > | case fast. This had a side effect of changing the | zipWithAndUnzipM | > | to mapAndUnzipM, from Control.Monad. To my surprise, this brought | > | disaster! | > | * Using `inline` and mapAndUnzipM: 7,463,047,432 bytes | > | allocated in the heap | > | * Hand-written recursion: 5,848,602,848 bytes | > | allocated in the heap | > | | > | That last number is better than the numbers above because of the | > | algorithm streamlining. But, the inadequacy of mapAndUnzipM | > | surprised me -- it already has an INLINE pragma in Control.Monad | of course. | > | Looking at -ddump-simpl, it seems that mapAndUnzipM was indeed | > | getting inlined, but a call to `map` remained, perhaps causing | > | extra allocation. | > | | > | Conclusion: We should examine the implementation of mapAndUnzipM | > | (and similar functions) in Control.Monad. Is it as fast as | possible? | > | | > | | > | | > | In the end, I was unable to bring the allocation numbers down to | > | where they were before my work. This is because the flattener now | > | deals in roles. Most of its behavior is the same between nominal | > | and representational roles, so it seems silly (though very | > | possible) to specialize the code to nominal to keep that path | fast. | > | Instead, I identified one key spot and made that go fast. | > | | > | Thus, there is a 7% bump to memory usage on very-type-family- | heavy | > | code, compared to before my commit on Friday. (On more ordinary | > | code, there is no noticeable change.) | > | | > | Validating my patch locally now; will push when that's done. | > | | > | Thanks, | > | Richard | > | | > | On Dec 16, 2014, at 10:41 AM, Joachim Breitner | breitner.de> wrote: | > | | > | > Hi, | > | > | > | > | > | > Am Dienstag, den 16.12.2014, 09:59 -0500 schrieb Richard | Eisenberg: | > | >> On Dec 16, 2014, at 4:01 AM, Joachim Breitner | breitner.de> wrote: | > | >> | > | >>> another guess (without looking at the code, sorry): Are they | in | > | the >>> same module? I.e., can GHC specialize the code to your | > | particular Monad? | > | > | > | >> No, they're not in the same module. I could also try moving | the | > | >> zipWithAndUnzipM function to the same module, and even | > | specializing >> it by hand to the right monad. | > | > | > | > I did mean zipWithAndUnzipM, so maybe yes: Try that. | > | > | > | > (I find it hard to believe that any polymorphic monadic code | > | should > perform well, with those many calls to an unknown (>>=) | > | with a > function parameter, but maybe I'm too pessimistic here.) | > | > > >> Could that be preventing the fusing? | > | > | > | > There is not going to be any fusing here, at least not list | > | fusion; > that would require your code to be written in terms of | > | functions with > fusion rules. | > | > | > | > Greetings, | > | > Joachim | > | > | > | > -- | > | > Joachim "nomeata" Breitner | > | > mail@joachim-breitner.de * http://www.joachim-breitner.de/ > | > | Jabber: nomeata@joachim-breitner.de * GPG-Key: 0xF0FBF51F Debian | > | > Developer: nomeata@debian.org > > | > | _______________________________________________ | > | > ghc-devs mailing list | > | > ghc-devs@haskell.org | > | > http://www.haskell.org/mailman/listinfo/ghc-devs | > | | > | _______________________________________________ | > | ghc-devs mailing list | > | ghc-devs@haskell.org | > | http://www.haskell.org/mailman/listinfo/ghc-devs | > _______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://www.haskell.org/mailman/listinfo/ghc-devs

I've gained a little more insight here, but only a little.
INLINE doesn't work because zipWithAndUnzipM is recursive. Oddly, using worker/wrapper made allocations go *up*, so that's somehow not the answer.
When I use `inline`, GHC will inline the function once; recursive calls remain. This seems to the local minimum in allocations.
Using `inline` gives roughly a 7% allocation level change on this pathological case.
To reproduce, just compile HEAD's TcFlatten with and without the call to `inline`. You can see, using -ddump-simpl, what GHC is doing. Then, you can check the allocation numbers by compiling perf/compiler/T9872a.hs. My tests were all using the default build settings, with an unmodified build.mk, as that seemed to be the most performant setting.
In other news, a slight change to the algorithm (see #9872, comment 23) makes a 50% improvement. Will hopefully commit tonight, after a full local validation run.
Richard
On Dec 17, 2014, at 10:59 AM, Simon Peyton Jones
I still would like to understand why INLINE does not make it inline. That's weird.
Eg way to reproduce.
Simion
| -----Original Message----- | From: Richard Eisenberg [mailto:eir@cis.upenn.edu] | Sent: 17 December 2014 15:56 | To: Simon Peyton Jones | Cc: Joachim Breitner; ghc-devs@haskell.org | Subject: Re: performance regressions | | By unsubstantiated guess is that INLINEABLE would have the same effect | as INLINE here, as GHC doesn't see fit to actually inline the | function, even with INLINE -- the big improvement seen between (1) and | (2) is actually specialization, not inlining. The jump from (2) to (3) | is actual inlining. Thus, it seems that GHC's heuristics for inlining | aren't working out for the best here. | | I've pushed my changes, though I agree with Simon that more research | may uncover even more improvements here. I didn't focus on the number | of calls because that number didn't regress. Will look into this soon. | | Richard | | On Dec 17, 2014, at 4:15 AM, Simon Peyton Jones |
wrote: | | > If you use INLINEABLE, that should make the function specialisable | to a particular monad, even if it's in a different module. You | shouldn't need INLINE for that. | > | > I don't understand the difference between cases (2) and (3). | > | > I am still suspicious of why there are so many calls to this one | function that it, alone, is allocating a significant proportion of | compilation of the entire run of GHC. Are you sure there isn't an | algorithmic improvement to be had, to simply reduce the number of | calls? | > | > Simon | > | > | -----Original Message----- | > | From: ghc-devs [mailto:ghc-devs-bounces@haskell.org] On Behalf Of | > | Richard Eisenberg | > | Sent: 16 December 2014 21:46 | > | To: Joachim Breitner | > | Cc: ghc-devs@haskell.org | > | Subject: Re: performance regressions | > | | > | I've learned several very interesting things in this analysis. | > | | > | - Inlining polymorphic methods is very important. Here are some | > | data points to back up that claim: | > | * Original implementation using zipWithAndUnzipM: | 8,472,613,440 | > | bytes allocated in the heap | > | * Adding {-# INLINE #-} to the definition thereof: | 6,639,253,488 | > | bytes allocated in the heap | > | * Using `inline` at call site to force inlining: | 6,281,539,792 | > | bytes allocated in the heap | > | | > | The middle step above allowed GHC to specialize zipWithAndUnzipM | to | > | my particular monad, but GHC didn't see fit to actually inline | the | > | function. Using `inline` forced it, to good effect. (I did not | > | collect data on code sizes, but it wouldn't be hard to.) | > | | > | By comparison: | > | * Hand-written recursion: 6,587,809,112 bytes allocated in | the | > | heap | > | Interestingly, this is *not* the best result! | > | | > | Conclusion: We should probably add INLINE pragmas to Util and | > | MonadUtils. | > | | > | | > | - I then looked at rejiggering the algorithm to keep the common | > | case fast. This had a side effect of changing the | zipWithAndUnzipM | > | to mapAndUnzipM, from Control.Monad. To my surprise, this brought | > | disaster! | > | * Using `inline` and mapAndUnzipM: 7,463,047,432 bytes | > | allocated in the heap | > | * Hand-written recursion: 5,848,602,848 bytes | > | allocated in the heap | > | | > | That last number is better than the numbers above because of the | > | algorithm streamlining. But, the inadequacy of mapAndUnzipM | > | surprised me -- it already has an INLINE pragma in Control.Monad | of course. | > | Looking at -ddump-simpl, it seems that mapAndUnzipM was indeed | > | getting inlined, but a call to `map` remained, perhaps causing | > | extra allocation. | > | | > | Conclusion: We should examine the implementation of mapAndUnzipM | > | (and similar functions) in Control.Monad. Is it as fast as | possible? | > | | > | | > | | > | In the end, I was unable to bring the allocation numbers down to | > | where they were before my work. This is because the flattener now | > | deals in roles. Most of its behavior is the same between nominal | > | and representational roles, so it seems silly (though very | > | possible) to specialize the code to nominal to keep that path | fast. | > | Instead, I identified one key spot and made that go fast. | > | | > | Thus, there is a 7% bump to memory usage on very-type-family- | heavy | > | code, compared to before my commit on Friday. (On more ordinary | > | code, there is no noticeable change.) | > | | > | Validating my patch locally now; will push when that's done. | > | | > | Thanks, | > | Richard | > | | > | On Dec 16, 2014, at 10:41 AM, Joachim Breitner | breitner.de> wrote: | > | | > | > Hi, | > | > | > | > | > | > Am Dienstag, den 16.12.2014, 09:59 -0500 schrieb Richard | Eisenberg: | > | >> On Dec 16, 2014, at 4:01 AM, Joachim Breitner | breitner.de> wrote: | > | >> | > | >>> another guess (without looking at the code, sorry): Are they | in | > | the >>> same module? I.e., can GHC specialize the code to your | > | particular Monad? | > | > | > | >> No, they're not in the same module. I could also try moving | the | > | >> zipWithAndUnzipM function to the same module, and even | > | specializing >> it by hand to the right monad. | > | > | > | > I did mean zipWithAndUnzipM, so maybe yes: Try that. | > | > | > | > (I find it hard to believe that any polymorphic monadic code | > | should > perform well, with those many calls to an unknown (>>=) | > | with a > function parameter, but maybe I'm too pessimistic here.) | > | > > >> Could that be preventing the fusing? | > | > | > | > There is not going to be any fusing here, at least not list | > | fusion; > that would require your code to be written in terms of | > | functions with > fusion rules. | > | > | > | > Greetings, | > | > Joachim | > | > | > | > -- | > | > Joachim "nomeata" Breitner | > | > mail@joachim-breitner.de * http://www.joachim-breitner.de/ > | > | Jabber: nomeata@joachim-breitner.de * GPG-Key: 0xF0FBF51F Debian | > | > Developer: nomeata@debian.org > > | > | _______________________________________________ | > | > ghc-devs mailing list | > | > ghc-devs@haskell.org | > | > http://www.haskell.org/mailman/listinfo/ghc-devs | > | | > | _______________________________________________ | > | ghc-devs mailing list | > | ghc-devs@haskell.org | > | http://www.haskell.org/mailman/listinfo/ghc-devs | >

On 2014-12-16 at 22:45:36 +0100, Richard Eisenberg wrote:
I've learned several very interesting things in this analysis.
- Inlining polymorphic methods is very important.
otoh, there are cases where marking methods INLINE have catastrophic effects; the following https://github.com/kolmodin/binary/commit/48c02966512a67b018fcdf093fab8d34bd... was necessary a few months ago, as otherwise GHC HEAD's compile-time memory usage would explode: https://ghc.haskell.org/trac/ghc/ticket/9630

Hi, Am Montag, den 15.12.2014, 23:48 -0500 schrieb Richard Eisenberg:
Can anyone tell me: why? Have I made some horrible mistake in the implementation? And, relatedly: how can I fix this? I want to learn from this experience how to avoid this problem next time...
another guess (without looking at the code, sorry): Are they in the same module? I.e., can GHC specialize the code to your particular Monad? Greetings, Joachim -- Joachim Breitner e-Mail: mail@joachim-breitner.de Homepage: http://www.joachim-breitner.de Jabber-ID: nomeata@joachim-breitner.de

| Using this combinator instead of writing the algorithm directly cost
| me 30% allocation overhead!
That seems surprising. I'd build a profiled compiler (GhcProfiled=YES) and see what it says.
If it increases allocation by 30% overall, there must be a LOT of calls to this function. Should there be so many?
S
| -----Original Message-----
| From: ghc-devs [mailto:ghc-devs-bounces@haskell.org] On Behalf Of
| Richard Eisenberg
| Sent: 16 December 2014 04:49
| To: Ben Gamari
| Cc: Joachim Breitner; ghc-devs@haskell.org
| Subject: Re: performance regressions
|
| I've made progress, but still need some help.
|
| It turns out that a monadic combinator (that I wrote) is mostly
| responsible:
|
| > zipWithAndUnzipM :: Monad m
| > => (a -> b -> m (c, d)) -> [a] -> [b] -> m ([c],
| [d])
| > zipWithAndUnzipM f (x:xs) (y:ys)
| > = do { (c, d) <- f x y
| > ; (cs, ds) <- zipWithAndUnzipM f xs ys
| > ; return (c:cs, d:ds) }
| > zipWithAndUnzipM _ _ _ = return ([], [])
| >
|
| Using this combinator instead of writing the algorithm directly cost
| me 30% allocation overhead!
|
| Can anyone tell me: why? Have I made some horrible mistake in the
| implementation?
| And, relatedly: how can I fix this? I want to learn from this
| experience how to avoid this problem next time...
|
| Unfortunately, my commit causes 50% overhead, not 30%, so I'm not out
| of the woods yet. Hopefully, another 20% of good news tomorrow.
|
| Thanks!
| Richard
|
| On Dec 15, 2014, at 11:33 AM, Ben Gamari
participants (8)
-
Ben Gamari
-
Carter Schonwald
-
Herbert Valerio Riedel
-
Jan Stolarek
-
Joachim Breitner
-
John Lato
-
Richard Eisenberg
-
Simon Peyton Jones