cross module optimization issues

Hello, I have a problem with a package I'm working on, and I don't have any idea how to sort it out. One part of my package is in one monolithic module, without an export list, which works fine. However, when I started to separate certain functions out into another module and added an export list to one of the modules, performance decreased dramatically. The memory behavior (as shown by -hT) is also quite different, with substantial memory usage by "FUN_2_0". Are there any suggestions as to how I could improve this? Thanks, John

Are you compiling with aggressive cross-module optimisations on (e.g. -O2)? You may have to add explicit inlining pragmas (check the Core output), to ensure key functions are exported in their entirety. -- Don
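A minimal sketch of the kind of pragma Don means, assuming a hypothetical helper `sumSquares` that is not taken from John's package. In a real library the function would sit in its own module behind an export list; it is placed in `Main` here only so that the file compiles and runs by itself. The INLINE pragma asks GHC to keep the whole definition in the interface file so that importing modules can inline it, and compiling with `ghc -O2 -ddump-simpl` prints the Core, which is the way to check what actually happened.

```haskell
module Main (main) where

-- Hypothetical hot-path helper (an assumption for illustration only).
-- INLINE asks GHC to expose the full unfolding to importing modules.
{-# INLINE sumSquares #-}
sumSquares :: [Int] -> Int
sumSquares = go 0
  where
    -- Strict accumulator loop; 'seq' forces the running total at each step.
    go acc []       = acc
    go acc (x : xs) = let acc' = acc + x * x in acc' `seq` go acc' xs

main :: IO ()
main = print (sumSquares [1, 2, 3])
```

Whether the pragma helps in a given program is something only the Core output and a benchmark can confirm.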

On Sat, Nov 15, 2008 at 10:09 PM, Don Stewart wrote:
> Are you compiling with aggressive cross-module optimisations on (e.g. -O2)? You may have to add explicit inlining pragmas (check the Core output), to ensure key functions are exported in their entirety.
Thanks for the reply. I'm compiling with -O2 -Wall. After looking at the Core output, I think I've found the key difference. A function that is bound in a "where" statement is different between the monolithic and split sources. I have no idea why, though. I'll experiment with a few different things to see if I can get this resolved. John

| I'm compiling with -O2 -Wall. After looking at the Core output, I
| think I've found the key difference. A function that is bound in a
| "where" statement is different between the monolithic and split
| sources. I have no idea why, though. I'll experiment with a few
| different things to see if I can get this resolved.
In general, splitting code across modules should not make programs less efficient -- as Don says, GHC does quite aggressive cross-module inlining.
There is one exception, though. If a non-exported non-recursive function is called exactly once, then it is inlined *regardless of size*, because doing so does not cause code duplication. But if it's exported and is large, then its inlining is not exposed -- and even if it were it might not be inlined, because doing so duplicates its code an unknown number of times. You can change the threshold for (a) exposing and (b) using an inlining, with flags -funfolding-creation-threshold and -funfolding-use-threshold respectively.
If you find there's something else going on then I'm all ears.
Simon
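The exception Simon describes can be sketched roughly as follows. The functions `normalise` and `process` are hypothetical, the threshold values are arbitrary, and the `=` spelling of the two flags in the OPTIONS_GHC pragma is assumed from recent GHC user guides (older compilers may expect a different spelling). The module is called `Main` only so the sketch runs by itself; the export-list reasoning applies to a library module.

```haskell
{-# OPTIONS_GHC -funfolding-creation-threshold=200 -funfolding-use-threshold=100 #-}
module Main (main) where

-- Hypothetical helper: not exported and used at exactly one site ('process'),
-- so GHC may inline it there whatever its size, with no code duplication.
normalise :: Int -> Int
normalise x = (x * 31 + 7) `mod` 1000

-- In a library module this would be the exported entry point. Whether its
-- unfolding is exposed to importers depends on its size and on the
-- creation threshold set in the pragma above.
process :: [Int] -> [Int]
process = map normalise

main :: IO ()
main = print (process [1, 2, 3])
```

If `normalise` were added to an export list, GHC could no longer assume a single use site, which is the situation in which the thresholds start to matter.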

On Wed, Nov 19, 2008 at 4:17 PM, Simon Peyton-Jones wrote:
> In general, splitting code across modules should not make programs less efficient -- as Don says, GHC does quite aggressive cross-module inlining.
> There is one exception, though. If a non-exported non-recursive function is called exactly once, then it is inlined *regardless of size*, because doing so does not cause code duplication. But if it's exported and is large, then its inlining is not exposed -- and even if it were it might not be inlined, because doing so duplicates its code an unknown number of times. You can change the threshold for (a) exposing and (b) using an inlining, with flags -funfolding-creation-threshold and -funfolding-use-threshold respectively.
> If you find there's something else going on then I'm all ears.
> Simon
I did finally find the changes that make a difference. I think it's safe to say that I have no idea what's actually going on, so I'll just report my results and let others try to figure it out.
I tried upping the thresholds mentioned, up to -funfolding-creation-threshold 200 -funfolding-use-threshold 100. This didn't seem to make any performance difference (I didn't check the core output).
This project is based on Oleg's Iteratee code; I started using his IterateeM.hs and Enumerator.hs files and added my own stuff to Enumerator.hs (thanks Oleg, great work as always). When I started cleaning up by moving my functions from Enumerator.hs to MyEnum.hs, my minimal test case increased from 19s to 43s.
I've found two factors that contributed. When I was cleaning up, I also removed a bunch of unused functions from IterateeM.hs (some of the test functions and functions specific to his running example of HTTP encoding). When I added those functions back in, and added INLINE pragmas to the exported functions in MyEnum.hs, I got the performance back.
In general I hadn't added export lists to the modules yet, so all functions should have been exported.
So it seems that somehow the unused functions in IterateeM.hs are affecting how the functions I care about get implemented (or exported). I did not expect that. Next step for me is to see what happens if I INLINE the functions I'm exporting and remove the others, I suppose.
Thank you Simon and Don for your advice, especially since I'm pretty far over my head at this point.
John

| This project is based on Oleg's Iteratee code; I started using his
| IterateeM.hs and Enumerator.hs files and added my own stuff to
| Enumerator.hs (thanks Oleg, great work as always). When I started
| cleaning up by moving my functions from Enumerator.hs to MyEnum.hs, my
| minimal test case increased from 19s to 43s.
|
| I've found two factors that contributed. When I was cleaning up, I
| also removed a bunch of unused functions from IterateeM.hs (some of
| the test functions and functions specific to his running example of
| HTTP encoding). When I added those functions back in, and added
| INLINE pragmas to the exported functions in MyEnum.hs, I got the
| performance back.
|
| In general I hadn't added export lists to the modules yet, so all
| functions should have been exported.
I'm totally snowed under with backlog from my recent absence, so I can't look at this myself, but if anyone else wants to I'd be happy to support with advice and suggestions.
In general, having an explicit export list is good for performance. I typed an extra section in the GHC performance resource http://haskell.org/haskellwiki/Performance/GHC to explain why. In general that page is where we should document user advice for performance in GHC.
I can't explain why *adding* unused functions would change performance though!
Simon

Hi John,
I'm vaguely curious, and have next week off, so if you can provide the code, directions for running both variants, and the test case, I'll take a look. Please email me at ndmitchell -AT- gmail.com though, as I lose this email address at 11pm tonight :-)
Thanks
Neil
-----Original Message----- From: glasgow-haskell-users-bounces@haskell.org [mailto:glasgow-haskell-users-bounces@haskell.org] On Behalf Of Simon Peyton-Jones Sent: 21 November 2008 10:34 am To: John Lato Cc: glasgow-haskell-users@haskell.org; Don Stewart Subject: RE: cross module optimization issues
_______________________________________________ Glasgow-haskell-users mailing list Glasgow-haskell-users@haskell.org http://www.haskell.org/mailman/listinfo/glasgow-haskell-users

Hi
I've talked to John a bit, and discussed test cases etc. I've tracked
this down a little way.
Given the attached file, compiling with SHORT_EXPORT_LIST makes the
code go _slower_. By exporting the "print_lines" function the code
doubles in speed. This runs against everything I was expecting, and
that Simon has described.
Taking a look at the .hi files for the two alternatives, there are two
differences:
1) In the faster .hi file, the body of print_lines is exported. This
is reasonable and expected.
2) In the faster .hi file, there are additional specialisations, which
seemingly have little/nothing to do with print_lines, but are omitted
if it is not exported:
"SPEC >>= [GHC.IOBase.IO]" ALWAYS forall @ el
    $dMonad :: GHC.Base.Monad GHC.IOBase.IO
  Sound.IterateeM.>>= @ GHC.IOBase.IO @ el $dMonad
  = Sound.IterateeM.a
      `cast`
    (forall el1 a b.
     Sound.IterateeM.IterateeGM el1 GHC.IOBase.IO a
     -> (a -> Sound.IterateeM.IterateeGM el1 GHC.IOBase.IO b)
     -> trans
          (sym ((GHC.IOBase.:CoIO)
                  (Sound.IterateeM.IterateeG el1 GHC.IOBase.IO b)))
          (sym ((Sound.IterateeM.:CoIterateeGM) el1 GHC.IOBase.IO b)))
      @ el
"SPEC Sound.IterateeM.$f2 [GHC.IOBase.IO]" ALWAYS forall @ el
    $dMonad :: GHC.Base.Monad GHC.IOBase.IO
  Sound.IterateeM.$f2 @ GHC.IOBase.IO @ el $dMonad
  = Sound.IterateeM.$s$f2 @ el
"SPEC Sound.IterateeM.$f2 [GHC.IOBase.IO]" ALWAYS forall @ el
    $dMonad :: GHC.Base.Monad GHC.IOBase.IO
  Sound.IterateeM.$f2 @ GHC.IOBase.IO @ el $dMonad
  = Sound.IterateeM.$s$f21 @ el
"SPEC Sound.IterateeM.liftI [GHC.IOBase.IO]" ALWAYS forall @ el
    @ a
    $dMonad :: GHC.Base.Monad GHC.IOBase.IO
  Sound.IterateeM.liftI @ GHC.IOBase.IO @ el @ a $dMonad
  = Sound.IterateeM.$sliftI @ el @ a
"SPEC return [GHC.IOBase.IO]" ALWAYS forall @ el
    $dMonad :: GHC.Base.Monad GHC.IOBase.IO
  Sound.IterateeM.return @ GHC.IOBase.IO @ el $dMonad
  = Sound.IterateeM.a7
      `cast`
    (forall el1 a.
     a
     -> trans
          (sym ((GHC.IOBase.:CoIO)
                  (Sound.IterateeM.IterateeG el1 GHC.IOBase.IO a)))
          (sym ((Sound.IterateeM.:CoIterateeGM) el1 GHC.IOBase.IO a)))
      @ el
My guess is that these cause the slowdown - but is there any reason
that print_lines not being exported should cause them to be omitted?
All these tests were run on GHC 6.10.1 with -O2.
Thanks
Neil

The specialisations are indeed caused (indirectly) by the presence of print_lines. If print_lines is dead code (as it is when print_lines is not exported), then there are no calls to the overloaded functions at these specialised types, and so you don't get the specialised versions. You can get specialised versions by a SPECIALISE pragma, or SPECIALISE INSTANCE.
Does that make sense?
Simon
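A small sketch of the first option Simon mentions, a SPECIALISE pragma on an overloaded function. The function `replicateSum` is hypothetical and not from John's code; the pragma asks GHC to generate an IO-specific copy even if no call at that type is visible inside the defining module.

```haskell
module Main (main) where

-- Hypothetical overloaded helper: runs an action n times in any monad m
-- and sums the results.
replicateSum :: Monad m => Int -> m Int -> m Int
-- Request a copy specialised to IO, whether or not this module calls it at IO.
{-# SPECIALISE replicateSum :: Int -> IO Int -> IO Int #-}
replicateSum n act = go n 0
  where
    go k acc
      | k <= 0 = return acc
      | otherwise = do
          x <- act
          -- Force the running total before looping to avoid building thunks.
          go (k - 1) $! acc + x

main :: IO ()
main = replicateSum 3 (return 2) >>= print
```

Without the pragma, the specialised copy would only be generated if some live code in the module called `replicateSum` at the IO type, which is the role print_lines played by accident.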
| -----Original Message-----
| From: Neil Mitchell [mailto:ndmitchell@gmail.com]
| Sent: 28 November 2008 09:48
| To: Simon Peyton-Jones
| Cc: John Lato; glasgow-haskell-users@haskell.org; Don Stewart
| Subject: Re: cross module optimization issues

Neil, thank you very much for taking the time to look at this; I
greatly appreciate it.
One thing I don't understand is why the specializations are caused by
print_lines. I suppose the optimizer can infer something which it
couldn't otherwise.
If I read this properly, the functions being specialized are liftI,
(>>=), return, and $f2. One thing I'm not sure about is when INLINE
provides the desired optimal behavior, as opposed to SPECIALIZE. The
monad functions are defined in the Monad instance, and thus aren't
currently INLINE'd or SPECIALIZE'd. However, if they are separate
functions, would INLINE be sufficient? Would that give the optimizer
enough to work with to derive the specializations on its own? I'll
have some time to experiment with this myself tomorrow, but I'd
appreciate some direction (rather than guessing blindly).
What is "$f2"? I've seen that appear before, but I'm not sure where
it comes from.
Thanks,
John
On Fri, Nov 28, 2008 at 10:31 AM, Simon Peyton-Jones wrote:
> The specialisations are indeed caused (indirectly) by the presence of print_lines. If print_lines is dead code (as it is when print_lines is not exported), then there are no calls to the overloaded functions at these specialised types, and so you don't get the specialised versions. You can get specialised versions by a SPECIALISE pragma, or SPECIALISE INSTANCE.
> Does that make sense?
> Simon

The $f2 comes from the instance Monad (IterateeGM ...).
print_lines uses a specialised version of that instance, namely
Monad (IterateeGM el IO)
The fact that print_lines uses it makes GHC generate a specialised version of the instance decl.
Even in the absence of print_lines you can generate the specialised instance thus
    instance Monad m => Monad (IterateeGM el m) where
      {-# SPECIALISE instance Monad (IterateeGM el IO) #-}
      ... methods...
does that help?
Simon
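For readers who want something they can compile, here is a self-contained sketch of the same pragma. The `Wrap` type is a hypothetical, much simplified stand-in for IterateeGM (it is not Oleg's definition), and `countdown` is an invented client; only the placement of the SPECIALISE instance line mirrors Simon's snippet.

```haskell
module Main (main) where

-- Hypothetical stand-in for IterateeGM: a function from an environment
-- of type el into an underlying monad m.
newtype Wrap el m a = Wrap { runWrap :: el -> m a }

instance Monad m => Functor (Wrap el m) where
  fmap f (Wrap g) = Wrap $ \e -> fmap f (g e)

instance Monad m => Applicative (Wrap el m) where
  pure x = Wrap $ \_ -> return x
  Wrap f <*> Wrap g = Wrap $ \e -> f e >>= \h -> fmap h (g e)

instance Monad m => Monad (Wrap el m) where
  -- Ask GHC to build a copy of this instance specialised to IO, even if
  -- no live code in this module uses it at that type.
  {-# SPECIALISE instance Monad (Wrap el IO) #-}
  Wrap g >>= k = Wrap $ \e -> g e >>= \a -> runWrap (k a) e

-- Invented client code: counts down through the Wrap monad.
countdown :: Monad m => Int -> Wrap el m Int
countdown n
  | n <= 0 = return 0
  | otherwise = do
      rest <- countdown (n - 1)
      return (rest + 1)

main :: IO ()
main = runWrap (countdown 5) () >>= print
```

The pragma lives inside the instance declaration, so the specialised dictionary is generated when the defining module is compiled rather than depending on which client functions happen to survive dead-code elimination.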
| -----Original Message-----
| From: John Lato [mailto:jwlato@gmail.com]
| Sent: 28 November 2008 12:07
| To: Simon Peyton-Jones
| Cc: Neil Mitchell; glasgow-haskell-users@haskell.org; Don Stewart
| Subject: Re: cross module optimization issues

On 28/11/2008, at 15:46, Simon Peyton-Jones wrote:
> The $f2 comes from the instance Monad (IterateeGM ...). print_lines uses a specialised version of that instance, namely Monad (IterateeGM el IO). The fact that print_lines uses it makes GHC generate a specialised version of the instance decl.
> Even in the absence of print_lines you can generate the specialised instance thus
>   instance Monad m => Monad (IterateeGM el m) where
>     {-# SPECIALISE instance Monad (IterateeGM el IO) #-}
>     ... methods...
> does that help?
Once Simon and Neil dig into the issue and analyze it, the reason seems evident. But this thread reminds me of why writing high performance Haskell code is regarded as a black art outside the community (well, and sometimes inside too). Wouldn't a JIT version of GHC be a great thing to have? Or would a backend for LLVM be already beneficial enough?
Cheers
pepe

Hi
> instance Monad m => Monad (IterateeGM el m) where
>   {-# SPECIALISE instance Monad (IterateeGM el IO) #-}
>
> does that help?
Yes. With that specialise line in, we get identical performance between the two results. So, in summary:
The print_lines function uses the IterateeGM with IO as the underlying monad, which causes GHC to specialise IterateeGM with IO. If print_lines is not exported, then it is deleted as dead code, and the specialisation is never generated. The specialisation is crucial for performance later on. In this way, by keeping unused code reachable, GHC does better optimisation.
> Once Simon and Neil dig into the issue and analyze it, the reason seems evident. But this thread reminds me of why writing high performance Haskell code is regarded as a black art outside the community (well, and sometimes inside too).
> Wouldn't a JIT version of GHC be a great thing to have? Or would a backend for LLVM be already beneficial enough?
I don't think either would have the benefits offered by specialisation. If GHC exported more information about instances, it could do more specialisations later, but it is a trade off. If you ran GHC in some whole-program mode, then you wouldn't have the problem, but would gain additional problems.
I always hoped Supero (http://www-users.cs.york.ac.uk/~ndm/supero/) would remove some of the black art associated with program optimisation - there are no specialise pragmas, and I'm pretty sure in the above example it would have done the correct thing. In some ways, whole-program and fewer special cases gives a much better mental model of how optimisation might affect a program. Of course, it's still a research prototype, but perhaps one day...
Thanks
Neil

Yes, this does help, thank you. I didn't know you could generate
specialized instances. In fact, I was so sure that this was some
arcane feature I immediately went to the GHC User Guide because I
didn't believe it was documented.
I immediately stumbled upon Section 8.13.9.
Thanks to everyone who helped me with this. I think I've achieved a
small bit of enlightenment.
Cheers,
John

Is this, since it is in IO code, a -fno-state-hack scenario? Simon wrote recently about when and why -fno-state-hack would be needed, if you want to follow that up. -- Don

On Sat, Nov 22, 2008 at 6:55 PM, Don Stewart wrote:
> Is this, since it is in IO code, a -fno-state-hack scenario? Simon wrote recently about when and why -fno-state-hack would be needed, if you want to follow that up.
> -- Don
Unfortunately, -fno-state-hack doesn't seem to make much difference. In any case, only the functions that actually do file IO are in the IO monad; otherwise the functions use a generic Monad constraint. Although you have reminded me that I should make a non-IO test case. For Neil, and anyone else interested in looking at this, I'll put the code and build instructions up later today. I've just been cleaning up some test cases to make it easier to run. John