Where do I start if I would like help improve GHC compilation times?

newer
Re: Where do I start if I would...

older
downloads.haskell.org unavailable...

Alfredo Di Napoli

7 Apr 2017 7 Apr '17

3:29 p.m.

Hey folks, maybe I’m setting up for something too ambitious for me, but I would like to take an active stance to the overlasting “GHC compilation times are terrible” matter, instead of simply stare at the screen with despair whenever GHC compiles a sufficiently large Haskell program ;) To make this even more interesting, I have never contributed to GHC either! The max I have pushed myself into was 2 years ago when I successfully built GHC head from source and tried to fix an Haddock “easy” ticket I don’t even recall (full disclosure, eventually I didn’t :D ). Specifically, I would love community recommendations & guidance about: 1. Is this simply too daunting for somebody like me? Maybe is better to first start contributing more regularly, take confidence with the code base AND then move forward? 2. Are compilation times largely dependant from the target platform (I’m on Darwin) or there is something which can be done “globally” so that the benefits can be experienced by everybody? 3. Is there any recommended workflow to profile GHC compilation times? Is there any build flavour one should prefer when doing so? (Maybe the full, slowest one?) Thanks in advance for taking the time reading this mail, and have a nice weekend! Alfredo

Attachments:

attachment.html (text/html — 1.5 KB)

Show replies by date

Ben Gamari

7 Apr 7 Apr

4:22 p.m.

Alfredo Di Napoli writes:

...

Hey folks,

Hi Alfredo! First, thanks for writing. More eyes looking at GHC's compiler performance is badly needed.

...

maybe I’m setting up for something too ambitious for me, but I would like to take an active stance to the overlasting “GHC compilation times are terrible” matter, instead of simply stare at the screen with despair whenever GHC compiles a sufficiently large Haskell program ;)

To make this even more interesting, I have never contributed to GHC either! The max I have pushed myself into was 2 years ago when I successfully built GHC head from source and tried to fix an Haddock “easy” ticket I don’t even recall (full disclosure, eventually I didn’t :D ).

Specifically, I would love community recommendations & guidance about:

1. Is this simply too daunting for somebody like me? Maybe is better to first start contributing more regularly, take confidence with the code base AND then move forward?

As with any software project, it is possible to treat the compiler as a black box, throw a profiler at it and see what hotspots show up. This gives you a place to focus your effort, allowing you to learn a small area and broaden your knowledge as necessary. However, I think it's fair to say that you will be significantly more productive if you first develop a basic understanding of the compilation pipeline. I'd recommend having a look at the GHC Commentary [1] for a start. I think it also helps to have a rough idea of what "slow" means to you. I find it is quite helpful if you have a particular program which you feel compiles more slowly than you would like (especially if it even compiles slowly with -O0, since then much less of the compiler is involved in compilation). Another approach is to look for programs whose compilation time has regressed over the course of GHC releases. It is not hard to find these examples and it is often possible to bisect your way back to the regressing commit. Also, note that I have collected some notes pertaining to compiler performance on the Wiki [2]. Here you will find a number of tickets of interest (as well a some rough themes which I've noticed), some nofib results which might guide your efforts, as well as a list of some fixes which have been committed in the past. [1] https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler [2] https://ghc.haskell.org/trac/ghc/wiki/Performance/Compiler

...

2. Are compilation times largely dependant from the target platform (I’m on Darwin) or there is something which can be done “globally” so that the benefits can be experienced by everybody?

There are some external considerations (e.g. the platform's compiler and linking toolchain) which contribute to GHC's runtime. For instance, it is known that the BFD ld linker implementation that many Linux distributions use by default is a great deal slower than it could be. This particular issue has come up recently and I'm currently working on allowing us to use the more performant gold linker when available. However, I think it's fair to say that for most programs GHC's runtime is largely independent of platform. I would invite you to try compiling a package which you consider GHC to compile "slowly" with GHC's -v flag (and GHC 8.0.1 or newer). This will give you a rough breakdown of where time is spent. For many packages you will find that the simplifier and/or typechecker dominate, followed (often distantly) by native code generation. Of these steps native code generation is the only one with a strong platform dependence.

...

3. Is there any recommended workflow to profile GHC compilation times? Is there any build flavour one should prefer when doing so? (Maybe the full, slowest one?)

There are a few options here: * As of GHC 8.0 the compiler will output timing and allocation information for its various stages if run with -v. This can be extremely helpful to get a high-level picture of where the compiler is spending its time while compiling your program. This is almost always the right place to start. * As with any Haskell program, the cost centre profiler can be used to characterize the memory and CPU behavior of various parts of the compiler. GHC's source tree includes a "prof" build flavour which builds the compiler with profiling enabled. However it only includes a handful of cost-centres and is best used when you already have a rough idea where you are looking and can add further cost-centres to drill down to your hotspot. Simply enabling -fprof-exported across the entire tree just doesn't work in my experience: not only is the resulting compiler quite slow, but the profile you get is far too unwieldy to learn from. * Occassionally the ticky-ticky profiler can be helpful in identifying allocation hotspots without the full overhead of the cost-centre profiler. * In principle our newly-stable DWARF debug information can be used for profiling, although this is still a work in progress and requires a patched GHC for best results. It's probably best to stick to the more traditional profiling mechanisms for now. Anyways, I hope this helps. Always feel free to get in touch with me personally (IRC and email are both great) if you would like to discuss particular issues. Thanks again for your interest! Cheers, - Ben

Alfredo Di Napoli

4:30 p.m.

Hey Ben, thanks for the quite exhaustive reply! I’m on the go right now, but I promise to get back to you with a meaningful reply later this weekend ;) Alfredo On 7 April 2017 at 18:22, Ben Gamari wrote:

...

Alfredo Di Napoli writes:

...
Hey folks,

Hi Alfredo!

First, thanks for writing. More eyes looking at GHC's compiler performance is badly needed.

...
maybe I’m setting up for something too ambitious for me, but I would like to take an active stance to the overlasting “GHC compilation times are terrible” matter, instead of simply stare at the screen with despair whenever GHC compiles a sufficiently large Haskell program ;)

To make this even more interesting, I have never contributed to GHC either! The max I have pushed myself into was 2 years ago when I successfully built GHC head from source and tried to fix an Haddock “easy” ticket I don’t even recall (full disclosure, eventually I didn’t :D ).

Specifically, I would love community recommendations & guidance about:

1. Is this simply too daunting for somebody like me? Maybe is better to first start contributing more regularly, take confidence with the code base AND then move forward?

As with any software project, it is possible to treat the compiler as a black box, throw a profiler at it and see what hotspots show up. This gives you a place to focus your effort, allowing you to learn a small area and broaden your knowledge as necessary.

However, I think it's fair to say that you will be significantly more productive if you first develop a basic understanding of the compilation pipeline. I'd recommend having a look at the GHC Commentary [1] for a start.

I think it also helps to have a rough idea of what "slow" means to you. I find it is quite helpful if you have a particular program which you feel compiles more slowly than you would like (especially if it even compiles slowly with -O0, since then much less of the compiler is involved in compilation). Another approach is to look for programs whose compilation time has regressed over the course of GHC releases. It is not hard to find these examples and it is often possible to bisect your way back to the regressing commit.

Also, note that I have collected some notes pertaining to compiler performance on the Wiki [2]. Here you will find a number of tickets of interest (as well a some rough themes which I've noticed), some nofib results which might guide your efforts, as well as a list of some fixes which have been committed in the past.

[1] https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler [2] https://ghc.haskell.org/trac/ghc/wiki/Performance/Compiler

...
2. Are compilation times largely dependant from the target platform (I’m on Darwin) or there is something which can be done “globally” so that the benefits can be experienced by everybody?

There are some external considerations (e.g. the platform's compiler and linking toolchain) which contribute to GHC's runtime. For instance, it is known that the BFD ld linker implementation that many Linux distributions use by default is a great deal slower than it could be. This particular issue has come up recently and I'm currently working on allowing us to use the more performant gold linker when available.

However, I think it's fair to say that for most programs GHC's runtime is largely independent of platform. I would invite you to try compiling a package which you consider GHC to compile "slowly" with GHC's -v flag (and GHC 8.0.1 or newer). This will give you a rough breakdown of where time is spent. For many packages you will find that the simplifier and/or typechecker dominate, followed (often distantly) by native code generation. Of these steps native code generation is the only one with a strong platform dependence.

...
3. Is there any recommended workflow to profile GHC compilation times? Is there any build flavour one should prefer when doing so? (Maybe the full, slowest one?)

There are a few options here:

* As of GHC 8.0 the compiler will output timing and allocation information for its various stages if run with -v. This can be extremely helpful to get a high-level picture of where the compiler is spending its time while compiling your program. This is almost always the right place to start.

* As with any Haskell program, the cost centre profiler can be used to characterize the memory and CPU behavior of various parts of the compiler.

GHC's source tree includes a "prof" build flavour which builds the compiler with profiling enabled. However it only includes a handful of cost-centres and is best used when you already have a rough idea where you are looking and can add further cost-centres to drill down to your hotspot.

Simply enabling -fprof-exported across the entire tree just doesn't work in my experience: not only is the resulting compiler quite slow, but the profile you get is far too unwieldy to learn from.

* Occassionally the ticky-ticky profiler can be helpful in identifying allocation hotspots without the full overhead of the cost-centre profiler.

* In principle our newly-stable DWARF debug information can be used for profiling, although this is still a work in progress and requires a patched GHC for best results. It's probably best to stick to the more traditional profiling mechanisms for now.

Anyways, I hope this helps. Always feel free to get in touch with me personally (IRC and email are both great) if you would like to discuss particular issues. Thanks again for your interest!

Cheers,

- Ben

Alfredo Di Napoli

9 Apr 9 Apr

9:37 a.m.

Hey Ben, as promised I’m back to you with something more articulated and hopefully meaningful. I do hear you perfectly — probably trying to dive head-first into this without at least a rough understanding of the performance hotspots or the GHC overall architecture is going to do me more harm than good (I get the overall picture and I’m aware of the different stages of the GHC compilation pipeline, but it’s far from saying I’m proficient with the architecture as whole). I have also read a couple of years ago the GHC chapter on the “Architeture of Open Source Applications” book, but I don’t know how much that is still relevant. If it is, I guess I should refresh my memory. I’m currently trying to move on 2 fronts — please advice if I’m a fool flogging a dead horse or if I have any hope of getting anything done ;) 1. I’m trying to treat indeed the compiler as a black block (as you adviced) trying to build a sufficiently large program where GHC is not “as fast as I would like” (I know that’s a very lame definition of “slow”, hehe). In particular, I have built the stage2 compiler with the “prof” flavour as you suggested, and I have chosen 2 examples as a reference “benchmark” for performance; DynFlags.hs (which seems to have been mentioned multiple times as a GHC perf killer) and the highlighting-kate package as posted here: https://ghc.haskell.org/trac/ghc/ticket/9221 . The idea would be to compile those with -v +RTS -p -hc -RTS enabled, look at the output from the .prof file AND the `-v` flag, find any hotspot, try to change something, recompile, observe diff, rinse and repeat. Do you think I have any hope of making progress this way? In particular, I think compiling DynFlags.hs is a bit of a dead-end; I whipped up this buggy script which escalated into a Behemoth which is compiling pretty much half of the compiler once again :D ``` #!/usr/bin/env bash ../ghc/inplace/bin/ghc-stage2 --make -j8 -v +RTS -A256M -qb0 -p -h \ -RTS -DSTAGE=2 -I../ghc/includes -I../ghc/compiler -I../ghc/compiler/stage2 \ -I../ghc/compiler/stage2/build \ -i../ghc/compiler/utils:../ghc/compiler/types:../ghc/compiler/typecheck:../ghc/compiler/basicTypes \ -i../ghc/compiler/main:../ghc/compiler/profiling:../ghc/compiler/coreSyn:../ghc/compiler/iface:../ghc/compiler/prelude \ -i../ghc/compiler/stage2/build:../ghc/compiler/simplStg:../ghc/compiler/cmm:../ghc/compiler/parser:../ghc/compiler/hsSyn \ -i../ghc/compiler/ghci:../ghc/compiler/deSugar:../ghc/compiler/simplCore:../ghc/compile/specialise \ -fforce-recomp -c $@ ``` I’m running it with `./dynflags.sh ../ghc/compiler/main/DynFlags.hs` but it’s taking a lot to compile (20+ mins on my 2014 mac Pro) because it’s pulling in half of the compiler anyway :D I tried to reuse the .hi files from my stage2 compilation but I failed (GHC was complaining about interface file mismatch). Short story short, I don’t think it will be a very agile way to proceed. Am I right? Do you have any recommendation in such sense? Do I have any hope to compile DynFlags.hs in a way which would make this perf investigation feasible? The second example (the highlighting-kate package) seems much more promising. It takes maybe 1-2 mins on my machine, which is enough to take a look at the perf output. Do you think I should follow this second lead? In principle any 50+ modules package I think would do (better if with a lot of TH ;) ) but this seems like a low-entry barrier start. 2. The second path I’m exploring is simply to take a less holistic approach and try to dive in into a performance ticket like the ones listed here: https://www.reddit.com/r/haskell/comments/45q90s/is_anything_being_done_to_r... Maybe some are very specific, but it seems like fixing small things and move forward could help giving me understanding of different sub-parts of GHC, which seems less intimidating than the black-box approach. In conclusion, what do you think is the best approach, 1 or 2, both or none? ;) Thank you! Alfredo On 7 April 2017 at 18:30, Alfredo Di Napoli wrote:

...

Hey Ben,

thanks for the quite exhaustive reply! I’m on the go right now, but I promise to get back to you with a meaningful reply later this weekend ;)

Alfredo

On 7 April 2017 at 18:22, Ben Gamari wrote:

...
Alfredo Di Napoli writes:

...
Hey folks,

Hi Alfredo!

First, thanks for writing. More eyes looking at GHC's compiler performance is badly needed.

...
maybe I’m setting up for something too ambitious for me, but I would like to take an active stance to the overlasting “GHC compilation times are terrible” matter, instead of simply stare at the screen with despair whenever GHC compiles a sufficiently large Haskell program ;)

To make this even more interesting, I have never contributed to GHC either! The max I have pushed myself into was 2 years ago when I successfully built GHC head from source and tried to fix an Haddock “easy” ticket I don’t even recall (full disclosure, eventually I didn’t :D ).

Specifically, I would love community recommendations & guidance about:

1. Is this simply too daunting for somebody like me? Maybe is better to first start contributing more regularly, take confidence with the code base AND then move forward?

As with any software project, it is possible to treat the compiler as a black box, throw a profiler at it and see what hotspots show up. This gives you a place to focus your effort, allowing you to learn a small area and broaden your knowledge as necessary.

However, I think it's fair to say that you will be significantly more productive if you first develop a basic understanding of the compilation pipeline. I'd recommend having a look at the GHC Commentary [1] for a start.

I think it also helps to have a rough idea of what "slow" means to you. I find it is quite helpful if you have a particular program which you feel compiles more slowly than you would like (especially if it even compiles slowly with -O0, since then much less of the compiler is involved in compilation). Another approach is to look for programs whose compilation time has regressed over the course of GHC releases. It is not hard to find these examples and it is often possible to bisect your way back to the regressing commit.

Also, note that I have collected some notes pertaining to compiler performance on the Wiki [2]. Here you will find a number of tickets of interest (as well a some rough themes which I've noticed), some nofib results which might guide your efforts, as well as a list of some fixes which have been committed in the past.

[1] https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler [2] https://ghc.haskell.org/trac/ghc/wiki/Performance/Compiler

...
2. Are compilation times largely dependant from the target platform (I’m on Darwin) or there is something which can be done “globally” so that the benefits can be experienced by everybody?

There are some external considerations (e.g. the platform's compiler and linking toolchain) which contribute to GHC's runtime. For instance, it is known that the BFD ld linker implementation that many Linux distributions use by default is a great deal slower than it could be. This particular issue has come up recently and I'm currently working on allowing us to use the more performant gold linker when available.

However, I think it's fair to say that for most programs GHC's runtime is largely independent of platform. I would invite you to try compiling a package which you consider GHC to compile "slowly" with GHC's -v flag (and GHC 8.0.1 or newer). This will give you a rough breakdown of where time is spent. For many packages you will find that the simplifier and/or typechecker dominate, followed (often distantly) by native code generation. Of these steps native code generation is the only one with a strong platform dependence.

...
3. Is there any recommended workflow to profile GHC compilation times? Is there any build flavour one should prefer when doing so? (Maybe the full, slowest one?)

There are a few options here:

* As of GHC 8.0 the compiler will output timing and allocation information for its various stages if run with -v. This can be extremely helpful to get a high-level picture of where the compiler is spending its time while compiling your program. This is almost always the right place to start.

* As with any Haskell program, the cost centre profiler can be used to characterize the memory and CPU behavior of various parts of the compiler.

GHC's source tree includes a "prof" build flavour which builds the compiler with profiling enabled. However it only includes a handful of cost-centres and is best used when you already have a rough idea where you are looking and can add further cost-centres to drill down to your hotspot.

Simply enabling -fprof-exported across the entire tree just doesn't work in my experience: not only is the resulting compiler quite slow, but the profile you get is far too unwieldy to learn from.

* Occassionally the ticky-ticky profiler can be helpful in identifying allocation hotspots without the full overhead of the cost-centre profiler.

* In principle our newly-stable DWARF debug information can be used for profiling, although this is still a work in progress and requires a patched GHC for best results. It's probably best to stick to the more traditional profiling mechanisms for now.

Anyways, I hope this helps. Always feel free to get in touch with me personally (IRC and email are both great) if you would like to discuss particular issues. Thanks again for your interest!

Cheers,

- Ben

Reid Barton

8:42 p.m.

Building modules from GHC itself is a little tricky and DynFlags is extra tricky since it is involved in import cycles. Here is what I do: * Copy DynFlags.hs somewhere outside the tree (for your present purposes, it is no longer part of the compiler, but just some module to be provided as input). * Get rid of all the {-# SOURCE #-} pragmas on imports to turn them into ordinary, non-boot file imports. * Build with ".../ghc/inplace/bin/ghc-stage2 DynFlags -package ghc -I.../ghc/compiler/stage2" plus whatever other options you want (e.g., probably "-fforce-recomp -O +RTS -s" at a minimum). By using "-package ghc" you compile DynFlags against the version of ghc that you have just built. * This will result in some type errors, because DynFlags imports some functions that expect arguments of type DynFlags. (This relates to the import cycles that we broke earlier.) Since you are building against the version of those functions from the ghc package, they expect the type ghc:DynFlags.DynFlags, but they are now receiving a value of type DynFlags from the main package. This is no big deal, just insert an unsafeCoerce wherever necessary (mostly in front of occurrences of "dflags") to get the compiler to stop complaining. This is not 100% faithful to the way DynFlags would actually be compiled during a GHC build, but the advantage of this method is that you don't have to worry about GHC doing any recompilation checking between the copy of DynFlags that you are testing on and the compiler's own modules. Regards, Reid Barton On Sun, Apr 9, 2017 at 5:37 AM, Alfredo Di Napoli wrote:

...

Hey Ben,

as promised I’m back to you with something more articulated and hopefully meaningful. I do hear you perfectly — probably trying to dive head-first into this without at least a rough understanding of the performance hotspots or the GHC overall architecture is going to do me more harm than good (I get the overall picture and I’m aware of the different stages of the GHC compilation pipeline, but it’s far from saying I’m proficient with the architecture as whole). I have also read a couple of years ago the GHC chapter on the “Architeture of Open Source Applications” book, but I don’t know how much that is still relevant. If it is, I guess I should refresh my memory.

I’m currently trying to move on 2 fronts — please advice if I’m a fool flogging a dead horse or if I have any hope of getting anything done ;)

1. I’m trying to treat indeed the compiler as a black block (as you adviced) trying to build a sufficiently large program where GHC is not “as fast as I would like” (I know that’s a very lame definition of “slow”, hehe). In particular, I have built the stage2 compiler with the “prof” flavour as you suggested, and I have chosen 2 examples as a reference “benchmark” for performance; DynFlags.hs (which seems to have been mentioned multiple times as a GHC perf killer) and the highlighting-kate package as posted here: https://ghc.haskell.org/trac/ghc/ticket/9221 . The idea would be to compile those with -v +RTS -p -hc -RTS enabled, look at the output from the .prof file AND the `-v` flag, find any hotspot, try to change something, recompile, observe diff, rinse and repeat. Do you think I have any hope of making progress this way? In particular, I think compiling DynFlags.hs is a bit of a dead-end; I whipped up this buggy script which escalated into a Behemoth which is compiling pretty much half of the compiler once again :D

``` #!/usr/bin/env bash

../ghc/inplace/bin/ghc-stage2 --make -j8 -v +RTS -A256M -qb0 -p -h \ -RTS -DSTAGE=2 -I../ghc/includes -I../ghc/compiler -I../ghc/compiler/stage2 \ -I../ghc/compiler/stage2/build \ -i../ghc/compiler/utils:../ghc/compiler/types:../ghc/compiler/typecheck:../ghc/compiler/basicTypes \ -i../ghc/compiler/main:../ghc/compiler/profiling:../ghc/compiler/coreSyn:../ghc/compiler/iface:../ghc/compiler/prelude \ -i../ghc/compiler/stage2/build:../ghc/compiler/simplStg:../ghc/compiler/cmm:../ghc/compiler/parser:../ghc/compiler/hsSyn \ -i../ghc/compiler/ghci:../ghc/compiler/deSugar:../ghc/compiler/simplCore:../ghc/compile/specialise \ -fforce-recomp -c $@ ```

I’m running it with `./dynflags.sh ../ghc/compiler/main/DynFlags.hs` but it’s taking a lot to compile (20+ mins on my 2014 mac Pro) because it’s pulling in half of the compiler anyway :D I tried to reuse the .hi files from my stage2 compilation but I failed (GHC was complaining about interface file mismatch). Short story short, I don’t think it will be a very agile way to proceed. Am I right? Do you have any recommendation in such sense? Do I have any hope to compile DynFlags.hs in a way which would make this perf investigation feasible?

The second example (the highlighting-kate package) seems much more promising. It takes maybe 1-2 mins on my machine, which is enough to take a look at the perf output. Do you think I should follow this second lead? In principle any 50+ modules package I think would do (better if with a lot of TH ;) ) but this seems like a low-entry barrier start.

2. The second path I’m exploring is simply to take a less holistic approach and try to dive in into a performance ticket like the ones listed here: https://www.reddit.com/r/haskell/comments/45q90s/is_anything_being_done_to_r... Maybe some are very specific, but it seems like fixing small things and move forward could help giving me understanding of different sub-parts of GHC, which seems less intimidating than the black-box approach.

In conclusion, what do you think is the best approach, 1 or 2, both or none? ;)

Thank you!

Alfredo

On 7 April 2017 at 18:30, Alfredo Di Napoli wrote:

...
Hey Ben,

thanks for the quite exhaustive reply! I’m on the go right now, but I promise to get back to you with a meaningful reply later this weekend ;)

Alfredo

On 7 April 2017 at 18:22, Ben Gamari wrote:

...
Alfredo Di Napoli writes:

...
Hey folks,

Hi Alfredo!

First, thanks for writing. More eyes looking at GHC's compiler performance is badly needed.

...
maybe I’m setting up for something too ambitious for me, but I would like to take an active stance to the overlasting “GHC compilation times are terrible” matter, instead of simply stare at the screen with despair whenever GHC compiles a sufficiently large Haskell program ;)

To make this even more interesting, I have never contributed to GHC either! The max I have pushed myself into was 2 years ago when I successfully built GHC head from source and tried to fix an Haddock “easy” ticket I don’t even recall (full disclosure, eventually I didn’t :D ).

Specifically, I would love community recommendations & guidance about:

1. Is this simply too daunting for somebody like me? Maybe is better to first start contributing more regularly, take confidence with the code base AND then move forward?

As with any software project, it is possible to treat the compiler as a black box, throw a profiler at it and see what hotspots show up. This gives you a place to focus your effort, allowing you to learn a small area and broaden your knowledge as necessary.

However, I think it's fair to say that you will be significantly more productive if you first develop a basic understanding of the compilation pipeline. I'd recommend having a look at the GHC Commentary [1] for a start.

I think it also helps to have a rough idea of what "slow" means to you. I find it is quite helpful if you have a particular program which you feel compiles more slowly than you would like (especially if it even compiles slowly with -O0, since then much less of the compiler is involved in compilation). Another approach is to look for programs whose compilation time has regressed over the course of GHC releases. It is not hard to find these examples and it is often possible to bisect your way back to the regressing commit.

Also, note that I have collected some notes pertaining to compiler performance on the Wiki [2]. Here you will find a number of tickets of interest (as well a some rough themes which I've noticed), some nofib results which might guide your efforts, as well as a list of some fixes which have been committed in the past.

[1] https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler [2] https://ghc.haskell.org/trac/ghc/wiki/Performance/Compiler

...
2. Are compilation times largely dependant from the target platform (I’m on Darwin) or there is something which can be done “globally” so that the benefits can be experienced by everybody?

There are some external considerations (e.g. the platform's compiler and linking toolchain) which contribute to GHC's runtime. For instance, it is known that the BFD ld linker implementation that many Linux distributions use by default is a great deal slower than it could be. This particular issue has come up recently and I'm currently working on allowing us to use the more performant gold linker when available.

However, I think it's fair to say that for most programs GHC's runtime is largely independent of platform. I would invite you to try compiling a package which you consider GHC to compile "slowly" with GHC's -v flag (and GHC 8.0.1 or newer). This will give you a rough breakdown of where time is spent. For many packages you will find that the simplifier and/or typechecker dominate, followed (often distantly) by native code generation. Of these steps native code generation is the only one with a strong platform dependence.

...
3. Is there any recommended workflow to profile GHC compilation times? Is there any build flavour one should prefer when doing so? (Maybe the full, slowest one?)

There are a few options here:

* As of GHC 8.0 the compiler will output timing and allocation information for its various stages if run with -v. This can be extremely helpful to get a high-level picture of where the compiler is spending its time while compiling your program. This is almost always the right place to start.

* As with any Haskell program, the cost centre profiler can be used to characterize the memory and CPU behavior of various parts of the compiler.

GHC's source tree includes a "prof" build flavour which builds the compiler with profiling enabled. However it only includes a handful of cost-centres and is best used when you already have a rough idea where you are looking and can add further cost-centres to drill down to your hotspot.

Simply enabling -fprof-exported across the entire tree just doesn't work in my experience: not only is the resulting compiler quite slow, but the profile you get is far too unwieldy to learn from.

* Occassionally the ticky-ticky profiler can be helpful in identifying allocation hotspots without the full overhead of the cost-centre profiler.

* In principle our newly-stable DWARF debug information can be used for profiling, although this is still a work in progress and requires a patched GHC for best results. It's probably best to stick to the more traditional profiling mechanisms for now.

Anyways, I hope this helps. Always feel free to get in touch with me personally (IRC and email are both great) if you would like to discuss particular issues. Thanks again for your interest!

Cheers,

- Ben

_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Ben Gamari

10 Apr 10 Apr

10:47 p.m.

Alfredo Di Napoli writes:

...

Hey Ben,

Hi Alfredo, Sorry for the late response! The email queue from the weekend was a bit longer than I would like.

...

as promised I’m back to you with something more articulated and hopefully meaningful. I do hear you perfectly — probably trying to dive head-first into this without at least a rough understanding of the performance hotspots or the GHC overall architecture is going to do me more harm than good (I get the overall picture and I’m aware of the different stages of the GHC compilation pipeline, but it’s far from saying I’m proficient with the architecture as whole). I have also read a couple of years ago the GHC chapter on the “Architeture of Open Source Applications” book, but I don’t know how much that is still relevant. If it is, I guess I should refresh my memory.

It sounds like you have done a good amount of reading. That's great. Perhaps skimming the AOSA chapter again wouldn't hurt, but otherwise it's likely worthwhile diving in.

...

I’m currently trying to move on 2 fronts — please advice if I’m a fool flogging a dead horse or if I have any hope of getting anything done ;)

1. I’m trying to treat indeed the compiler as a black block (as you adviced) trying to build a sufficiently large program where GHC is not “as fast as I would like” (I know that’s a very lame definition of “slow”, hehe). In particular, I have built the stage2 compiler with the “prof” flavour as you suggested, and I have chosen 2 examples as a reference “benchmark” for performance; DynFlags.hs (which seems to have been mentioned multiple times as a GHC perf killer) and the highlighting-kate package as posted here: https://ghc.haskell.org/trac/ghc/ticket/9221 .

Indeed, #9221 would be a very interesting ticket to look at. The highlighting-kate package is interesting in the context of that ticket as it has a very large amount of parallelism available. If you do want to look at #9221, note that the cost centre profiler may not provide the whole story. In particular, it has been speculated that the scaling issues may be due to either, * threads hitting a blackhole, resulting in blocking * the usual scaling limitations of GHC's stop-the-world GC The eventlog may be quite useful for characterising these.

...

The idea would be to compile those with -v +RTS -p -hc -RTS enabled, look at the output from the .prof file AND the `-v` flag, find any hotspot, try to change something, recompile, observe diff, rinse and repeat. Do you think I have any hope of making progress this way? In particular, I think compiling DynFlags.hs is a bit of a dead-end; I whipped up this buggy script which escalated into a Behemoth which is compiling pretty much half of the compiler once again :D

``` #!/usr/bin/env bash

../ghc/inplace/bin/ghc-stage2 --make -j8 -v +RTS -A256M -qb0 -p -h \ -RTS -DSTAGE=2 -I../ghc/includes -I../ghc/compiler -I../ghc/compiler/stage2 \ -I../ghc/compiler/stage2/build \ -i../ghc/compiler/utils:../ghc/compiler/types:../ghc/compiler/typecheck:../ghc/compiler/basicTypes \ -i../ghc/compiler/main:../ghc/compiler/profiling:../ghc/compiler/coreSyn:../ghc/compiler/iface:../ghc/compiler/prelude \ -i../ghc/compiler/stage2/build:../ghc/compiler/simplStg:../ghc/compiler/cmm:../ghc/compiler/parser:../ghc/compiler/hsSyn \ -i../ghc/compiler/ghci:../ghc/compiler/deSugar:../ghc/compiler/simplCore:../ghc/compile/specialise \ -fforce-recomp -c $@ ```

I’m running it with `./dynflags.sh ../ghc/compiler/main/DynFlags.hs` but it’s taking a lot to compile (20+ mins on my 2014 mac Pro) because it’s pulling in half of the compiler anyway :D I tried to reuse the .hi files from my stage2 compilation but I failed (GHC was complaining about interface file mismatch). Short story short, I don’t think it will be a very agile way to proceed. Am I right? Do you have any recommendation in such sense? Do I have any hope to compile DynFlags.hs in a way which would make this perf investigation feasible?

What I usually do in this case is just take the relevant `ghc` command line directly from the `make` output and execute it manually. I would imagine your debug cycle would look something like, * instrument the compiler * build stage1 * use stage2 to build DynFlags using the stage1 compiler (using a saved command line) * think * repeat This should only take a few minutes per iteration.

...

The second example (the highlighting-kate package) seems much more promising. It takes maybe 1-2 mins on my machine, which is enough to take a look at the perf output. Do you think I should follow this second lead? In principle any 50+ modules package I think would do (better if with a lot of TH ;) ) but this seems like a low-entry barrier start.

2. The second path I’m exploring is simply to take a less holistic approach and try to dive in into a performance ticket like the ones listed here: https://www.reddit.com/r/haskell/comments/45q90s/is_anything_being_done_to_r... Maybe some are very specific, but it seems like fixing small things and move forward could help giving me understanding of different sub-parts of GHC, which seems less intimidating than the black-box approach.

Do you have any specific tickets from these lists that you found interesting?

...

In conclusion, what do you think is the best approach, 1 or 2, both or none? ;)

I would say that it largely depends upon what you feel most comfortable with. If you feel up for it, I think #9221 would be a nice, fairly self-contained, yet high-impact ticket which would be worth spending a few days diving further into. Cheers, - Ben

Alfredo Di Napoli

17 Apr 17 Apr

5:44 p.m.

Dear all, after sprinkling (ehm, littering) GHC source code with cost centres, I was not surprised to see that roughly 20% of the compilation time (as in .prof) was spent in the core gen/simplification process (10% of the total time) and on the asm code gen (another 10%). I have almost immediately abandoned the idea of try optimising some modules in simplCore (considering my 0-knowledge of GHC internals anyway..) but I have been dwelling on the following: Outputable.hs and Pretty.hs seems to be have been implemented making deliberate use of lists and concatenations between them, which left me wondering if there was room for optimisation there. I have found this interesting paper on the topic: https://www.cs.kent.ac.uk/pubs/2005/2062/content.pdf Now, it’s totally possible that this has been already tried (with no success) but judging from the original copyright of Pretty.hs (dated 2001), it seems it was written prior to the work of Olaf Chitil (the author of the paper). TL;DR I was thinking (even just as a fun exercise to learn more about GHC internals) to leverage the ideas of that paper and switch to a different implementation for `Doc` coupled with the use of lazy dequeues, which *might* increase the performances of the codegen and thus of the compiler overall. Am I fighting a strawman (or flogging a dead horse, pick your rethorical figure :D ) or is there even a tiny chance of this being actually useful? Have a nice evening, Alfredo On 11 April 2017 at 00:47, Ben Gamari wrote:

...

Alfredo Di Napoli writes:

...
Hey Ben,

Hi Alfredo,

Sorry for the late response! The email queue from the weekend was a bit longer than I would like.

...
as promised I’m back to you with something more articulated and hopefully meaningful. I do hear you perfectly — probably trying to dive head-first into this without at least a rough understanding of the performance hotspots or the GHC overall architecture is going to do me more harm than good (I get the overall picture and I’m aware of the different stages of the GHC compilation pipeline, but it’s far from saying I’m proficient with the architecture as whole). I have also read a couple of years ago the GHC chapter on the “Architeture of Open Source Applications” book, but I don’t know how much that is still relevant. If it is, I guess I should refresh my memory.

It sounds like you have done a good amount of reading. That's great. Perhaps skimming the AOSA chapter again wouldn't hurt, but otherwise it's likely worthwhile diving in.

...
I’m currently trying to move on 2 fronts — please advice if I’m a fool flogging a dead horse or if I have any hope of getting anything done ;)

1. I’m trying to treat indeed the compiler as a black block (as you adviced) trying to build a sufficiently large program where GHC is not “as fast as I would like” (I know that’s a very lame definition of “slow”, hehe). In particular, I have built the stage2 compiler with the “prof” flavour as you suggested, and I have chosen 2 examples as a reference “benchmark” for performance; DynFlags.hs (which seems to have been mentioned multiple times as a GHC perf killer) and the highlighting-kate package as posted here: https://ghc.haskell.org/trac/ghc/ticket/9221 .

Indeed, #9221 would be a very interesting ticket to look at. The highlighting-kate package is interesting in the context of that ticket as it has a very large amount of parallelism available.

If you do want to look at #9221, note that the cost centre profiler may not provide the whole story. In particular, it has been speculated that the scaling issues may be due to either,

* threads hitting a blackhole, resulting in blocking

* the usual scaling limitations of GHC's stop-the-world GC

The eventlog may be quite useful for characterising these.

...
The idea would be to compile those with -v +RTS -p -hc -RTS enabled, look at the output from the .prof file AND the `-v` flag, find any hotspot, try to change something, recompile, observe diff, rinse and repeat. Do you think I have any hope of making progress this way? In particular, I think compiling DynFlags.hs is a bit of a dead-end; I whipped up this buggy script which escalated into a Behemoth which is compiling pretty much half of the compiler once again :D

``` #!/usr/bin/env bash

../ghc/inplace/bin/ghc-stage2 --make -j8 -v +RTS -A256M -qb0 -p -h \ -RTS -DSTAGE=2 -I../ghc/includes -I../ghc/compiler -I../ghc/compiler/stage2 \ -I../ghc/compiler/stage2/build \ -i../ghc/compiler/utils:../ghc/compiler/types:../ghc/ compiler/typecheck:../ghc/compiler/basicTypes \ -i../ghc/compiler/main:../ghc/compiler/profiling:../ghc/ compiler/coreSyn:../ghc/compiler/iface:../ghc/compiler/prelude \ -i../ghc/compiler/stage2/build:../ghc/compiler/ simplStg:../ghc/compiler/cmm:../ghc/compiler/parser:../ghc/compiler/hsSyn \ -i../ghc/compiler/ghci:../ghc/compiler/deSugar:../ghc/ compiler/simplCore:../ghc/compile/specialise \ -fforce-recomp -c $@ ```

I’m running it with `./dynflags.sh ../ghc/compiler/main/DynFlags.hs` but it’s taking a lot to compile (20+ mins on my 2014 mac Pro) because it’s pulling in half of the compiler anyway :D I tried to reuse the .hi files from my stage2 compilation but I failed (GHC was complaining about interface file mismatch). Short story short, I don’t think it will be a very agile way to proceed. Am I right? Do you have any recommendation in such sense? Do I have any hope to compile DynFlags.hs in a way which would make this perf investigation feasible?

What I usually do in this case is just take the relevant `ghc` command line directly from the `make` output and execute it manually. I would imagine your debug cycle would look something like,

* instrument the compiler * build stage1 * use stage2 to build DynFlags using the stage1 compiler (using a saved command line) * think * repeat

This should only take a few minutes per iteration.

...
The second example (the highlighting-kate package) seems much more promising. It takes maybe 1-2 mins on my machine, which is enough to take a look at the perf output. Do you think I should follow this second lead? In principle any 50+ modules package I think would do (better if with a lot of TH ;) ) but this seems like a low-entry barrier start.

2. The second path I’m exploring is simply to take a less holistic approach and try to dive in into a performance ticket like the ones listed here: https://www.reddit.com/r/haskell/comments/45q90s/is_ anything_being_done_to_remedy_the_soul/czzq6an/ Maybe some are very specific, but it seems like fixing small things and move forward could help giving me understanding of different sub-parts of GHC, which seems less intimidating than the black-box approach.

Do you have any specific tickets from these lists that you found interesting?

...
In conclusion, what do you think is the best approach, 1 or 2, both or none? ;)

I would say that it largely depends upon what you feel most comfortable with. If you feel up for it, I think #9221 would be a nice, fairly self-contained, yet high-impact ticket which would be worth spending a few days diving further into.

Cheers,

- Ben

Simon Marlow

18 Apr 18 Apr

10:01 a.m.

Pretty-printing the asm is a likely contender for optimisation, however the problem is not the pretty-printing per se. We don't actually use any of the backtracking stuff when printing asm, since there's no point nicely indenting things or wrapping lines. The overhead is probably all in the layer of data structure that we generate in Pretty before it gets dumped into raw bytes. Using a ByteString Builder instead might yield some improvement. Cheers Simon On 17 April 2017 at 18:44, Alfredo Di Napoli wrote:

...

Dear all,

after sprinkling (ehm, littering) GHC source code with cost centres, I was not surprised to see that roughly 20% of the compilation time (as in .prof) was spent in the core gen/simplification process (10% of the total time) and on the asm code gen (another 10%).

I have almost immediately abandoned the idea of try optimising some modules in simplCore (considering my 0-knowledge of GHC internals anyway..) but I have been dwelling on the following: Outputable.hs and Pretty.hs seems to be have been implemented making deliberate use of lists and concatenations between them, which left me wondering if there was room for optimisation there. I have found this interesting paper on the topic:

https://www.cs.kent.ac.uk/pubs/2005/2062/content.pdf

Now, it’s totally possible that this has been already tried (with no success) but judging from the original copyright of Pretty.hs (dated 2001), it seems it was written prior to the work of Olaf Chitil (the author of the paper).

TL;DR I was thinking (even just as a fun exercise to learn more about GHC internals) to leverage the ideas of that paper and switch to a different implementation for `Doc` coupled with the use of lazy dequeues, which *might* increase the performances of the codegen and thus of the compiler overall. Am I fighting a strawman (or flogging a dead horse, pick your rethorical figure :D ) or is there even a tiny chance of this being actually useful?

Have a nice evening,

Alfredo

On 11 April 2017 at 00:47, Ben Gamari wrote:

...
Alfredo Di Napoli writes:

...
Hey Ben,

Hi Alfredo,

Sorry for the late response! The email queue from the weekend was a bit longer than I would like.

...
as promised I’m back to you with something more articulated and hopefully meaningful. I do hear you perfectly — probably trying to dive head-first into this without at least a rough understanding of the performance hotspots or the GHC overall architecture is going to do me more harm than good (I get the overall picture and I’m aware of the different stages of the GHC compilation pipeline, but it’s far from saying I’m proficient with the architecture as whole). I have also read a couple of years ago the GHC chapter on the “Architeture of Open Source Applications” book, but I don’t know how much that is still relevant. If it is, I guess I should refresh my memory.

It sounds like you have done a good amount of reading. That's great. Perhaps skimming the AOSA chapter again wouldn't hurt, but otherwise it's likely worthwhile diving in.

...
I’m currently trying to move on 2 fronts — please advice if I’m a fool flogging a dead horse or if I have any hope of getting anything done ;)

1. I’m trying to treat indeed the compiler as a black block (as you adviced) trying to build a sufficiently large program where GHC is not “as fast as I would like” (I know that’s a very lame definition of “slow”, hehe). In particular, I have built the stage2 compiler with the “prof” flavour as you suggested, and I have chosen 2 examples as a reference “benchmark” for performance; DynFlags.hs (which seems to have been mentioned multiple times as a GHC perf killer) and the highlighting-kate package as posted here: https://ghc.haskell.org/trac/ghc/ticket/9221 .

Indeed, #9221 would be a very interesting ticket to look at. The highlighting-kate package is interesting in the context of that ticket as it has a very large amount of parallelism available.

If you do want to look at #9221, note that the cost centre profiler may not provide the whole story. In particular, it has been speculated that the scaling issues may be due to either,

* threads hitting a blackhole, resulting in blocking

* the usual scaling limitations of GHC's stop-the-world GC

The eventlog may be quite useful for characterising these.

...
The idea would be to compile those with -v +RTS -p -hc -RTS enabled, look at the output from the .prof file AND the `-v` flag, find any hotspot, try to change something, recompile, observe diff, rinse and repeat. Do you think I have any hope of making progress this way? In particular, I think compiling DynFlags.hs is a bit of a dead-end; I whipped up this buggy script which escalated into a Behemoth which is compiling pretty much half of the compiler once again :D

``` #!/usr/bin/env bash

../ghc/inplace/bin/ghc-stage2 --make -j8 -v +RTS -A256M -qb0 -p -h \ -RTS -DSTAGE=2 -I../ghc/includes -I../ghc/compiler -I../ghc/compiler/stage2 \ -I../ghc/compiler/stage2/build \ -i../ghc/compiler/utils:../ghc/compiler/types:../ghc/compile r/typecheck:../ghc/compiler/basicTypes \ -i../ghc/compiler/main:../ghc/compiler/profiling:../ghc/comp iler/coreSyn:../ghc/compiler/iface:../ghc/compiler/prelude \ -i../ghc/compiler/stage2/build:../ghc/compiler/simplStg:../ ghc/compiler/cmm:../ghc/compiler/parser:../ghc/compiler/hsSyn \ -i../ghc/compiler/ghci:../ghc/compiler/deSugar:../ghc/compil er/simplCore:../ghc/compile/specialise \ -fforce-recomp -c $@ ```

I’m running it with `./dynflags.sh ../ghc/compiler/main/DynFlags.hs` but it’s taking a lot to compile (20+ mins on my 2014 mac Pro) because it’s pulling in half of the compiler anyway :D I tried to reuse the .hi files from my stage2 compilation but I failed (GHC was complaining about interface file mismatch). Short story short, I don’t think it will be a very agile way to proceed. Am I right? Do you have any recommendation in such sense? Do I have any hope to compile DynFlags.hs in a way which would make this perf investigation feasible?

What I usually do in this case is just take the relevant `ghc` command line directly from the `make` output and execute it manually. I would imagine your debug cycle would look something like,

* instrument the compiler * build stage1 * use stage2 to build DynFlags using the stage1 compiler (using a saved command line) * think * repeat

This should only take a few minutes per iteration.

...
The second example (the highlighting-kate package) seems much more promising. It takes maybe 1-2 mins on my machine, which is enough to take a look at the perf output. Do you think I should follow this second lead? In principle any 50+ modules package I think would do (better if with a lot of TH ;) ) but this seems like a low-entry barrier start.

2. The second path I’m exploring is simply to take a less holistic approach and try to dive in into a performance ticket like the ones listed here: https://www.reddit.com/r/haskell/comments/45q90s/is_anything _being_done_to_remedy_the_soul/czzq6an/ Maybe some are very specific, but it seems like fixing small things and move forward could help giving me understanding of different sub-parts of GHC, which seems less intimidating than the black-box approach.

Do you have any specific tickets from these lists that you found interesting?

...
In conclusion, what do you think is the best approach, 1 or 2, both or none? ;)

I would say that it largely depends upon what you feel most comfortable with. If you feel up for it, I think #9221 would be a nice, fairly self-contained, yet high-impact ticket which would be worth spending a few days diving further into.

Cheers,

- Ben

_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Alfredo Di Napoli

10:17 a.m.

Hey Simon, thanks for chiming in. I had the same suspect as well, and in fact my first temptation was to use dlists to avoid costly concatenation (I guess a Builder shares pretty much the same idea, which is avoid right-appends). I eventually bailed out as that Pretty module had some functions like sepX or fillNBE (I might be wrong, going by memory only here) which needed to *inspect* the current [Doc] we were carrying around, and thus dlists couldn’t accomplish this in O(1), but forcing it back to a list was necessary. Am I right in thinking that using a Builder here will suffer the same malady? Ideally I would like constant time for both appends, left concat & inspection (as in pattern-matching inspection) but I don’t think what I’m asking exists, at least not in its functional declination anyway ;) That’s why I was thinking to give Deques (or more generally, Sequences) a go: they don’t have O(1) appends but at least can be inspected in O(1). Question has to be answered whether this will yield improvement or not, which I guess depends upon the ratio of inspection / appending we do in the pretty printing. In that case using dlists or builders might be better. Last but not least, although the current implementation doesn’t backtrack anyway, I’m wondering wether switching to a different representation for a Doc (namely a huge function composition of Token(s), as described in the paper) could be beneficial as well. Do you guys have any opinions? Yesterday I extracted Pretty.hs from the sourcetree and I’m now planning to produce a criterion benchmark and compare different implementations, althought it’s a bit hard to predict the real world usage as I don’t have access to a representative Doc document as produced by GHC, so my benchs could be all ill-founded. Alfredo On 18 April 2017 at 12:01, Simon Marlow wrote:

...

Pretty-printing the asm is a likely contender for optimisation, however the problem is not the pretty-printing per se. We don't actually use any of the backtracking stuff when printing asm, since there's no point nicely indenting things or wrapping lines. The overhead is probably all in the layer of data structure that we generate in Pretty before it gets dumped into raw bytes. Using a ByteString Builder instead might yield some improvement.

Cheers Simon

On 17 April 2017 at 18:44, Alfredo Di Napoli wrote:

...
Dear all,

after sprinkling (ehm, littering) GHC source code with cost centres, I was not surprised to see that roughly 20% of the compilation time (as in .prof) was spent in the core gen/simplification process (10% of the total time) and on the asm code gen (another 10%).

I have almost immediately abandoned the idea of try optimising some modules in simplCore (considering my 0-knowledge of GHC internals anyway..) but I have been dwelling on the following: Outputable.hs and Pretty.hs seems to be have been implemented making deliberate use of lists and concatenations between them, which left me wondering if there was room for optimisation there. I have found this interesting paper on the topic:

https://www.cs.kent.ac.uk/pubs/2005/2062/content.pdf

Now, it’s totally possible that this has been already tried (with no success) but judging from the original copyright of Pretty.hs (dated 2001), it seems it was written prior to the work of Olaf Chitil (the author of the paper).

TL;DR I was thinking (even just as a fun exercise to learn more about GHC internals) to leverage the ideas of that paper and switch to a different implementation for `Doc` coupled with the use of lazy dequeues, which *might* increase the performances of the codegen and thus of the compiler overall. Am I fighting a strawman (or flogging a dead horse, pick your rethorical figure :D ) or is there even a tiny chance of this being actually useful?

Have a nice evening,

Alfredo

On 11 April 2017 at 00:47, Ben Gamari wrote:

...
Alfredo Di Napoli writes:

...
Hey Ben,

Hi Alfredo,

Sorry for the late response! The email queue from the weekend was a bit longer than I would like.

...
as promised I’m back to you with something more articulated and hopefully meaningful. I do hear you perfectly — probably trying to dive head-first into this without at least a rough understanding of the performance hotspots or the GHC overall architecture is going to do me more harm than good (I get the overall picture and I’m aware of the different stages of the GHC compilation pipeline, but it’s far from saying I’m proficient with the architecture as whole). I have also read a couple of years ago the GHC chapter on the “Architeture of Open Source Applications” book, but I don’t know how much that is still relevant. If it is, I guess I should refresh my memory.

It sounds like you have done a good amount of reading. That's great. Perhaps skimming the AOSA chapter again wouldn't hurt, but otherwise it's likely worthwhile diving in.

...
I’m currently trying to move on 2 fronts — please advice if I’m a fool flogging a dead horse or if I have any hope of getting anything done ;)

1. I’m trying to treat indeed the compiler as a black block (as you adviced) trying to build a sufficiently large program where GHC is not “as fast as I would like” (I know that’s a very lame definition of “slow”, hehe). In particular, I have built the stage2 compiler with the “prof” flavour as you suggested, and I have chosen 2 examples as a reference “benchmark” for performance; DynFlags.hs (which seems to have been mentioned multiple times as a GHC perf killer) and the highlighting-kate package as posted here: https://ghc.haskell.org/trac/ghc/ticket/9221 .

Indeed, #9221 would be a very interesting ticket to look at. The highlighting-kate package is interesting in the context of that ticket as it has a very large amount of parallelism available.

If you do want to look at #9221, note that the cost centre profiler may not provide the whole story. In particular, it has been speculated that the scaling issues may be due to either,

* threads hitting a blackhole, resulting in blocking

* the usual scaling limitations of GHC's stop-the-world GC

The eventlog may be quite useful for characterising these.

...
The idea would be to compile those with -v +RTS -p -hc -RTS enabled, look at the output from the .prof file AND the `-v` flag, find any hotspot, try to change something, recompile, observe diff, rinse and repeat. Do you think I have any hope of making progress this way? In particular, I think compiling DynFlags.hs is a bit of a dead-end; I whipped up this buggy script which escalated into a Behemoth which is compiling pretty much half of the compiler once again :D

``` #!/usr/bin/env bash

../ghc/inplace/bin/ghc-stage2 --make -j8 -v +RTS -A256M -qb0 -p -h \ -RTS -DSTAGE=2 -I../ghc/includes -I../ghc/compiler -I../ghc/compiler/stage2 \ -I../ghc/compiler/stage2/build \ -i../ghc/compiler/utils:../ghc/compiler/types:../ghc/compile r/typecheck:../ghc/compiler/basicTypes \ -i../ghc/compiler/main:../ghc/compiler/profiling:../ghc/comp iler/coreSyn:../ghc/compiler/iface:../ghc/compiler/prelude \ -i../ghc/compiler/stage2/build:../ghc/compiler/simplStg:../g hc/compiler/cmm:../ghc/compiler/parser:../ghc/compiler/hsSyn \ -i../ghc/compiler/ghci:../ghc/compiler/deSugar:../ghc/compil er/simplCore:../ghc/compile/specialise \ -fforce-recomp -c $@ ```

I’m running it with `./dynflags.sh ../ghc/compiler/main/DynFlags.hs` but it’s taking a lot to compile (20+ mins on my 2014 mac Pro) because it’s pulling in half of the compiler anyway :D I tried to reuse the .hi files from my stage2 compilation but I failed (GHC was complaining about interface file mismatch). Short story short, I don’t think it will be a very agile way to proceed. Am I right? Do you have any recommendation in such sense? Do I have any hope to compile DynFlags.hs in a way which would make this perf investigation feasible?

What I usually do in this case is just take the relevant `ghc` command line directly from the `make` output and execute it manually. I would imagine your debug cycle would look something like,

* instrument the compiler * build stage1 * use stage2 to build DynFlags using the stage1 compiler (using a saved command line) * think * repeat

This should only take a few minutes per iteration.

...
The second example (the highlighting-kate package) seems much more promising. It takes maybe 1-2 mins on my machine, which is enough to take a look at the perf output. Do you think I should follow this second lead? In principle any 50+ modules package I think would do (better if with a lot of TH ;) ) but this seems like a low-entry barrier start.

2. The second path I’m exploring is simply to take a less holistic approach and try to dive in into a performance ticket like the ones listed here: https://www.reddit.com/r/haskell/comments/45q90s/is_anything _being_done_to_remedy_the_soul/czzq6an/ Maybe some are very specific, but it seems like fixing small things and move forward could help giving me understanding of different sub-parts of GHC, which seems less intimidating than the black-box approach.

Do you have any specific tickets from these lists that you found interesting?

...
In conclusion, what do you think is the best approach, 1 or 2, both or none? ;)

I would say that it largely depends upon what you feel most comfortable with. If you feel up for it, I think #9221 would be a nice, fairly self-contained, yet high-impact ticket which would be worth spending a few days diving further into.

Cheers,

- Ben

_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Bartosz Nitka

11:01 a.m.

Hey Alfredo, Thanks for taking a look at this. I've taken a look at this before and collected some notes here: https://github.com/niteria/notes/blob/master/PrettyPrintingGHC.md#problems Based on my investigations I concluded that there are places where pretty-printing Asm and Cmm gets accidentally quadratic. If I remember correctly the core of the problem was that even in the "fast" mode the pretty printer was computing the lengths of subtrees in `reduceDoc`, which in some cases made the operations quadratic. I've tried implementing a "just append bytes"-mode without `reduceDoc`, but what stopped me was my lack of understanding of different semantics for 3 kinds of empty Doc's. I think if you took the time to understand it, you could implement it in a way that's not quadratic, reaping a substantial performance win. Cheers, Bartosz 2017-04-18 11:17 GMT+01:00 Alfredo Di Napoli :

...

Hey Simon,

thanks for chiming in. I had the same suspect as well, and in fact my first temptation was to use dlists to avoid costly concatenation (I guess a Builder shares pretty much the same idea, which is avoid right-appends). I eventually bailed out as that Pretty module had some functions like sepX or fillNBE (I might be wrong, going by memory only here) which needed to *inspect* the current [Doc] we were carrying around, and thus dlists couldn’t accomplish this in O(1), but forcing it back to a list was necessary. Am I right in thinking that using a Builder here will suffer the same malady? Ideally I would like constant time for both appends, left concat & inspection (as in pattern-matching inspection) but I don’t think what I’m asking exists, at least not in its functional declination anyway ;)

That’s why I was thinking to give Deques (or more generally, Sequences) a go: they don’t have O(1) appends but at least can be inspected in O(1). Question has to be answered whether this will yield improvement or not, which I guess depends upon the ratio of inspection / appending we do in the pretty printing. In that case using dlists or builders might be better. Last but not least, although the current implementation doesn’t backtrack anyway, I’m wondering wether switching to a different representation for a Doc (namely a huge function composition of Token(s), as described in the paper) could be beneficial as well.

Do you guys have any opinions? Yesterday I extracted Pretty.hs from the sourcetree and I’m now planning to produce a criterion benchmark and compare different implementations, althought it’s a bit hard to predict the real world usage as I don’t have access to a representative Doc document as produced by GHC, so my benchs could be all ill-founded.

Alfredo

On 18 April 2017 at 12:01, Simon Marlow wrote:

...
Pretty-printing the asm is a likely contender for optimisation, however the problem is not the pretty-printing per se. We don't actually use any of the backtracking stuff when printing asm, since there's no point nicely indenting things or wrapping lines. The overhead is probably all in the layer of data structure that we generate in Pretty before it gets dumped into raw bytes. Using a ByteString Builder instead might yield some improvement.

Cheers Simon

On 17 April 2017 at 18:44, Alfredo Di Napoli wrote:

...
Dear all,

after sprinkling (ehm, littering) GHC source code with cost centres, I was not surprised to see that roughly 20% of the compilation time (as in .prof) was spent in the core gen/simplification process (10% of the total time) and on the asm code gen (another 10%).

I have almost immediately abandoned the idea of try optimising some modules in simplCore (considering my 0-knowledge of GHC internals anyway..) but I have been dwelling on the following: Outputable.hs and Pretty.hs seems to be have been implemented making deliberate use of lists and concatenations between them, which left me wondering if there was room for optimisation there. I have found this interesting paper on the topic:

https://www.cs.kent.ac.uk/pubs/2005/2062/content.pdf

Now, it’s totally possible that this has been already tried (with no success) but judging from the original copyright of Pretty.hs (dated 2001), it seems it was written prior to the work of Olaf Chitil (the author of the paper).

TL;DR I was thinking (even just as a fun exercise to learn more about GHC internals) to leverage the ideas of that paper and switch to a different implementation for `Doc` coupled with the use of lazy dequeues, which *might* increase the performances of the codegen and thus of the compiler overall. Am I fighting a strawman (or flogging a dead horse, pick your rethorical figure :D ) or is there even a tiny chance of this being actually useful?

Have a nice evening,

Alfredo

On 11 April 2017 at 00:47, Ben Gamari wrote:

...
Alfredo Di Napoli writes:

...
Hey Ben,

Hi Alfredo,

Sorry for the late response! The email queue from the weekend was a bit longer than I would like.

...
as promised I’m back to you with something more articulated and hopefully meaningful. I do hear you perfectly — probably trying to dive head-first into this without at least a rough understanding of the performance hotspots or the GHC overall architecture is going to do me more harm than good (I get the overall picture and I’m aware of the different stages of the GHC compilation pipeline, but it’s far from saying I’m proficient with the architecture as whole). I have also read a couple of years ago the GHC chapter on the “Architeture of Open Source Applications” book, but I don’t know how much that is still relevant. If it is, I guess I should refresh my memory.

It sounds like you have done a good amount of reading. That's great. Perhaps skimming the AOSA chapter again wouldn't hurt, but otherwise it's likely worthwhile diving in.

...
I’m currently trying to move on 2 fronts — please advice if I’m a fool flogging a dead horse or if I have any hope of getting anything done ;)

1. I’m trying to treat indeed the compiler as a black block (as you adviced) trying to build a sufficiently large program where GHC is not “as fast as I would like” (I know that’s a very lame definition of “slow”, hehe). In particular, I have built the stage2 compiler with the “prof” flavour as you suggested, and I have chosen 2 examples as a reference “benchmark” for performance; DynFlags.hs (which seems to have been mentioned multiple times as a GHC perf killer) and the highlighting-kate package as posted here: https://ghc.haskell.org/trac/ghc/ticket/9221 .

Indeed, #9221 would be a very interesting ticket to look at. The highlighting-kate package is interesting in the context of that ticket as it has a very large amount of parallelism available.

If you do want to look at #9221, note that the cost centre profiler may not provide the whole story. In particular, it has been speculated that the scaling issues may be due to either,

* threads hitting a blackhole, resulting in blocking

* the usual scaling limitations of GHC's stop-the-world GC

The eventlog may be quite useful for characterising these.

...
The idea would be to compile those with -v +RTS -p -hc -RTS enabled, look at the output from the .prof file AND the `-v` flag, find any hotspot, try to change something, recompile, observe diff, rinse and repeat. Do you think I have any hope of making progress this way? In particular, I think compiling DynFlags.hs is a bit of a dead-end; I whipped up this buggy script which escalated into a Behemoth which is compiling pretty much half of the compiler once again :D

``` #!/usr/bin/env bash

../ghc/inplace/bin/ghc-stage2 --make -j8 -v +RTS -A256M -qb0 -p -h \ -RTS -DSTAGE=2 -I../ghc/includes -I../ghc/compiler -I../ghc/compiler/stage2 \ -I../ghc/compiler/stage2/build \ -i../ghc/compiler/utils:../ghc/compiler/types:../ghc/compile r/typecheck:../ghc/compiler/basicTypes \ -i../ghc/compiler/main:../ghc/compiler/profiling:../ghc/comp iler/coreSyn:../ghc/compiler/iface:../ghc/compiler/prelude \ -i../ghc/compiler/stage2/build:../ghc/compiler/simplStg:../g hc/compiler/cmm:../ghc/compiler/parser:../ghc/compiler/hsSyn \ -i../ghc/compiler/ghci:../ghc/compiler/deSugar:../ghc/compil er/simplCore:../ghc/compile/specialise \ -fforce-recomp -c $@ ```

I’m running it with `./dynflags.sh ../ghc/compiler/main/DynFlags.hs` but it’s taking a lot to compile (20+ mins on my 2014 mac Pro) because it’s pulling in half of the compiler anyway :D I tried to reuse the .hi files from my stage2 compilation but I failed (GHC was complaining about interface file mismatch). Short story short, I don’t think it will be a very agile way to proceed. Am I right? Do you have any recommendation in such sense? Do I have any hope to compile DynFlags.hs in a way which would make this perf investigation feasible?

What I usually do in this case is just take the relevant `ghc` command line directly from the `make` output and execute it manually. I would imagine your debug cycle would look something like,

* instrument the compiler * build stage1 * use stage2 to build DynFlags using the stage1 compiler (using a saved command line) * think * repeat

This should only take a few minutes per iteration.

...
The second example (the highlighting-kate package) seems much more promising. It takes maybe 1-2 mins on my machine, which is enough to take a look at the perf output. Do you think I should follow this second lead? In principle any 50+ modules package I think would do (better if with a lot of TH ;) ) but this seems like a low-entry barrier start.

2. The second path I’m exploring is simply to take a less holistic approach and try to dive in into a performance ticket like the ones listed here: https://www.reddit.com/r/haskell/comments/45q90s/is_anything _being_done_to_remedy_the_soul/czzq6an/ Maybe some are very specific, but it seems like fixing small things and move forward could help giving me understanding of different sub-parts of GHC, which seems less intimidating than the black-box approach.

Do you have any specific tickets from these lists that you found interesting?

...
In conclusion, what do you think is the best approach, 1 or 2, both or none? ;)

I would say that it largely depends upon what you feel most comfortable with. If you feel up for it, I think #9221 would be a nice, fairly self-contained, yet high-impact ticket which would be worth spending a few days diving further into.

Cheers,

- Ben

_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Alfredo Di Napoli

11:12 a.m.

Hey Bartosz, thanks for that, looks like I won the jackpot today ;) Sounds like your notes are an excellent starting point. Will try to get a better understanding of that module myself, to see if we can reap some perf win here, but the fact you “have been there before” looks like this is a promising path to take. Cheers! Alfredo On 18 April 2017 at 13:01, Bartosz Nitka wrote:

...

Hey Alfredo,

Thanks for taking a look at this. I've taken a look at this before and collected some notes here: https://github.com/niteria/notes/blob/master/ PrettyPrintingGHC.md#problems Based on my investigations I concluded that there are places where pretty-printing Asm and Cmm gets accidentally quadratic. If I remember correctly the core of the problem was that even in the "fast" mode the pretty printer was computing the lengths of subtrees in `reduceDoc`, which in some cases made the operations quadratic. I've tried implementing a "just append bytes"-mode without `reduceDoc`, but what stopped me was my lack of understanding of different semantics for 3 kinds of empty Doc's. I think if you took the time to understand it, you could implement it in a way that's not quadratic, reaping a substantial performance win.

Cheers, Bartosz

2017-04-18 11:17 GMT+01:00 Alfredo Di Napoli :

...
Hey Simon,

thanks for chiming in. I had the same suspect as well, and in fact my first temptation was to use dlists to avoid costly concatenation (I guess a Builder shares pretty much the same idea, which is avoid right-appends). I eventually bailed out as that Pretty module had some functions like sepX or fillNBE (I might be wrong, going by memory only here) which needed to *inspect* the current [Doc] we were carrying around, and thus dlists couldn’t accomplish this in O(1), but forcing it back to a list was necessary. Am I right in thinking that using a Builder here will suffer the same malady? Ideally I would like constant time for both appends, left concat & inspection (as in pattern-matching inspection) but I don’t think what I’m asking exists, at least not in its functional declination anyway ;)

That’s why I was thinking to give Deques (or more generally, Sequences) a go: they don’t have O(1) appends but at least can be inspected in O(1). Question has to be answered whether this will yield improvement or not, which I guess depends upon the ratio of inspection / appending we do in the pretty printing. In that case using dlists or builders might be better. Last but not least, although the current implementation doesn’t backtrack anyway, I’m wondering wether switching to a different representation for a Doc (namely a huge function composition of Token(s), as described in the paper) could be beneficial as well.

Do you guys have any opinions? Yesterday I extracted Pretty.hs from the sourcetree and I’m now planning to produce a criterion benchmark and compare different implementations, althought it’s a bit hard to predict the real world usage as I don’t have access to a representative Doc document as produced by GHC, so my benchs could be all ill-founded.

Alfredo

On 18 April 2017 at 12:01, Simon Marlow wrote:

...
Pretty-printing the asm is a likely contender for optimisation, however the problem is not the pretty-printing per se. We don't actually use any of the backtracking stuff when printing asm, since there's no point nicely indenting things or wrapping lines. The overhead is probably all in the layer of data structure that we generate in Pretty before it gets dumped into raw bytes. Using a ByteString Builder instead might yield some improvement.

Cheers Simon

On 17 April 2017 at 18:44, Alfredo Di Napoli
...
wrote:

...
Dear all,

after sprinkling (ehm, littering) GHC source code with cost centres, I was not surprised to see that roughly 20% of the compilation time (as in .prof) was spent in the core gen/simplification process (10% of the total time) and on the asm code gen (another 10%).

I have almost immediately abandoned the idea of try optimising some modules in simplCore (considering my 0-knowledge of GHC internals anyway..) but I have been dwelling on the following: Outputable.hs and Pretty.hs seems to be have been implemented making deliberate use of lists and concatenations between them, which left me wondering if there was room for optimisation there. I have found this interesting paper on the topic:

https://www.cs.kent.ac.uk/pubs/2005/2062/content.pdf

Now, it’s totally possible that this has been already tried (with no success) but judging from the original copyright of Pretty.hs (dated 2001), it seems it was written prior to the work of Olaf Chitil (the author of the paper).

TL;DR I was thinking (even just as a fun exercise to learn more about GHC internals) to leverage the ideas of that paper and switch to a different implementation for `Doc` coupled with the use of lazy dequeues, which *might* increase the performances of the codegen and thus of the compiler overall. Am I fighting a strawman (or flogging a dead horse, pick your rethorical figure :D ) or is there even a tiny chance of this being actually useful?

Have a nice evening,

Alfredo

On 11 April 2017 at 00:47, Ben Gamari wrote:

...
Alfredo Di Napoli writes:

...
Hey Ben,

Hi Alfredo,

Sorry for the late response! The email queue from the weekend was a bit longer than I would like.

...
as promised I’m back to you with something more articulated and hopefully meaningful. I do hear you perfectly — probably trying to dive head-first into this without at least a rough understanding of the performance hotspots or the GHC overall architecture is going to do me more harm than good (I get the overall picture and I’m aware of the different stages of the GHC compilation pipeline, but it’s far from saying I’m proficient with the architecture as whole). I have also read a couple of years ago the GHC chapter on the “Architeture of Open Source Applications” book, but I don’t know how much that is still relevant. If it is, I guess I should refresh my memory.

It sounds like you have done a good amount of reading. That's great. Perhaps skimming the AOSA chapter again wouldn't hurt, but otherwise it's likely worthwhile diving in.

...
I’m currently trying to move on 2 fronts — please advice if I’m a fool flogging a dead horse or if I have any hope of getting anything done ;)

1. I’m trying to treat indeed the compiler as a black block (as you adviced) trying to build a sufficiently large program where GHC is not “as fast as I would like” (I know that’s a very lame definition of “slow”, hehe). In particular, I have built the stage2 compiler with the “prof” flavour as you suggested, and I have chosen 2 examples as a reference “benchmark” for performance; DynFlags.hs (which seems to have been mentioned multiple times as a GHC perf killer) and the highlighting-kate package as posted here: https://ghc.haskell.org/trac/ghc/ticket/9221 .

Indeed, #9221 would be a very interesting ticket to look at. The highlighting-kate package is interesting in the context of that ticket as it has a very large amount of parallelism available.

If you do want to look at #9221, note that the cost centre profiler may not provide the whole story. In particular, it has been speculated that the scaling issues may be due to either,

* threads hitting a blackhole, resulting in blocking

* the usual scaling limitations of GHC's stop-the-world GC

The eventlog may be quite useful for characterising these.

...
The idea would be to compile those with -v +RTS -p -hc -RTS enabled, look at the output from the .prof file AND the `-v` flag, find any hotspot, try to change something, recompile, observe diff, rinse and repeat. Do you think I have any hope of making progress this way? In particular, I think compiling DynFlags.hs is a bit of a dead-end; I whipped up this buggy script which escalated into a Behemoth which is compiling pretty much half of the compiler once again :D

``` #!/usr/bin/env bash

../ghc/inplace/bin/ghc-stage2 --make -j8 -v +RTS -A256M -qb0 -p -h \ -RTS -DSTAGE=2 -I../ghc/includes -I../ghc/compiler -I../ghc/compiler/stage2 \ -I../ghc/compiler/stage2/build \ -i../ghc/compiler/utils:../ghc/compiler/types:../ghc/compile r/typecheck:../ghc/compiler/basicTypes \ -i../ghc/compiler/main:../ghc/compiler/profiling:../ghc/comp iler/coreSyn:../ghc/compiler/iface:../ghc/compiler/prelude \ -i../ghc/compiler/stage2/build:../ghc/compiler/simplStg:../g hc/compiler/cmm:../ghc/compiler/parser:../ghc/compiler/hsSyn \ -i../ghc/compiler/ghci:../ghc/compiler/deSugar:../ghc/compil er/simplCore:../ghc/compile/specialise \ -fforce-recomp -c $@ ```

I’m running it with `./dynflags.sh ../ghc/compiler/main/DynFlags.hs` but it’s taking a lot to compile (20+ mins on my 2014 mac Pro) because it’s pulling in half of the compiler anyway :D I tried to reuse the .hi files from my stage2 compilation but I failed (GHC was complaining about interface file mismatch). Short story short, I don’t think it will be a very agile way to proceed. Am I right? Do you have any recommendation in such sense? Do I have any hope to compile DynFlags.hs in a way which would make this perf investigation feasible?

What I usually do in this case is just take the relevant `ghc` command line directly from the `make` output and execute it manually. I would imagine your debug cycle would look something like,

* instrument the compiler * build stage1 * use stage2 to build DynFlags using the stage1 compiler (using a saved command line) * think * repeat

This should only take a few minutes per iteration.

...
The second example (the highlighting-kate package) seems much more promising. It takes maybe 1-2 mins on my machine, which is enough to take a look at the perf output. Do you think I should follow this second lead? In principle any 50+ modules package I think would do (better if with a lot of TH ;) ) but this seems like a low-entry barrier start.

2. The second path I’m exploring is simply to take a less holistic approach and try to dive in into a performance ticket like the ones listed here: https://www.reddit.com/r/haskell/comments/45q90s/is_anything _being_done_to_remedy_the_soul/czzq6an/ Maybe some are very specific, but it seems like fixing small things and move forward could help giving me understanding of different sub-parts of GHC, which seems less intimidating than the black-box approach.

Do you have any specific tickets from these lists that you found interesting?

...
In conclusion, what do you think is the best approach, 1 or 2, both or none? ;)

I would say that it largely depends upon what you feel most comfortable with. If you feel up for it, I think #9221 would be a nice, fairly self-contained, yet high-impact ticket which would be worth spending a few days diving further into.

Cheers,

- Ben

_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Ben Gamari

12:21 p.m.

Alfredo Di Napoli writes:

...

Hey Simon,

thanks for chiming in. I had the same suspect as well, and in fact my first temptation was to use dlists to avoid costly concatenation (I guess a Builder shares pretty much the same idea, which is avoid right-appends). I eventually bailed out as that Pretty module had some functions like sepX or fillNBE (I might be wrong, going by memory only here) which needed to *inspect* the current [Doc] we were carrying around, and thus dlists couldn’t accomplish this in O(1), but forcing it back to a list was necessary. Am I right in thinking that using a Builder here will suffer the same malady? Ideally I would like constant time for both appends, left concat & inspection (as in pattern-matching inspection) but I don’t think what I’m asking exists, at least not in its functional declination anyway ;)

That’s why I was thinking to give Deques (or more generally, Sequences) a go: they don’t have O(1) appends but at least can be inspected in O(1). Question has to be answered whether this will yield improvement or not, which I guess depends upon the ratio of inspection / appending we do in the pretty printing. In that case using dlists or builders might be better. Last but not least, although the current implementation doesn’t backtrack anyway, I’m wondering wether switching to a different representation for a Doc (namely a huge function composition of Token(s), as described in the paper) could be beneficial as well.

Do you guys have any opinions? Yesterday I extracted Pretty.hs from the sourcetree and I’m now planning to produce a criterion benchmark and compare different implementations, althought it’s a bit hard to predict the real world usage as I don’t have access to a representative Doc document as produced by GHC, so my benchs could be all ill-founded.

Note that GHC's `Pretty` module is just a slightly modified version of the `pretty` package. The differences are fairly minimal (although important for performance): * It uses FastString in place of String, giving us fast `length` (in https://ghc.haskell.org/trac/ghc/ticket/8809#comment:60 I propose that we extend `pretty` with a typeclass for string types) * GHC's variant still has a known stack overflow bug that was fixed upstream. Unfortunately, compiler performance regressed when we attempted to port upstream's patch (#10735) Ideally we would fix these and just use the `pretty` library itself. In addition to these issues, it would be quite helpful if `pretty` gained a special-case for the infinite band-width case (which is what we use in the code generator). The point here is that we should need to do no layout computation in the infinite band case: merely place line breaks precisely where the user asks. This should result in a noticeable improvement in code generation performance (IIRC Bartosz noted rather significant amounts of time spent pretty-printing assembler). Cheers, - Ben

Alfredo Di Napoli

12:39 p.m.

Hey Ben, thanks for that. I didn’t realise I opened pretty much the Pandora’s box, hehe. I have now found (whilst reading Bartosz’s notes) the numerous tickets & discussions around the library, but what you say in this email neatly summarise the crux of the problem, as far as I could see. Rather than saying silly things, I would rather spend a bit of time reading everything that has been written by folks way smarter than me on the subject and get back to you folks later ;) Alfredo On 18 April 2017 at 14:21, Ben Gamari wrote:

...

Alfredo Di Napoli writes:

...
Hey Simon,

thanks for chiming in. I had the same suspect as well, and in fact my first temptation was to use dlists to avoid costly concatenation (I guess a Builder shares pretty much the same idea, which is avoid right-appends). I eventually bailed out as that Pretty module had some functions like sepX or fillNBE (I might be wrong, going by memory only here) which needed to *inspect* the current [Doc] we were carrying around, and thus dlists couldn’t accomplish this in O(1), but forcing it back to a list was necessary. Am I right in thinking that using a Builder here will suffer the same malady? Ideally I would like constant time for both appends, left concat & inspection (as in pattern-matching inspection) but I don’t think what I’m asking exists, at least not in its functional declination anyway ;)

That’s why I was thinking to give Deques (or more generally, Sequences) a go: they don’t have O(1) appends but at least can be inspected in O(1). Question has to be answered whether this will yield improvement or not, which I guess depends upon the ratio of inspection / appending we do in the pretty printing. In that case using dlists or builders might be better. Last but not least, although the current implementation doesn’t backtrack anyway, I’m wondering wether switching to a different representation for a Doc (namely a huge function composition of Token(s), as described in the paper) could be beneficial as well.

Do you guys have any opinions? Yesterday I extracted Pretty.hs from the sourcetree and I’m now planning to produce a criterion benchmark and compare different implementations, althought it’s a bit hard to predict the real world usage as I don’t have access to a representative Doc document as produced by GHC, so my benchs could be all ill-founded.

Note that GHC's `Pretty` module is just a slightly modified version of the `pretty` package. The differences are fairly minimal (although important for performance):

* It uses FastString in place of String, giving us fast `length` (in https://ghc.haskell.org/trac/ghc/ticket/8809#comment:60 I propose that we extend `pretty` with a typeclass for string types)

* GHC's variant still has a known stack overflow bug that was fixed upstream. Unfortunately, compiler performance regressed when we attempted to port upstream's patch (#10735)

Ideally we would fix these and just use the `pretty` library itself.

In addition to these issues, it would be quite helpful if `pretty` gained a special-case for the infinite band-width case (which is what we use in the code generator). The point here is that we should need to do no layout computation in the infinite band case: merely place line breaks precisely where the user asks. This should result in a noticeable improvement in code generation performance (IIRC Bartosz noted rather significant amounts of time spent pretty-printing assembler).

Cheers,

- Ben

Niklas Hambüchen

9 Apr 9 Apr

11:54 p.m.

I have some suggestions for low hanging fruits in this effort. 1. Make ghc print more statistics on what it spending time on When I did the linking investigation recently (https://www.reddit.com/r/haskell/comments/63y43y/liked_linking_3x_faster_wit...) I noticed (with strace) that there are lots of interesting syscalls being made that you might not expect. For example, each time TH is used, shared libraries are loaded, and to determine the shared library paths, ghc shells out to `gcc --print-file-name`. Each such invocation takes 20 ms on my system, and I have 1000 invocations in my build. That's 20 seconds (out of 2 minutes build time) just asking gcc for paths. I recommend that for every call to an external GHC measures how long that call took, so that it can be asked to print a summary when it's done. That might give us lots of interesting things to optimize. For example, This would have made the long linker times totally obvious. At the end, I would love to know for each compilation (both one-shot as used in ghc's build system, and `ghc --make`): * What programs did it invoke and how long did they take * What files did it read and how long did that take * How long did it take to read all the `.hi` files in `ghc --make` * High level time summary (parsing, typechecking, codegen, .hi files, etc) That way we'll know at least what is slow, and don't have to resort to strace every time in order to obtain this basic answer. 2. Investigate if idiotic syscalls are being done and how much There's this concept I call "idiotic syscalls", which are syscalls of which you know from before that they won't contribute anything productive. For example, if you give a linker N many `-L` flags (library dirs) and M many `-l` flags (library names to link), it will try to `stat()` or `open()` N*M many files, out of which most are total rubbish, because we typically know what library is in what dir. Example: You pass `-L/usr/lib/opencv -L/usr/lib/imagemagick -L/usr/lib/blas -lopencv -limagemagick -lblas`. Then you you will get things like `open("/usr/lib/opencv/libimagemagick.so") = ENOENT` which makes no sense and obviously that file doesn't exist. This is a problem with the general "search path" concept; same happens for running executables searching through $PATH. Yes, nonexistent file opens fail fast, but in my typical ghc invocation I get millions of them (and we should at least measure how much time is wasted on them), and they clutter the strace output and make the real problems harder to investigate. We should check if we can create ways to give pass those files that do exist. 3. Add pure TemplateHaskell It is well known that TH is a problem for incremental compilation because it can have side effects and we must therefore be more conservative about when to recompile; when you see a `[TH]` in your `ghc --make` output, it's likely that time again. I believe this could be avoided by adding a variant of TH that forbids the use of the `runIO` function, and can thus not have side effects. Most TH does not need side effects, for example any form of code generation based on other data types (lenses, instances for whatever). If that was made "pure TH", we would not have to recompile when inputs to our TH functions change. Potentially this could even be determined automatically instead of adding a new variant of TH like was done for typed TH `$$()`, simply by inspecting what's in the TH and if we can decide there's no `runIO` in there, mark it as clean, otherwise as tainted. 4. Build ghc with `ghc --make` if possible This one might be controversial or impossible (others can likely tell us). Most Haskell code is built with `ghc --make`, not with the one-shot compilation system + make or Hadrian as as done in GHC's build system. Weirdly, often `ghc --make` scales much worse and has much worse incremental recompilation times than the one-shot mode, which doesn't make sense given that it has no process creation overhead, can do much better caching etc. I believe that if ghc or large parts of it (e.g. stage2) itself was built with `--make`, we would magically see --make become very good, simply we make the right people (GHC devs) suffer through it daily :D. I expect from this the solution of the `-j` slowness, GHC overhead reduction, faster interface file loads and so on. These are some ideas. Niklas

Alfredo Di Napoli

10 Apr 10 Apr

2:54 p.m.

Hey all, thanks a ton for the invaluable pointers. I’m now in the “I-dunno-what-I-am-doing” mode banging SCC annotations like there is no tomorrow, trying to spot any chance for some low-hanging-fruit algorithmic improvement (like using a sequence instead of a list, etc), and will come back to your suggestions as I will face the inevitable dead-end wall :D Alfredo On 10 April 2017 at 01:54, Niklas Hambüchen wrote:

...

I have some suggestions for low hanging fruits in this effort.

1. Make ghc print more statistics on what it spending time on

When I did the linking investigation recently (https://www.reddit.com/r/haskell/comments/63y43y/liked_ linking_3x_faster_with_gold_link_10x_faster/) I noticed (with strace) that there are lots of interesting syscalls being made that you might not expect. For example, each time TH is used, shared libraries are loaded, and to determine the shared library paths, ghc shells out to `gcc --print-file-name`. Each such invocation takes 20 ms on my system, and I have 1000 invocations in my build. That's 20 seconds (out of 2 minutes build time) just asking gcc for paths.

I recommend that for every call to an external GHC measures how long that call took, so that it can be asked to print a summary when it's done.

That might give us lots of interesting things to optimize. For example, This would have made the long linker times totally obvious.

At the end, I would love to know for each compilation (both one-shot as used in ghc's build system, and `ghc --make`):

* What programs did it invoke and how long did they take * What files did it read and how long did that take * How long did it take to read all the `.hi` files in `ghc --make` * High level time summary (parsing, typechecking, codegen, .hi files, etc)

That way we'll know at least what is slow, and don't have to resort to strace every time in order to obtain this basic answer.

2. Investigate if idiotic syscalls are being done and how much

There's this concept I call "idiotic syscalls", which are syscalls of which you know from before that they won't contribute anything productive. For example, if you give a linker N many `-L` flags (library dirs) and M many `-l` flags (library names to link), it will try to `stat()` or `open()` N*M many files, out of which most are total rubbish, because we typically know what library is in what dir. Example: You pass `-L/usr/lib/opencv -L/usr/lib/imagemagick -L/usr/lib/blas -lopencv -limagemagick -lblas`. Then you you will get things like `open("/usr/lib/opencv/libimagemagick.so") = ENOENT` which makes no sense and obviously that file doesn't exist. This is a problem with the general "search path" concept; same happens for running executables searching through $PATH. Yes, nonexistent file opens fail fast, but in my typical ghc invocation I get millions of them (and we should at least measure how much time is wasted on them), and they clutter the strace output and make the real problems harder to investigate. We should check if we can create ways to give pass those files that do exist.

3. Add pure TemplateHaskell

It is well known that TH is a problem for incremental compilation because it can have side effects and we must therefore be more conservative about when to recompile; when you see a `[TH]` in your `ghc --make` output, it's likely that time again.

I believe this could be avoided by adding a variant of TH that forbids the use of the `runIO` function, and can thus not have side effects.

Most TH does not need side effects, for example any form of code generation based on other data types (lenses, instances for whatever). If that was made "pure TH", we would not have to recompile when inputs to our TH functions change.

Potentially this could even be determined automatically instead of adding a new variant of TH like was done for typed TH `$$()`, simply by inspecting what's in the TH and if we can decide there's no `runIO` in there, mark it as clean, otherwise as tainted.

4. Build ghc with `ghc --make` if possible

This one might be controversial or impossible (others can likely tell us). Most Haskell code is built with `ghc --make`, not with the one-shot compilation system + make or Hadrian as as done in GHC's build system. Weirdly, often `ghc --make` scales much worse and has much worse incremental recompilation times than the one-shot mode, which doesn't make sense given that it has no process creation overhead, can do much better caching etc. I believe that if ghc or large parts of it (e.g. stage2) itself was built with `--make`, we would magically see --make become very good, simply we make the right people (GHC devs) suffer through it daily :D. I expect from this the solution of the `-j` slowness, GHC overhead reduction, faster interface file loads and so on.

These are some ideas.

Niklas

Ben Gamari

10:15 p.m.

Niklas Hambüchen writes:

...

I have some suggestions for low hanging fruits in this effort.

Hi Niklas, Thanks for your suggestions, they are quite reasonable.

...

1. Make ghc print more statistics on what it spending time on

When I did the linking investigation recently (https://www.reddit.com/r/haskell/comments/63y43y/liked_linking_3x_faster_wit...) I noticed (with strace) that there are lots of interesting syscalls being made that you might not expect. For example, each time TH is used, shared libraries are loaded, and to determine the shared library paths, ghc shells out to `gcc --print-file-name`. Each such invocation takes 20 ms on my system, and I have 1000 invocations in my build. That's 20 seconds (out of 2 minutes build time) just asking gcc for paths.

I recommend that for every call to an external GHC measures how long that call took, so that it can be asked to print a summary when it's done.

Indeed. Since 8.0 GHC has had a scheme for tracking CPU time and allocations of various compiler phases. It would be great to implement a similar scheme for external tools. You can't necessarily reuse the existing infrastructure since you will want to use monotonic time instead of CPU time and allocations aren't relevant.

...

That might give us lots of interesting things to optimize. For example, This would have made the long linker times totally obvious.

I have a number of thoughts on this which I'll share in another message as they are largely orthogonal to the matter at hand.

...

At the end, I would love to know for each compilation (both one-shot as used in ghc's build system, and `ghc --make`):

* What programs did it invoke and how long did they take * What files did it read and how long did that take * How long did it take to read all the `.hi` files in `ghc --make` * High level time summary (parsing, typechecking, codegen, .hi files, etc)

We already have much of this. See D1959 for the relevant change. It would be great to extend this with more (optional) detail (e.g. measure individual interface file deserialization times).

...

That way we'll know at least what is slow, and don't have to resort to strace every time in order to obtain this basic answer.

Indeed. On the whole this would be a very nice, more-or-less self-contained project.

...

2. Investigate if idiotic syscalls are being done and how much

There's this concept I call "idiotic syscalls", which are syscalls of which you know from before that they won't contribute anything productive. For example, if you give a linker N many `-L` flags (library dirs) and M many `-l` flags (library names to link), it will try to `stat()` or `open()` N*M many files, out of which most are total rubbish, because we typically know what library is in what dir. Example: You pass `-L/usr/lib/opencv -L/usr/lib/imagemagick -L/usr/lib/blas -lopencv -limagemagick -lblas`. Then you you will get things like `open("/usr/lib/opencv/libimagemagick.so") = ENOENT` which makes no sense and obviously that file doesn't exist. This is a problem with the general "search path" concept; same happens for running executables searching through $PATH. Yes, nonexistent file opens fail fast, but in my typical ghc invocation I get millions of them (and we should at least measure how much time is wasted on them), and they clutter the strace output and make the real problems harder to investigate. We should check if we can create ways to give pass those files that do exist.

Indeed, the matter of library search paths is actually a pretty bad one although largely a Cabal issue (see GHC #11587).

...

3. Add pure TemplateHaskell

It is well known that TH is a problem for incremental compilation because it can have side effects and we must therefore be more conservative about when to recompile; when you see a `[TH]` in your `ghc --make` output, it's likely that time again.

I believe this could be avoided by adding a variant of TH that forbids the use of the `runIO` function, and can thus not have side effects.

Most TH does not need side effects, for example any form of code generation based on other data types (lenses, instances for whatever). If that was made "pure TH", we would not have to recompile when inputs to our TH functions change.

Potentially this could even be determined automatically instead of adding a new variant of TH like was done for typed TH `$$()`, simply by inspecting what's in the TH and if we can decide there's no `runIO` in there, mark it as clean, otherwise as tainted.

Cross-compiling users would also greatly appreciate this.

...

4. Build ghc with `ghc --make` if possible

This one might be controversial or impossible (others can likely tell us). Most Haskell code is built with `ghc --make`, not with the one-shot compilation system + make or Hadrian as as done in GHC's build system. Weirdly, often `ghc --make` scales much worse and has much worse incremental recompilation times than the one-shot mode, which doesn't make sense given that it has no process creation overhead, can do much better caching etc. I believe that if ghc or large parts of it (e.g. stage2) itself was built with `--make`, we would magically see --make become very good, simply we make the right people (GHC devs) suffer through it daily :D. I expect from this the solution of the `-j` slowness, GHC overhead reduction, faster interface file loads and so on.

One thing I've wondered about is how we can make use of compact regions to try to minimize GC costs in long-lived GHC sessions (e.g. a --make invocation on a large project). This would likely help parallel performance as well. N.B. #9221 is essentially the catch-all ticket for -j scalability issues. Cheers, - Ben

Ben Gamari

10:26 p.m.

New subject: GHC and linker choice

Niklas Hambüchen writes:

...

I have some suggestions for low hanging fruits in this effort.

1. Make ghc print more statistics on what it spending time on

When I did the linking investigation recently (https://www.reddit.com/r/haskell/comments/63y43y/liked_linking_3x_faster_wit...) I noticed (with strace) that there are lots of interesting syscalls being made that you might not expect. For example, each time TH is used, shared libraries are loaded, and to determine the shared library paths, ghc shells out to `gcc --print-file-name`. Each such invocation takes 20 ms on my system, and I have 1000 invocations in my build. That's 20 seconds (out of 2 minutes build time) just asking gcc for paths.

I recommend that for every call to an external GHC measures how long that call took, so that it can be asked to print a summary when it's done.

...

That might give us lots of interesting things to optimize. For example, This would have made the long linker times totally obvious.

For what it's worth, some of us have recognized for quite some time that BFD ld is a known slow spot in the compilation pipeline. However, up until now we have considered this to be more of an issue to be handled at the packaging level than in GHC issue. Currently, GHC relies on `gcc` to link and gcc has its own idea of the default linker. Many users (e.g. Debian packagers, users who are cross compiling) are quite unhappy when GHC coerces the system toolchain into changing its default behavior. Consequently we have maintained the status quo and silently wished that Linux distributions would start moving to gold by default. Sadly there continues to be little progress on this matter. However, given that so many users are now relying on GHC binary distributions and that BFD ld is an increasingly significant issue as dependency counts increase, I think the status quo is increasingly untenable. I have proposed (#13541) that we introduce configure logic to use gold when available. There are outstanding questions however * What interface (i.e. configure flags) do we expose to the user to allow them to turn on or off this logic? * Do we want to enable the logic by default in binary distributions to ensure that users see the benefits of gold if it is available? * Do we want to also enable it in source distributions as well? If yes then developers may be caught by surprise; if no then there will be an inconsistency between the default configuration of the source tree and binary distributions (although there is plenty of precedent for this). Anyways, let me know if you have thoughts on any of this. Hopefully we'll be able to get this done for 8.4. Cheers, - Ben

Niklas Hambüchen

12 Apr 12 Apr

10:46 p.m.

New subject: GHC and linker choice

Hi Ben,

...

For what it's worth, some of us have recognized for quite some time that BFD ld is a known slow spot in the compilation pipeline. However, up until now we have considered this to be more of an issue to be handled at the packaging level than in GHC issue. Currently, GHC relies on `gcc` to link and gcc has its own idea of the default linker. Many users (e.g. Debian packagers, users who are cross compiling) are quite unhappy when GHC coerces the system toolchain into changing its default behavior.

Consequently we have maintained the status quo and silently wished that Linux distributions would start moving to gold by default. Sadly there continues to be little progress on this matter.

This is a great point. I don't know if using BFD ld vs gold is a controversial move, given that they are both GNU projects. I would imagine if lld was in the game, that might be more controversial. I have the theory that both Haskell packagers and Linux distributions are simply conservative about changing the linker not because they don't want it, but because they are scared of breaking people's code, and especially in the packagers' case, that they don't feel expert enough to judge whether them making this move will be safe. I can imagine that they would rather welcome if the GHC devs made this decision, since after all they generate the linker invocations and have the knowledge to judge whether they are supposed to work. If a packager wants to *override* GHC's default behaviour, that should certainly be possible, but as a packager I would find it unfair if GHC pushed the burden of setting up the optimal defaults for my users on me. For a comparison with other programming languages: From tickets like https://github.com/golang/go/issues/1588 seems that go also uses gold by default when available. Rust switched to gold by default with https://github.com/rust-lang/rust/pull/29974, but reverted the commit with https://github.com/rust-lang/rust/pull/30913, with reasons being the linked tickets. The Rust tickets in particular have already discussed the various issues we may be running into regarding autodetection, cross compilation and so on, it may be good to get inspiration from there.

...

* Do we want to also enable it in source distributions as well? If yes then developers may be caught by surprise

I would recommend to do the same in source and binary distributions. The surprise should be minimised by properly advertising it as a key thing in the release notes. Packagers are supposed to read those, and in fact and most users of GHC love to read them for they contain many goodies. Another option is to not change the default or enable autodetection, but make it trivial to use gold for the user, e.g. by at least fixing the Cabal library (currently you need to put your linker options into a lot of places to get the desired results, see https://github.com/haskell/cabal/issues/4435, and even if you do, for some invocations Cabal just doesn't pass on what you tell it to do, see https://github.com/haskell/cabal/issues/4439). We could run this for a GHC release term, and also have cabal-install and stack recommend people that they should try to set some easy option to use the gold linker if available. If that works well, we could switch to autodetection the release after that. I haven't though much about whether this is actually a better idea than just enabling it where we can. Niklas

3001

Age (days ago)

3012

Last active (days ago)

List overview

Download

17 comments

7 participants

participants (7)

Alfredo Di Napoli
Bartosz Nitka
Ben Gamari
Ben Gamari
Niklas Hambüchen
Reid Barton
Simon Marlow

Where do I start if I would like help improve GHC compilation times?

tags

participants (7)