Potential GSoC proposal: Reduce the speed gap between 'ghc -c' and 'ghc --make'

Hi all,

[Hoping it's not too late.]

During my work on parallelising 'ghc --make' [1] I encountered a stumbling block: running 'ghc --make' can often be much faster than using separate compile ('ghc -c') and link stages, which means that any parallel build tool built on top of 'ghc -c' will be significantly handicapped [2]. As far as I understand, this is mainly due to the effects of interface file caching: 'ghc --make' only needs to parse and load them once. One potential improvement (suggested by Duncan Coutts [3]) is to produce whole-package interface files and load them in using mmap().

Questions:

Would implementing this optimisation be a worthwhile/realistic GSoC project? What are other potential ways to bring 'ghc -c' performance up to par with 'ghc --make'?

[1] https://github.com/23Skidoo/ghc-parmake
[2] https://gist.github.com/1360470
[3] http://www.reddit.com/r/haskell/comments/qwj5j/the_cabal_of_my_dreams/c41a5g...

--
() ascii ribbon campaign - against html e-mail
/\ www.asciiribbon.org - against proprietary attachments

Questions:
Would implementing this optimisation be a worthwhile/realistic GSoC project? What are other potential ways to bring 'ghc -c' performance up to par with 'ghc --make'?
I implemented a ghc server that runs several persistent ghcs, and distributes compiles among them. It seemed to work, but the speed gains weren't what I'd hoped for. But perhaps I was doing something dumb that hurt performance. I'd be happy to upload the source if you're interested.

I for one think this would make a good GSoC project. Make sure you get your application in in time though. -- Johan

On 02/04/2012 07:37, Mikhail Glushenkov wrote:
Hi all,
[Hoping it's not too late.]
During my work on parallelising 'ghc --make' [1] I encountered a stumbling block: running 'ghc --make' can often be much faster than using separate compile ('ghc -c') and link stages, which means that any parallel build tool built on top of 'ghc -c' will be significantly handicapped [2]. As far as I understand, this is mainly due to the effects of interface file caching: 'ghc --make' only needs to parse and load them once. One potential improvement (suggested by Duncan Coutts [3]) is to produce whole-package interface files and load them in using mmap().
Questions:
Would implementing this optimisation be a worthwhile/realistic GSoC project? What are other potential ways to bring 'ghc -c' performance up to par with 'ghc --make'?
My guess is that this won't have a significant impact on ghc -c compile times. The advantage of squashing the .hi files for a package together is that they could share a string table, which would save a bit of space and time, but I think the time saved is small compared to the cost of deserialising and typechecking the declarations from the interface, which still has to be done. In fact it might make things worse, if the string table for the whole base package is larger than the individual tables that would be read from .hi files. I don't think mmap() will buy very much over the current scheme of just reading the file into a ByteArray.

Of course this is all just (educated) guesswork without actual measurements, and I could be wrong... Perhaps there are ways to optimise the reading of interface files. A good first step would be to do some profiling and see where the hotspots are.

Cheers, Simon

Hello Simon,
Sorry for the delay.
On Tue, Apr 10, 2012 at 1:03 PM, Simon Marlow wrote:
Questions:
Would implementing this optimisation be a worthwhile/realistic GSoC project? What are other potential ways to bring 'ghc -c' performance up to par with 'ghc --make'?
My guess is that this won't have a significant impact on ghc -c compile times.
The advantage of squashing the .hi files for a package together is that they could share a string table, which would save a bit of space and time, but I think the time saved is small compared to the cost of deserialising and typechecking the declarations from the interface, which still has to be done. In fact it might make things worse, if the string table for the whole base package is larger than the individual tables that would be read from .hi files. I don't think mmap() will buy very much over the current scheme of just reading the file into a ByteArray.
Thank you for the answer. I'll be working on another project during the summer, but I'm still interested in making interface files load faster.

The idea that I currently like the most is to make it possible to save and load objects in the "GHC heap format". That way, deserialisation could be done with a simple fread() and a fast pointer fixup pass, which would hopefully make running many 'ghc -c' processes as fast as a single 'ghc --make'. This trick is commonly employed in the games industry to speed up load times [1]. Given that Haskell is a garbage-collected language, the implementation will be trickier than in C++ and will have to be done at the RTS level.

Is this a good idea? How hard would it be to implement this optimisation?

Another idea (that I like less) is to implement a "build server" mode for GHC. That way, instead of a single 'ghc --make' we could run several GHC build servers in parallel. However, Evan Laforge's efforts in this direction didn't bring the expected speedup. Perhaps it's possible to improve on his work.

[1] http://www.gamasutra.com/view/feature/132376/delicious_data_baking.php?print...

Thank you for the answer. I'll be working on another project during the summer, but I'm still interested in making interface files load faster.
The idea that I currently like the most is to make it possible to save and load objects in the "GHC heap format". That way, deserialisation could be done with a simple fread() and a fast pointer fixup pass, which would hopefully make running many 'ghc -c' processes as fast as a single 'ghc --make'. This trick is commonly employed in the games industry to speed-up load times [1]. Given that Haskell is a garbage-collected language, the implementation will be trickier than in C++ and will have to be done on the RTS level.
Is this a good idea? How hard would it be to implement this optimisation?
Another idea (that I like less) is to implement a "build server" mode for GHC. That way, instead of a single 'ghc --make' we could run several ghc build servers in parallel. However, Evan Laforge's efforts in this direction didn't bring the expected speedup. Perhaps it's possible to improve on his work.
I don't know if this would help, but I remember during Rob Pike's initial Go talk he described how the 8g compiler could be so fast. I don't remember the exact details, but it was something to the effect that interface files embed the interfaces of their dependencies, so the compiler only needs to read the direct imports, not the transitive dependencies. Of course this might not work so well with GHC's fat interfaces with lots of inlined code, but it's a thought. Compilation speed was one of the few things about Go that impressed me, but it impressed me a lot.

On 25/04/2012 03:17, Mikhail Glushenkov wrote:
Hello Simon,
Sorry for the delay.
On Tue, Apr 10, 2012 at 1:03 PM, Simon Marlow wrote:
Questions:
Would implementing this optimisation be a worthwhile/realistic GSoC project? What are other potential ways to bring 'ghc -c' performance up to par with 'ghc --make'?
My guess is that this won't have a significant impact on ghc -c compile times.
The advantage of squashing the .hi files for a package together is that they could share a string table, which would save a bit of space and time, but I think the time saved is small compared to the cost of deserialising and typechecking the declarations from the interface, which still has to be done. In fact it might make things worse, if the string table for the whole base package is larger than the individual tables that would be read from .hi files. I don't think mmap() will buy very much over the current scheme of just reading the file into a ByteArray.
Thank you for the answer. I'll be working on another project during the summer, but I'm still interested in making interface files load faster.
The idea that I currently like the most is to make it possible to save and load objects in the "GHC heap format". That way, deserialisation could be done with a simple fread() and a fast pointer fixup pass, which would hopefully make running many 'ghc -c' processes as fast as a single 'ghc --make'. This trick is commonly employed in the games industry to speed-up load times [1]. Given that Haskell is a garbage-collected language, the implementation will be trickier than in C++ and will have to be done on the RTS level.
Is this a good idea? How hard would it be to implement this optimisation?
I believe OCaml does something like this.

I think the main difficulty is that the data structures in the heap are not the same every time, because we allocate unique identifiers sequentially as each Name is created. So to make this work you would have to make Names globally unique. Maybe using a 64-bit hash instead of the sequentially-allocated uniques would work, but that would entail quite a performance hit on 32-bit platforms (GHC uses IntMap everywhere with Unique as the key).

On top of this there will be a *lot* of other complications (e.g. handling sharing well, mapping info pointers somehow). Personally I think it's at best very ambitious, and at worst not at all practical.

Cheers, Simon
Another idea (that I like less) is to implement a "build server" mode for GHC. That way, instead of a single 'ghc --make' we could run several ghc build servers in parallel. However, Evan Laforge's efforts in this direction didn't bring the expected speedup. Perhaps it's possible to improve on his work.
[1] http://www.gamasutra.com/view/feature/132376/delicious_data_baking.php?print...

On 25/04/2012 08:57, Simon Marlow wrote:
On 25/04/2012 03:17, Mikhail Glushenkov wrote:
Hello Simon,
Sorry for the delay.
On Tue, Apr 10, 2012 at 1:03 PM, Simon Marlow wrote:
Questions:
Would implementing this optimisation be a worthwhile/realistic GSoC project? What are other potential ways to bring 'ghc -c' performance up to par with 'ghc --make'?
My guess is that this won't have a significant impact on ghc -c compile times.
The advantage of squashing the .hi files for a package together is that they could share a string table, which would save a bit of space and time, but I think the time saved is small compared to the cost of deserialising and typechecking the declarations from the interface, which still has to be done. In fact it might make things worse, if the string table for the whole base package is larger than the individual tables that would be read from .hi files. I don't think mmap() will buy very much over the current scheme of just reading the file into a ByteArray.
Thank you for the answer. I'll be working on another project during the summer, but I'm still interested in making interface files load faster.
The idea that I currently like the most is to make it possible to save and load objects in the "GHC heap format". That way, deserialisation could be done with a simple fread() and a fast pointer fixup pass, which would hopefully make running many 'ghc -c' processes as fast as a single 'ghc --make'. This trick is commonly employed in the games industry to speed-up load times [1]. Given that Haskell is a garbage-collected language, the implementation will be trickier than in C++ and will have to be done on the RTS level.
Is this a good idea? How hard would it be to implement this optimisation?
I believe OCaml does something like this.
I think the main difficulty is that the data structures in the heap are not the same every time, because we allocate unique identifiers sequentially as each Name is created. So to make this work you would have to make Names globally unique. Maybe using a 64-bit hash instead of the sequentially-allocated uniques would work, but that would entail quite a performance hit on 32-bit platforms (GHC uses IntMap everywhere with Unique as the key).
On top of this there will be a *lot* of other complications (e.g. handling sharing well, mapping info pointers somehow). Personally I think it's at best very ambitious, and at worst not at all practical.
Oh, I also meant to add: the best thing we could do initially is to profile GHC and see if there are improvements that could be made in the .hi file deserialisation/typechecking.

Cheers, Simon
Another idea (that I like less) is to implement a "build server" mode for GHC. That way, instead of a single 'ghc --make' we could run several ghc build servers in parallel. However, Evan Laforge's efforts in this direction didn't bring the expected speedup. Perhaps it's possible to improve on his work.
[1] http://www.gamasutra.com/view/feature/132376/delicious_data_baking.php?print...

The idea that I currently like the most is to make it possible to save and load objects in the "GHC heap format". That way, deserialisation could be done with a simple fread() and a fast pointer fixup pass, which would hopefully make running many 'ghc -c' processes as fast as a single 'ghc --make'. This trick is commonly employed in the games industry to speed up load times [1]. Given that Haskell is a garbage-collected language, the implementation will be trickier than in C++ and will have to be done at the RTS level.
Is this a good idea? How hard would it be to implement this optimisation?
I believe OCaml does something like this.
Interesting. What does OCaml do in this department? A bit of googling didn't turn up a link.

For many years Chez Scheme had a "saved heaps" capability. It was recently dropped because of the preponderance of SELinux, which randomizes addresses and messes it up, but here's the doc for v7: http://www.scheme.com/csug7/use.html#g10

I've always wondered why there weren't more language implementations with saved heaps. Under Chez the startup times were amazing: for a 50 KLOC compiler, a two-second load would become 4 milliseconds. Google Dart apparently has or will have saved heaps. It seems like an obvious choice (caching initialized heaps) for enormous websites with slow load times like GMail.

Chez also has pretty fast serialization to a binary "FASL" (fast loading) format, but I'm not sure if those were mmap'ed into the heap on load or required some parsing.

The Gamasutra link that Mikhail provided seems to describe a process where the programmer knows exactly what the expected heap representation for a particular object is, and manually creates it. Sounds like walking on thin ice. Do we know of any memory-safe GC'd language implementations that can dump a single object (rather than the whole heap)? Would they invoke the GC in a special way to trace the structure and copy it into a new region (to make it contiguous)?

Cheers, -Ryan

Hello Simon,
On Wed, Apr 25, 2012 at 9:57 AM, Simon Marlow wrote:
Is this a good idea? How hard would it be to implement this optimisation?
[...] Personally I think it's at best very ambitious, and at worst not at all practical.
Thanks. I'll look into how to optimise .hi loading by more traditional means, then.

I briefly looked at how marshalling is implemented in OCaml (byterun/intern.c), and it looks like even though the Marshal module is unsafe and saves the data in an opaque format, the deserialisation algorithm is less radical than what I proposed. They go through the input buffer looking at the tag and size of each object, and copy them to the heap buffer one by one. One interesting detail is that the heap buffer is allocated only once, because its size is known in advance.

On Thu, Apr 26, 2012 at 2:34 PM, Mikhail Glushenkov wrote:
Thanks. I'll look into how to optimise .hi loading by more traditional means, then.
Lennart is working on speeding up the binary package (which I believe is used to decode the .hi files.) His work might benefit this effort. -- Johan

On 26/04/2012 23:32, Johan Tibell wrote:
On Thu, Apr 26, 2012 at 2:34 PM, Mikhail Glushenkov wrote:
Thanks. I'll look into how to optimise .hi loading by more traditional means, then.
Lennart is working on speeding up the binary package (which I believe is used to decode the .hi files.) His work might benefit this effort.
We're still using our own Binary library in GHC. There's no good reason for that, unless using the binary package would be a performance regression. (We don't know whether that's the case or not, with the current binary.)

Cheers, Simon

Mikhail's original question was about loading interface files for entire packages with mmap.

As a wild thought experiment, if GHC had a saved-heaps capability, I believe that would avoid the Unique issues with mmap'ing individual data structures that Simon mentioned. How about if each whole-package interface were then a GHC saved heap that, when booted, would become an "interface" server that would communicate with, and be shared by, other GHC build server processes?

-Ryan
-Ryan
On Fri, Apr 27, 2012 at 4:57 AM, Simon Marlow wrote:
On 26/04/2012 23:32, Johan Tibell wrote:
On Thu, Apr 26, 2012 at 2:34 PM, Mikhail Glushenkov wrote:
Thanks. I'll look into how to optimise .hi loading by more traditional means, then.
Lennart is working on speeding up the binary package (which I believe is used to decode the .hi files.) His work might benefit this effort.
We're still using our own Binary library in GHC. There's no good reason for that, unless using the binary package would be a performance regression. (we don't know whether that's the case or not, with the current binary).
Cheers, Simon

Thanks. I'll look into how to optimise .hi loading by more traditional means, then.
Lennart is working on speeding up the binary package (which I believe is used to decode the .hi files.) His work might benefit this effort.
Last time I tested it, mmap still offered better performance than fread on Linux. In addition to improving the deserialization code, it would seem like a good idea to mmap the whole file at the outset as well.

It seems like readBinMem is the relevant function (readIFace -> readBinIFace -> readBinMem), which occurs here:

https://github.com/ghc/ghc/blob/08894f96407635781a233145435a78f144accab0/com...

Currently it does one big hGetBuf to read the file. Since the interface files aren't changing dynamically, I think it's safe to just replace this code with an mmap.

It's nice to see that we have several wrapped versions of mmap provided on hackage:

http://hackage.haskell.org/package/vector-mmap
http://hackage.haskell.org/package/bytestring-mmap-0.2.2
http://hackage.haskell.org/package/mmap-0.5.7

Cheers, -Ryan

On 23/05/12 21:11, Ryan Newton wrote:
Thanks. I'll look into how to optimise .hi loading by more traditional means, then.
Lennart is working on speeding up the binary package (which I believe is used to decode the .hi files.) His work might benefit this effort.
Last time I tested it, mmap still offered better performance than fread on linux. In addition to improving the deserialization code it would seem like a good idea to mmap the whole file at the outset as well.
It seems like readBinMem is the relevant function (readIFace -> readBinIFace -> readBinMem), which occurs here:
https://github.com/ghc/ghc/blob/08894f96407635781a233145435a78f144accab0/com...
Currently it does one big hGetBuf to read the file. Since the interface files aren't changing dynamically, I think it's safe to just replace this code with an mmap.
I honestly don't think it will make much difference, because reading the files is not the bottleneck, but we'll happily accept a patch. Adding a new package dependency just for this doesn't seem worthwhile though.

Cheers, Simon
participants (5)
- Evan Laforge
- Johan Tibell
- Mikhail Glushenkov
- Ryan Newton
- Simon Marlow