GSOC Idea: Bytecode serialization and/or Fat Interface files

Hi all,
This is following up on this recent discussion on the list concerning fat interface files: https://mail.haskell.org/pipermail/ghc-devs/2020-October/019324.html
Now that we have been accepted as a GSOC organisation, I think it would be a good project idea for a sufficiently motivated and advanced student. This is a call for mentors (and students as well!) who would be interested in this project.
The problem is the following:
Haskell Language Server (and ghci with `-fno-code`) have very fast startup times for codebases which don't make use of Template Haskell, and thus don't require any code-gen to typecheck. This is because they can simply read the cached iface files generated by a previous compile and don't need to re-invoke the typechecker.
However, as soon as TH is involved, we are forced to retypecheck and compile files, since it is not possible to restart the code-gen process starting with only an iface file. I can think of two ways to address this problem:
1. Allow bytecode to be serialized
2. Serialize desugared Core into iface files (fat interfaces), so that (byte)code-gen can be restarted from this point and doesn't need
(1) might be challenging, but offers a few more advantages over (2), in that we can reduce the work done to load TH-heavy codebases to just a load of the cached bytecode objects from disk, and could make the load process (and times) for these codebases directly comparable to their TH-free cousins.
It would also make ghci startup a lot faster with a warm cache of bytecode objects, bringing ghci startup times in line with those of `-fno-code`.
However, (2) might be much easier to achieve and offers many of the same advantages, in that we would not need to re-run the compiler frontend or core-to-core optimisation phases. There is also already a (slightly bitrotted) implementation of (2) thanks to the work of Edward Yang.
If any of this sounds exciting to you as a student or a mentor, please get in touch.
In particular, I think (2) is a feasible project that can be completed with minimal mentoring effort. However, I'm only vaguely familiar with the details of the byte code generator, so if (1) is a direction we want to pursue, we would need a mentor familiar with the details of this part of GHC.
Cheers,
Zubin
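To make the difference between the two options concrete, here is a toy sketch in Haskell. Everything in it is hypothetical: `Expr`, `Iface`, `Instr`, `compile` and `codegenFromIface` are small stand-ins invented for illustration, not GHC's actual Core, interface or bytecode types, and `show`/`readMaybe` stand in for a real binary serialization format. It only illustrates why a thin interface cannot restart code generation while a fat one can.

```haskell
import Text.Read (readMaybe)

-- Hypothetical stand-in for desugared Core (not GHC's real Core type).
data Expr = Lit Int | Var String | Add Expr Expr
  deriving (Show, Read, Eq)

-- Hypothetical stand-in for an interface file.
data Iface = Iface
  { ifaceModule  :: String
  , ifaceExports :: [(String, String)]      -- exported name and its type, as text
  , ifaceCore    :: Maybe [(String, Expr)]  -- Just bindings = "fat" interface
  } deriving (Show, Read, Eq)

-- Hypothetical stand-in for bytecode: a tiny stack machine.
data Instr = Push Int | Load String | Plus
  deriving (Show, Read, Eq)

-- Code generation: the step that needs the bindings, not just their types.
compile :: Expr -> [Instr]
compile (Lit n)   = [Push n]
compile (Var v)   = [Load v]
compile (Add a b) = compile a ++ compile b ++ [Plus]

-- Option (2): restart code generation from a cached interface.
-- Returns Nothing for a thin interface, which carries no bindings.
codegenFromIface :: Iface -> Maybe [(String, [Instr])]
codegenFromIface iface = map (fmap compile) <$> ifaceCore iface

main :: IO ()
main = do
  let fat  = Iface "M" [("x", "Int")] (Just [("x", Add (Lit 1) (Lit 2))])
      thin = fat { ifaceCore = Nothing }
  -- Thin interface: enough to typecheck importers, but codegen cannot restart.
  print (codegenFromIface thin)
  -- prints: Nothing
  -- Fat interface, after a round trip through its serialized form.
  print (codegenFromIface =<< readMaybe (show fat))
  -- prints: Just [("x",[Push 1,Push 2,Plus])]
  -- Option (1): cache the generated code itself and skip `compile` on reload.
  let cached = show <$> codegenFromIface fat
  print ((cached >>= readMaybe) :: Maybe [(String, [Instr])])
  -- prints: Just [("x",[Push 1,Push 2,Plus])]
```

In this model, option (2) still pays for `compile` on every load, whereas option (1) only pays for deserialization, which mirrors the trade-off described above.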

I believe Josh was already working on (2) some time ago? Cc'ing him
to this thread.
I'm personally in favor of (2) since it's also super useful for
prototyping whole-program ghc backends, where one can just read all
the CgGuts from the .hi files, and get all codegen-related Core for
free.
Cheers,
Cheng
On Fri, Mar 12, 2021 at 10:32 PM Zubin Duggal wrote:

Yes, there are also John's resumable compilation ideas, and the current
performance work Obsidian Systems is doing.
On Sat, 13 Mar 2021 at 6:21 AM, Cheng Shao wrote:

Yes, see https://gitlab.haskell.org/ghc/ghc/-/wikis/Plan-for-increased-parallelism-an... where we (Obsidian) and IOHK have been planning together.
I must say, I am a bit skeptical about a GSOC student being able to take this on successfully. I thought Fendor did a great job with multiple home units, for example, but we have still to finish merging all his work! The driver is perhaps the biggest cesspool of technical debt in GHC, and it will take a while to untangle, let alone implement new features.
I forget what the rules are for more incremental or multifaceted projects, but I would prefer an approach of trying to untangle things with no singular large goal. Or maybe we can involve a student with efforts to improve CI, attacking the root cause for why it's so hard to land things in the first place.
John
On 3/12/21 7:11 PM, Moritz Angermann wrote:

I'd be happy to mentor anyone on either of these. The CI part is going to
be grueling, demotivating work with very long pauses in between, which is
why I didn't propose it yet.
I agree with John: I'm a bit skeptical about a student being able to
help with or pull anything off given the current state of things, with multiple
parties already actively involved in this, without being relegated to
a spectator's position.
On Sat, Mar 13, 2021 at 9:34 AM John Ericson wrote:

John Ericson wrote:
> I forget what the rules are for more incremental or multifaceted projects, but I would prefer an approach of trying to untangle things with no singular large goal. Or maybe we can involve a student with efforts to improve CI, attacking the root cause for why it's so hard to land things in the first place.
I think this would be ill-suited to a GSoC project. GSoC projects are strongly encouraged to be measurable projects with a clear development trajectory from the outset and multiple concrete checkpoints. If we want the project to be successful I think it would be a mistake to wander from this guidance.
Cheers,
- Ben
participants (5)
- Ben Gamari
- Cheng Shao
- John Ericson
- Moritz Angermann
- Zubin Duggal