Proposal: Roundtrip serialization of Cmm (parser-compatible pretty-printer output)
Hello GHC devs,

I'm currently working on Cmm documentation and tooling improvements as part of my Google Summer of Code project. One of my core goals is to make Cmm roundtrip serializable.

Right now, the in-memory Cmm data structure (generated programmatically, e.g., from STG via GHC) can be pretty-printed, and Cmm can also be parsed. However, the pretty-printed version is not compatible with the parser. That is, we cannot take the output of the pretty printer and feed it directly back into the parser.

Example:

Parseable version:

    sum {
      cr:
        bits64 x;
        x = R1 + R2;
        R1 = x;
        jump %ENTRY_CODE(Sp(0))[R1];
    }

Pretty-printed version:

    sum() { // []
            { info_tbls: []
              stack_info: arg_space: 8
            }
        {offset
          cf: // global
              _ce::I64 = R1 + R2;
              R1 = _ce::I64;
              call (I64[Sp + 0 * 8])(R1) args: 8, res: 0, upd: 8;
        }
    }

Another example:

Parseable version:

    simple_sum_4 { // [R2, R1]
      cr: // global
        bits64 _cq;
        _cq = R2;
        bits64 _cp;
        _cp = R1;
        R1 = _cq + _cp;
        jump (bits64[Sp])[R1];
    }

Pretty-printed version:

    simple_sum_4() { // []
            { info_tbls: []
              stack_info: arg_space: 8
            }
        {offset
          cs: // global
              _cq::I64 = R2;
              _cr::I64 = R1;
              R1 = _cq::I64 + _cr::I64;
              call (I64[Sp])(R1) args: 8, res: 0, upd: 8;
        }
    }

While it’s possible to write parseable Cmm that resembles the pretty-printed version (and hence the internal ADT), they don’t fully match, mainly because the parser inserts inferred fields using convenience functions.

Proposal:

To make roundtrip serialization possible, I propose supporting a new syntax that matches the pretty printer output exactly.

There are a couple of design options:

1. Create a separate parser that accepts the pretty-printed syntax. Files could then use either the current parser or the new strict one.
2. Extend the current parser with a dedicated block syntax like:

       low_level_unwrapped { ... }

This second option is the one my mentor recommends, as it may better reflect GHC developers' preferences. In this mode, the parser would not insert any inferred data and would expect the input to match the pretty-printed form exactly.

This would enable a true roundtrip:

- Compile Haskell to Cmm (in-memory AST)
- Pretty-print and write it to disk (wrapped in low_level_unwrapped { ... })
- Later read it back using the parser and continue with codegen

Optional future direction:

As a side note: currently the parser has both a “high-level” and a “low-level” mode. The low-level mode resembles the AST more closely but still inserts some inferred data.

If we introduce this new “exact” low-level form, it's possible the existing low-level mode could become redundant. We might then have:

- High-level syntax
- New low-level (exact)
- And possibly deprecate the current low-level variant

I’d be interested in your thoughts on whether that direction makes sense.

Serialization libraries?

One technically possible (but likely unacceptable) alternative would be to derive serialization via a library like aeson. That would enable serializing and deserializing the Cmm AST directly. However, I understand that aeson adds a large dependency footprint, and likely wouldn't be suitable for inclusion in GHC.

Final question:

Lastly, I’ve heard that parts of the Cmm pipeline may currently be under refactoring. If that’s the case, could you point me to which parts (parser, pretty printer, internal representation, etc.) are being modified? I’d like to align my efforts accordingly and avoid conflicts.

Thanks very much for your time and input! I'm happy to iterate on this based on your feedback.

Best regards,
Diego Antonio Rosario Palomino
GSoC 2025 – Cmm Documentation & Tooling
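The roundtrip goal in the proposal above (pretty-print, parse, and get the same AST back) can be written down as a checkable property. The following is only a minimal sketch on a toy statement type, assuming nothing about GHC's real Cmm types: `Expr`, `Stmt`, `pprStmt`, `pprStmtDump`, `parseStmt` and `roundtrips` are all hypothetical names invented for illustration, and the parser is built with `Text.ParserCombinators.ReadP` from `base` rather than GHC's actual Cmm parser.

```haskell
import Data.Char (isAlpha, isAlphaNum, isDigit)
import Text.ParserCombinators.ReadP

-- Toy stand-ins for Cmm expressions and statements (NOT GHC's real types).
data Expr = Reg String | Lit Integer | Add Expr Expr
  deriving (Eq, Show)

data Stmt = Assign String Expr
  deriving (Eq, Show)

-- Printer whose output the toy parser below accepts.
-- Literals are assumed to be non-negative in this sketch.
pprExpr :: Expr -> String
pprExpr (Reg r)   = r
pprExpr (Lit n)   = show n
pprExpr (Add a b) = "(" ++ pprExpr a ++ " + " ++ pprExpr b ++ ")"

pprStmt :: Stmt -> String
pprStmt (Assign r e) = r ++ " = " ++ pprExpr e ++ ";"

-- Dump-style printer: annotates locals (names starting with '_') as name::I64,
-- mimicking the shape of the pretty-printed examples in this thread.
annot :: String -> String
annot r = if take 1 r == "_" then r ++ "::I64" else r

pprExprDump :: Expr -> String
pprExprDump (Reg r)   = annot r
pprExprDump (Lit n)   = show n
pprExprDump (Add a b) = "(" ++ pprExprDump a ++ " + " ++ pprExprDump b ++ ")"

pprStmtDump :: Stmt -> String
pprStmtDump (Assign r e) = annot r ++ " = " ++ pprExprDump e ++ ";"

-- A small parser for the syntax produced by pprStmt.
token :: ReadP a -> ReadP a
token p = p <* skipSpaces

sym :: Char -> ReadP Char
sym = token . char

name :: ReadP String
name = token $ do
  c  <- satisfy (\x -> isAlpha x || x == '_')
  cs <- munch (\x -> isAlphaNum x || x == '_')
  pure (c : cs)

lit :: ReadP Integer
lit = token (read <$> munch1 isDigit)

expr :: ReadP Expr
expr = (Reg <$> name)
   <++ (Lit <$> lit)
   <++ between (sym '(') (sym ')') (Add <$> expr <* sym '+' <*> expr)

stmt :: ReadP Stmt
stmt = Assign <$> name <* sym '=' <*> expr <* sym ';'

parseStmt :: String -> Maybe Stmt
parseStmt s = case readP_to_S (skipSpaces *> stmt <* eof) s of
  [(x, "")] -> Just x
  _         -> Nothing

-- The roundtrip property: parsing the printed form yields the original value.
roundtrips :: Stmt -> Bool
roundtrips s = parseStmt (pprStmt s) == Just s
```

In this toy setting, `roundtrips (Assign "_ce" (Add (Reg "R1") (Reg "R2")))` holds, while `parseStmt (pprStmtDump ...)` on the same statement returns `Nothing`, because the parser does not understand the `::I64` annotation. That is the same kind of mismatch as in the examples above, reduced to a single statement.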
The idea of making Cmm roundtripable comes up every now and then. While the ability to feed dump output to GHC for debugging or similar purposes is useful, in the end we always ended up prioritizing one of the many other things that needed doing.

Or in other words, making Cmm (more) roundtripable seems inherently useful. However, it's questionable how much it is worth breaking things like .cmm code that exists in libraries for it. So if you want to work towards this, it should be with the goal of avoiding breakage.

There are likely also a lot of corner cases to consider, which might make this more complicated than it sounds. Ultimately this is up to you and your mentor. But if I understand correctly you have about 5 weeks left for GSoC, so getting full Cmm roundtrip ability into a state where it can be merged into GHC during that time might be too optimistic, depending on your Haskell/parser/GHC experience.

As a GHC maintainer, the most useful thing for us would therefore be incremental patches which take Cmm closer to being roundtripable. That would allow you to get at least some work that benefits the GHC project into the tree even if you end up not making it all the way to full roundtrip capability.

On the pure technical aspects:
------------------------------
> Create a separate parser ...

Creating a separate parser is not viable, at least not if you want the GHC team to maintain it. It would likely bitrot and break on the next change to Cmm, and would only add maintenance overhead.
> Extend the current parser with a dedicated block

Having blocks à la C seems fine. Your suggestion seems different, however. It's unclear from your example how those blocks would work exactly. Is `low_level_unwrapped` a label? If so, can we goto it? Is it a keyword? Something else entirely?
If the main issue is the "offset" string in the generated case I'm fine with deleting that from the pretty printer. I'm not sure that does anything of value so removing it from the output seems fine. (See pprCmmGraph).
> If we introduce this new “exact” low-level form, it's possible the existing low-level mode could become redundant. We might then have:
What changes are you planning that make the new parser/syntax incompatible with the old one? Can't you just modify the current parser, maybe with some slight changes to the pretty printer, in a way that makes it mostly backwards compatible?
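To make the "slight changes to the pretty printer" idea concrete, here is a minimal sketch of one such change: printing a local as a `bits64 _ce;` declaration plus bare uses, instead of annotating every occurrence as `_ce::I64`. Everything in it is an assumption made for illustration. `Width`, `Local`, `Operand`, `Stmt`, `dumpStyle` and `parserStyle` are hypothetical and are not GHC's types or functions; the `I64`/`bits64` pairing is taken from the examples quoted in this thread.

```haskell
import Data.List (intercalate, nub)

-- Hypothetical width tag for a local register; only two widths are shown.
data Width = W32 | W64 deriving (Eq, Show)

-- A typed local, corresponding to the dump's  _ce::I64.
data Local = Local String Width deriving (Eq, Show)

data Operand = Global String | Loc Local deriving (Eq, Show)

-- Toy assignment: destination = sum of the operands on the right-hand side.
data Stmt = Assign Operand [Operand] deriving (Eq, Show)

widthDump, widthDecl :: Width -> String
widthDump W32 = "I32"
widthDump W64 = "I64"
widthDecl W32 = "bits32"
widthDecl W64 = "bits64"

-- Dump style: every use of a local carries its type annotation.
dumpOperand :: Operand -> String
dumpOperand (Global g)        = g
dumpOperand (Loc (Local n w)) = n ++ "::" ++ widthDump w

-- Source style: locals are referred to by bare name.
srcOperand :: Operand -> String
srcOperand (Global g)        = g
srcOperand (Loc (Local n _)) = n

renderWith :: (Operand -> String) -> Stmt -> String
renderWith f (Assign dst rhs) =
  f dst ++ " = " ++ intercalate " + " (map f rhs) ++ ";"

-- All distinct locals mentioned anywhere, in order of first appearance.
localsOf :: [Stmt] -> [Local]
localsOf stmts = nub [ l | Assign d rhs <- stmts, Loc l <- d : rhs ]

dumpStyle :: [Stmt] -> [String]
dumpStyle = map (renderWith dumpOperand)

-- Parser-friendly style: declare each local once, then print bare names.
parserStyle :: [Stmt] -> [String]
parserStyle stmts =
  [ widthDecl w ++ " " ++ n ++ ";" | Local n w <- localsOf stmts ]
    ++ map (renderWith srcOperand) stmts

-- The body of the first example in this thread, in toy form.
example :: [Stmt]
example =
  [ Assign (Loc ce) [Global "R1", Global "R2"]
  , Assign (Global "R1") [Loc ce]
  ]
  where ce = Local "_ce" W64
```

Here `dumpStyle example` yields `_ce::I64 = R1 + R2;` and `R1 = _ce::I64;`, whereas `parserStyle example` yields `bits64 _ce;`, `_ce = R1 + R2;` and `R1 = _ce;`. The sketch covers only the local-register annotation; the info table, stack info and `call ... args: ...` parts of the dump would each need their own treatment.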
> aeson adds a large dependency footprint, and likely wouldn't be suitable for inclusion in GHC.

Yes, aeson seems unsuitable.
> Lastly, I’ve heard that parts of the Cmm pipeline may currently be under refactoring.

This is the first time I hear of this, so I wonder where this information came from. There could always be changes to those sorts of things, because at the end of the day they are compiler internals. But I'm not aware of any big planned changes in the near future.

Cheers,
Andreas

On 28/07/2025 02:16, Diego Antonio Rosario Palomino wrote:
_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Diego,

Like Andreas says, in general being able to parse the output that GHC itself produces would be a good idea.

A few thoughts:

- Do you have any use-cases in mind? Suppose you were 100% successful -- would anyone use it?
- You need a compelling reason to change the input language (understood by the parser), since libraries may include .cmm files, which will break. (It'd be interesting to audit Hackage to see how many libraries do include such .cmm files.)
- Rather than change the language understood by the parser, would it not be easier to change the language spat out by the pretty-printer to be compatible with the parser?

Simon

On Mon, 28 Jul 2025 at 08:46, Andreas Klebinger via ghc-devs <ghc-devs@haskell.org> wrote:
Hi Diego,

Thank you very much for your work in this direction, it's sorely needed.

I'm all for having proper roundtrip correctness for Cmm, but I am not sure altering the parser is the way to go. In my opinion, GHC should produce valid textual Cmm that can be ingested by the parser as it is today.

Have a nice day,
Hécate

On 28/07/2025 at 02:16, Diego Antonio Rosario Palomino wrote:
--
Hécate ✨
🐦: @TechnoEmpress
IRC: Hecate
WWW: https://glitchbra.in
RUN: BSD
participants (4)

- Andreas Klebinger
- Diego Antonio Rosario Palomino
- Hécate
- Simon Peyton Jones