Help implementing Multiline String Literals

Hello!
I'm trying to implement #24390 https://gitlab.haskell.org/ghc/ghc/-/issues/24390, which implements the multiline string literals proposal https://github.com/ghc-proposals/ghc-proposals/blob/master/proposals/0569-mu... (existing work is in wip/multiline-strings https://gitlab.haskell.org/ghc/ghc/-/compare/master...wip%2Fmultiline-strings?from_project_id=1&straight=false). I originally suggested adding HsMultilineString to HsLit and translating it to HsString in renaming; Matthew Pickering then suggested I translate it in desugaring instead. I tried that approach, but I'm running into two main issues: escaped characters and overloaded strings.
Apologies in advance for a long email. *TL;DR* - The best implementation I could think of involves a complete rewrite of how strings are lexed and modifying HsString instead of adding a new HsMultilineString constructor. If this is absolutely crazy talk, please dissuade me from this :)
===== Problem 1: Escaped characters =====
Currently, Lexer.x resolves escaped characters for string literals. Note [Literal source text] shows that this is intentional: HsString should contain a normalized internal representation. However, multiline string literals have a post-processing step that must distinguish between the user typing a newline and the user typing a literal backslash followed by an `n` (and must know other things, such as whether the user typed `\&`, which currently also disappears during lexing).
Fundamentally, the current logic to resolve escaped characters is specific to the Lexer monad and operates on a per-character basis. But the multiline string literals proposal requires post-processing the whole string first, then resolving escaped characters all at once.
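To make Problem 1 concrete, here is a minimal sketch. It assumes a hypothetical pure function `resolveEscapes` that handles only a handful of escapes; it is not GHC's lexer code. It shows that once escapes are resolved, a typed `\n` escape and a real newline in the source can no longer be told apart, and that `\&` leaves no trace.

```haskell
-- Hypothetical, heavily simplified escape resolver (a few escapes only).
resolveEscapes :: String -> String
resolveEscapes ('\\' : 'n'  : rest) = '\n' : resolveEscapes rest
resolveEscapes ('\\' : '\\' : rest) = '\\' : resolveEscapes rest
resolveEscapes ('\\' : '"'  : rest) = '"'  : resolveEscapes rest
resolveEscapes ('\\' : '&'  : rest) = resolveEscapes rest -- \& denotes the empty string
resolveEscapes (c : rest)           = c : resolveEscapes rest
resolveEscapes []                   = []

-- Raw source text between the quotes, exactly as the user typed it.
typedEscape :: String
typedEscape = "a\\nb" -- the four characters: a, backslash, n, b

typedNewline :: String
typedNewline = "a\nb" -- the three characters: a, real newline, b

-- The raw texts differ, but their resolved forms are identical, so any
-- later pass that only sees the resolved form cannot treat them differently.
sameAfterResolving :: Bool
sameAfterResolving = resolveEscapes typedEscape == resolveEscapes typedNewline
```

Any post-processing that needs to tell the two apart (for example, splitting on source newlines only) therefore has to run before escape resolution.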
Possible solutions:
(1.1) Duplicate the logic for resolving escaped characters
  * Pro: Leaves normal string lexing untouched
  * Con: Two sources of truth, possibly divergent behaviors between multiline and normal strings
(1.2) Stick the post-processed string back into P, then rerun normal string lexing to resolve escaped characters
  * Pro: Leaves normal string lexing untouched
  * Con: Seems roundabout, inefficient, and hacky
(1.3) Refactor the resolve-escaped-characters logic to work both in the P monad and as a pure function `String -> String`
  * Pro: Reuses the same escaped-characters logic for both normal + multiline strings
  * Con: Different overall behavior between the two string types: normal strings still lexed per-character, multiline strings would lex everything
  * Con: Small refactor of lexing normal strings, which could introduce regressions
(1.4) Read the entire string (both normal + multiline) with no preprocessing (including string gaps or anything, except escaping quote delimiters), and define all post-processing steps as pure `String -> String` functions
  * Pro: Gets out of monadic code quickly, turns the bulk of the string logic into pure code
  * Pro: Processes normal + multiline strings exactly the same
  * Pro: Opens the door for future string behaviors, e.g. raw strings could use the same "read entire string" logic and just not do any post-processing
  * Con: Could be less performant
  * Con: Major refactor of lexing normal strings, which could introduce regressions
I like solution 1.4 the best, as it generalizes string processing the most and is more pipeline-style than the current, more imperative style. But I recognize that performance or behavior regressions are a real risk, so if anyone has any thoughts here, I'd love to hear them.
===== Problem 2: Overloaded strings =====
Currently, `HsString s` is converted into `HsOverLit (HsIsString s)` in the renaming phase.
Following Matthew's suggestion of resolving multiline string literals in the desugar step, this would mean that multiline string literals are post-processed after OverloadedStrings has already been applied.
I don't like any of the solutions this approach brings up:
  * Do post-processing both when desugaring HsMultilineString AND when renaming HsMultilineString to HsOverLit - seems wrong to process multiline strings in two different phases
  * Add HsIsStringMultiline and post-process when desugaring both HsMultilineString and HsIsStringMultiline - would ideally like to avoid adding an HsIsStringMultiline variant
Instead, I propose we throw away the HsMultilineString idea and reuse HsString. The multiline syntax would still be preserved in the SourceText, and this also leaves the door open for future string features. For example, if we went with HsMultilineString, then adding raw strings would require adding both HsRawString and HsMultilineRawString.
Here are two possible solutions for reusing HsString:
(2.1) Add a HsStringType parameter to HsString
  * HsStringType would define the format of the FastString stored in HsString: Normal => processed, Multiline => stores raw string, needs post-processing
  * Post-processing could occur in desugaring, with or without OverloadedStrings
  * Pro: Shows the parsed multiline string before processing in -ddump-parsed
  * Con: HsString containing Multiline strings would not contain the normalized representation mentioned in Note [Literal source text]
  * Con: Breaking change in the GHC API
(2.2) Post-process multiline strings in the lexer
  * Lexer would do all the post-processing (for example, in conjunction with solution 1.4) and just return a normal HsString
  * Pro: Multiline string is immediately desugared and behaves as expected for OverloadedStrings (and any other behaviors of string literals, existing or future) for free
  * Pro: HsString would still always contain the normalized representation
  * Con: No way of inspecting the raw multiline parse output before processing, e.g. via -ddump-parsed
I'm leaning towards solution 2.1, but curious what people's thoughts are.
===== Closing remarks =====
Again, sorry for the long email. My head is spinning trying to figure out this feature. Any help would be greatly appreciated.
As an aside, I last worked on GHC back in 2020 or 2021, and my goodness. The Hadrian build is so much smoother (and faster!? Not sure if it's just my new laptop though) than it was the last time I touched the codebase. Huge thanks to the maintainers, both for the tooling and the docs in the wiki. This is a much more enjoyable experience.
Thanks,
Brandon
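Option (2.1) could look roughly like the sketch below. The types are simplified stand-ins, not GHC's actual definitions: the real `HsString` carries a SourceText and a FastString, and the constructor names `NormalString` and `MultilineString` as well as the function names are made up for illustration. The point is only that the tag travels with the literal through renaming, so a single desugaring function serves both the plain and the OverloadedStrings path.

```haskell
-- Hypothetical tag describing what the stored string contains.
data HsStringType
  = NormalString    -- already normalized by the lexer
  | MultilineString -- raw text, still needs post-processing
  deriving (Eq, Show)

-- Simplified stand-ins for the literal types discussed in the thread.
data HsLit = HsString HsStringType String
  deriving (Eq, Show)

data HsOverLit = HsIsString HsStringType String
  deriving (Eq, Show)

-- Renaming under OverloadedStrings only rewraps the literal; the tag is kept.
toOverLit :: HsLit -> HsOverLit
toOverLit (HsString ty s) = HsIsString ty s

-- Desugaring: the post-processing function is passed in, so this sketch
-- does not commit to any particular multiline algorithm.
desugarStringWith :: (String -> String) -> HsStringType -> String -> String
desugarStringWith _           NormalString    s = s
desugarStringWith postProcess MultilineString s = postProcess s
```

The cost named in the list above is visible here too: a `MultilineString` payload is not the normalized representation until `desugarStringWith` has run.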

Hi Brandon,
I'm not following all of the details here, but from my naïve
understanding, I would definitely tweak the lexer, do the
post-processing and then have a canonical string representation rather
than waiting until desugaring.
If you like 1.4 best, give it a try. You will see soon enough if some
performance regression test gets worse. It can't hurt to write a few
yourself either.
I don't think that post-processing the strings would incur too much of a
hit compared to compiling those strings and serialising them into an
executable.
I also bet that you can get rid of some of the performance problems with
list fusion.
Cheers,
Sebastian
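For reference, option (1.4) from the original message amounts to a pipeline of pure `String -> String` stages applied to the raw text between the quotes. The following is a minimal sketch under that assumption; `collapseGaps`, `resolveEscapes`, `processNormal` and `processRaw` are hypothetical names, and only a few escapes are handled.

```haskell
import Data.Char (isSpace)

-- Stage 1: remove string gaps (backslash, one or more whitespace
-- characters, backslash). Other escape pairs are copied through unchanged,
-- so an escaped backslash is never mistaken for the start of a gap.
collapseGaps :: String -> String
collapseGaps ('\\' : c : rest)
  | isSpace c = case dropWhile isSpace rest of
      '\\' : rest' -> collapseGaps rest'
      _            -> '\\' : c : collapseGaps rest -- malformed gap: left as-is in this sketch
  | otherwise = '\\' : c : collapseGaps rest
collapseGaps (c : rest) = c : collapseGaps rest
collapseGaps []         = []

-- Stage 2: resolve a small subset of escapes.
resolveEscapes :: String -> String
resolveEscapes ('\\' : 'n'  : rest) = '\n' : resolveEscapes rest
resolveEscapes ('\\' : '\\' : rest) = '\\' : resolveEscapes rest
resolveEscapes ('\\' : '"'  : rest) = '"'  : resolveEscapes rest
resolveEscapes ('\\' : '&'  : rest) = resolveEscapes rest
resolveEscapes (c : rest)           = c : resolveEscapes rest
resolveEscapes []                   = []

-- Normal strings run both stages; a multiline string would add its own
-- stages between the two, and a raw string would run none at all.
processNormal :: String -> String
processNormal = resolveEscapes . collapseGaps

processRaw :: String -> String
processRaw = id
```

Because every stage is an ordinary list function, performance concerns can be investigated with the usual tools (benchmarks, fusion-friendly combinators) outside the lexer monad.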

I would imagine you modify the lexer like you describe, but it's not
clear to me that you want to use the same constructor `HsString` to
represent them all the way through the compiler.
If you reuse HsString then how do you distinguish between a string
which contains a newline and a multi-line string, for example? It just
seems simpler to me to explicitly represent a multi-line string,
perhaps as `HsMultiLineString [String]`, rather than trying to shoehorn
them together and run into subtle bugs like this.
Matt
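A minimal sketch of the representation suggested here, using a simplified stand-in for `HsLit` (the real type has more constructors and carries source text). `litValue` is a hypothetical helper. The two forms stay distinct in the syntax tree even when they denote the same string value.

```haskell
import Data.List (intercalate)

-- Simplified stand-in: one constructor per syntactic form.
data HsLit
  = HsString String            -- "..." literal
  | HsMultiLineString [String] -- multi-line literal, one entry per line
  deriving (Eq, Show)

-- The value a literal denotes: lines are joined with '\n'.
litValue :: HsLit -> String
litValue (HsString s)           = s
litValue (HsMultiLineString ls) = intercalate "\n" ls

-- Different syntax trees ...
singleLine, multiLine :: HsLit
singleLine = HsString "a\nb"
multiLine  = HsMultiLineString ["a", "b"]

-- ... with the same denotation.
sameValue :: Bool
sameValue = litValue singleLine == litValue multiLine
```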
On Thu, Feb 8, 2024 at 2:45 PM Sebastian Graf
------ Original Message ------
From: "Brandon Chinn"
To: ghc-devs@haskell.org
Sent: 04.02.2024 19:24:19
Subject: Help implementing Multiline String Literals
_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

On Thu, Feb 08, 2024 at 03:09:37PM +0000, Matthew Pickering wrote:
I would imagine you modify the lexer like you describe, but it's not clear to me that you want to use the same constructor `HsString` to represent them all the way through the compiler.
If you reuse HsString then how do you distinguish between a string which contains a newline and a multi-line string, for example? It just seems simpler to me to explicitly represent a multi-line string, perhaps as `HsMultiLineString [String]`, rather than trying to shoehorn them together and run into subtle bugs like this.
Compiler aside, how first-class are multi-line strings expected to be? Should `Read` instances also be able to handle multi-line strings? Arguably, valid source syntax should also be valid syntax for `Read`.
--
Viktor.

Thanks Sebastian and Matt!
Matt - can you elaborate? I don't understand your comment. A multiline
string is just syntactic sugar for a normal string, so if the lexer does the
post-processing, it can be treated as a normal string the rest of the way.
Why does anything else in the compiler need to know if the string was
written as a multiline string?
Or, to rephrase, a multiline string _should_ be semantically
indistinguishable from a normal string with \n characters typed in.
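The equivalence claimed here can be illustrated with a simplified post-processing sketch. This is not the proposal's exact algorithm (it ignores tabs, string gaps, escapes, and any special handling of the first or last line), and `splitLines` and `dedent` are hypothetical names. It strips the common indentation, blanks whitespace-only lines, and drops a leading newline.

```haskell
import Data.Char (isSpace)
import Data.List (intercalate)

-- Split on '\n', keeping empty lines (unlike Prelude's `lines`, which
-- drops a trailing empty line).
splitLines :: String -> [String]
splitLines s = case break (== '\n') s of
  (l, [])       -> [l]
  (l, _ : rest) -> l : splitLines rest

-- Simplified multiline post-processing over the raw literal contents.
dedent :: String -> String
dedent raw = dropLeadingNewline (intercalate "\n" stripped)
  where
    ls       = splitLines raw
    blank    = all isSpace
    -- indentation of every non-blank line, counted in spaces
    indents  = [length (takeWhile (== ' ') l) | l <- ls, not (blank l)]
    common   = if null indents then 0 else minimum indents
    stripped = [if blank l then "" else drop common l | l <- ls]
    dropLeadingNewline ('\n' : rest) = rest
    dropLeadingNewline other         = other

-- Raw contents of a multiline literal whose opening delimiter is
-- immediately followed by a newline.
rawMultiline :: String
rawMultiline = "\n    hello\n      world"

-- The same value written as a normal string with \n typed in.
asNormalString :: String
asNormalString = "hello\n  world"
```

Under these simplified rules, `dedent rawMultiline` and `asNormalString` are the same `String`, which is the sense in which the two forms are semantically indistinguishable.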

I don't think that is the right way to go.
They are different syntactic forms so they should be distinguished in
the syntax tree.
If I want to generate HsSyn directly, and print it out, how does the
compiler know whether I meant to print a normal string literal or a
multi-line string literal? And when the compiler prints an expression
containing a string literal in an error message, should it print the
multi-line form or the normal form?
Matt
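The printing concern can be sketched as follows, again with a simplified stand-in for `HsLit` and a hypothetical `pprLit` (GHC's real pretty-printer works on SDoc, not String). With separate constructors the printer can reproduce the form the user wrote; with a single constructor it would have to pick one. The sketch assumes the lines contain nothing that needs escaping.

```haskell
import Data.List (intercalate)

-- Simplified stand-in: one constructor per syntactic form.
data HsLit
  = HsString String
  | HsMultiLineString [String]

-- Render a literal back to source syntax.
pprLit :: HsLit -> String
pprLit (HsString s) = show s -- `show` re-escapes, e.g. a newline becomes \n
pprLit (HsMultiLineString ls) =
  -- Assumes no line contains a backslash or a triple quote.
  intercalate "\n" (["\"\"\""] ++ ls ++ ["\"\"\""])
```

For example, `pprLit (HsString "a\nb")` yields the one-line form with an escaped `\n`, while `pprLit (HsMultiLineString ["a", "b"])` yields the delimiters and each line on its own row.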
On Thu, Feb 8, 2024 at 3:35 PM Brandon Chinn
Thanks Sebastian and Matt!
Matt - can you elaborate, I don't understand your comment. A multiline string is just syntax sugar for a normal string, so if the lexer does the post processing, it can be treated as a normal string the rest of the way. Why does anything else in the compiler need to know if the string was written as a multiline string?
Or, to rephrase, a multiline string _should_ be semantically indistinguishable from a normal string with \n characters typed in.
On Thu, Feb 8, 2024, 7:09 AM Matthew Pickering
wrote: I would imagine you modify the lexer like you describe, but it's not clear to me you want to use the same constructor `HsString` to represent them all the way through the compiler.
If you reuse HsString then how to you distinguish between a string which contains a newline and a multi-line string for example? It just seems simpler to me to explicitly represent a multi-line string.. perhaps `HsMultiLineString [String]` rather than trying to shoehorn them together and run into subtle bugs like this.
Matt
On Thu, Feb 8, 2024 at 2:45 PM Sebastian Graf
wrote: Hi Brandon,
I'm not following all of the details here, but from my naïve understanding, I would definitely tweak the lexer, do the post-processing and then have a canonical string representation rather than waiting until desugaring. If you like 1.4 best, give it a try. You will seen soon enough if some performance regression test gets worse. It can't hurt to write a few yourself either. I don't think that post-processing the strings would incur too much a hit compared to compiling those strings and serialise them into an executable. I also bet that you can get rid some of the performance problems with list fusion.
Cheers, Sebastian
------ Originalnachricht ------ Von: "Brandon Chinn"
An: ghc-devs@haskell.org Gesendet: 04.02.2024 19:24:19 Betreff: Help implementing Multiline String Literals Hello!
I'm trying to implement #24390, which implements the multiline string literals proposal (existing work done in wip/multiline-strings). I originally suggested adding HsMultilineString to HsLit and translating it to HsString in renaming, then Matthew Pickering suggested I translate it in desugaring instead. I tried going down this approach, but I'm running into two main issues: Escaped characters and Overloaded strings.
Apologies in advance for a long email. TL;DR - The best implementation I could think of involves a complete rewrite of how strings are lexed and modifying HsString instead of adding a new HsMultilineString constructor. If this is absolutely crazy talk, please dissuade me from this :)
===== Problem 1: Escaped characters =====
Currently, Lexer.x resolves escaped characters for string literals. In the Note [Literal source text], we see that this is intentional; HsString should contain a normalized internal representation. However, multiline string literals have a post-processing step that requires distinguishing between the user typing a newline vs the user typing literally a backslash + an `N` (and other things like knowing if a user typed in `\&`, which currently goes away in lexing as well).
Fundamentally, the current logic to resolve escaped characters is specific to the Lexer monad and operates on a per-character basis. But the multiline string literals proposal requires post-processing the whole string, then resolving escaped characters all at once.
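To make the ordering issue concrete, here is a minimal sketch. It assumes a toy post-processing step that only strips common indentation; `resolveEscapes`, `stripCommonIndent`, and `splitLines` are hypothetical helpers for illustration, not code from Lexer.x or from the proposal. If escapes are resolved first, an escaped `\n` turns into a real newline and changes what the indentation pass sees.

```haskell
import Data.Char (isSpace)
import Data.List (intercalate)

-- Hypothetical helper: resolve only the \n and \\ escapes.
resolveEscapes :: String -> String
resolveEscapes ('\\' : 'n' : rest) = '\n' : resolveEscapes rest
resolveEscapes ('\\' : '\\' : rest) = '\\' : resolveEscapes rest
resolveEscapes (c : rest) = c : resolveEscapes rest
resolveEscapes [] = []

-- Split on real newline characters, keeping empty trailing lines.
splitLines :: String -> [String]
splitLines s = case break (== '\n') s of
  (l, []) -> [l]
  (l, _ : rest) -> l : splitLines rest

-- Hypothetical stand-in for the multiline post-processing:
-- drop the indentation shared by all non-blank lines.
stripCommonIndent :: String -> String
stripCommonIndent s = intercalate "\n" (map (drop n) ls)
  where
    ls = splitLines s
    indents = [length (takeWhile (== ' ') l) | l <- ls, not (all isSpace l)]
    n = if null indents then 0 else minimum indents

-- Post-process the raw text, then resolve escapes.
postThenEscape :: String -> String
postThenEscape = resolveEscapes . stripCommonIndent

-- Resolve escapes per character first, as the lexer does today, then post-process.
escapeThenPost :: String -> String
escapeThenPost = stripCommonIndent . resolveEscapes
```

For the raw body `"  one\\ntwo\n  three"` (an escaped `\n` on the first line, a real newline before the last), the two orderings disagree: only the first strips the shared two-space indent.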
Possible solutions:
(1.1) Duplicate the logic for resolving escaped characters
  * Pro: Leaves normal string lexing untouched
  * Con: Two sources of truth, possibly divergent behaviors between multiline and normal strings
(1.2) Stick the post-processed string back into P, then rerun normal string lexing to resolve escaped characters
  * Pro: Leaves normal string lexing untouched
  * Con: Seems roundabout, inefficient, and hacky
(1.3) Refactor the resolve-escaped-characters logic to work in both the P monad and as a pure function `String -> String`
  * Pro: Reuses same escaped-characters logic for both normal + multiline strings
  * Con: Different overall behavior between the two string types: Normal string still lexed per-character, Multiline strings would lex everything
  * Con: Small refactor of lexing normal strings, which could introduce regressions
(1.4) Read entire string (both normal + multiline) with no preprocessing (including string gaps or anything, except escaping quote delimiters), and define all post-processing steps as pure `String -> String` functions
  * Pro: Gets out of monadic code quickly, turn bulk of string logic into pure code
  * Pro: Processes normal + multiline strings exactly the same
  * Pro: Opens the door for future string behaviors, e.g. raw string could do the same "read entire string" logic, and just not do any post-processing.
  * Con: Could be less performant
  * Con: Major refactor of lexing normal strings, which could introduce regressions
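A rough sketch of what option 1.4 could look like, assuming the lexer hands over the raw text between the quotes. All names here (`collapseGaps`, `resolveEscapes`, `multilineLayout`, `processNormal`, `processMultiline`) are made up for illustration, only a handful of escapes are handled, and `multilineLayout` is a placeholder rather than the proposal's actual rules.

```haskell
import Data.Char (isSpace)

-- Remove string gaps: a backslash, some whitespace, then a closing backslash.
-- A malformed gap is left untouched in this sketch.
collapseGaps :: String -> String
collapseGaps ('\\' : c : rest)
  | isSpace c =
      case dropWhile isSpace rest of
        '\\' : rest' -> collapseGaps rest'
        _ -> '\\' : c : collapseGaps rest
  | otherwise = '\\' : c : collapseGaps rest
collapseGaps (c : rest) = c : collapseGaps rest
collapseGaps [] = []

-- Resolve a small subset of escapes; anything else is passed through as-is.
resolveEscapes :: String -> String
resolveEscapes ('\\' : c : rest) = case c of
  'n' -> '\n' : more
  't' -> '\t' : more
  '\\' -> '\\' : more
  '"' -> '"' : more
  '&' -> more -- \& produces no character
  _ -> '\\' : c : more
  where
    more = resolveEscapes rest
resolveEscapes (c : rest) = c : resolveEscapes rest
resolveEscapes [] = []

-- Placeholder for the multiline-specific steps: here it only drops one
-- leading newline.
multilineLayout :: String -> String
multilineLayout ('\n' : rest) = rest
multilineLayout s = s

-- Both string kinds share the same pure passes; multiline adds one step
-- before escapes are resolved.
processNormal :: String -> String
processNormal = resolveEscapes . collapseGaps

processMultiline :: String -> String
processMultiline = resolveEscapes . multilineLayout . collapseGaps
```

The point of the design is that the escape logic exists once, as a pure function, and a future string flavor would just compose a different set of passes.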
I like solution 1.4 the best, as it generalizes string processing behavior the best and is more pipeline-style vs the currently more imperative style. But I recognize possible performance or behavior regressions are a real thing, so if anyone has any thoughts here, I'd love to hear them.
===== Problem 2: Overloaded strings =====
Currently, `HsString s` is converted into `HsOverLit (HsIsString s)` in the renaming phase. Following Matthew's suggestion of resolving multiline string literals in the desugar step, this would mean that multiline string literals are post-processed after OverloadedStrings has already been applied.
I don't like any of the solutions this approach brings up:
  * Do post processing both when Desugaring HsMultilineString AND when Renaming HsMultilineString to HsOverLit - seems wrong to process multiline strings in two different phases
  * Add HsIsStringMultiline and post process when desugaring both HsMultilineString and HsIsStringMultiline - would ideally like to avoid adding a variant of HsIsStringMultiline
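A toy model of the first bullet, to show where the duplication comes from. The `Expr` type and the `rename`/`desugar` functions are simplified stand-ins invented for this sketch, not the real GHC AST or passes, and `postProcess` is a placeholder for the multiline rules.

```haskell
-- Toy stand-ins for the literal forms discussed above.
data Expr
  = Lit String      -- stands in for HsString
  | MultiLit String -- stands in for HsMultilineString (still unprocessed)
  | OverLit String  -- stands in for HsOverLit (HsIsString s)
  deriving (Eq, Show)

-- Placeholder for the multiline post-processing.
postProcess :: String -> String
postProcess = dropWhile (== '\n')

-- Renaming: with OverloadedStrings on, literals become overloaded literals.
-- A multiline literal has to be post-processed here already.
rename :: Bool -> Expr -> Expr
rename overloaded e = case e of
  Lit s | overloaded -> OverLit s
  MultiLit s | overloaded -> OverLit (postProcess s)
  _ -> e

-- Desugaring: without OverloadedStrings the multiline literal survives
-- renaming, so post-processing is needed a second time here.
desugar :: Expr -> String
desugar e = case e of
  Lit s -> s
  MultiLit s -> postProcess s
  OverLit s -> "fromString " ++ show s
```

In this model `postProcess` is called from two phases, which is the situation the first bullet objects to.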
Instead, I propose we throw away the HsMultilineString idea and reuse HsString. The multiline syntax would still be preserved in the SourceText, and this also leaves the door open for future string features. For example, if we went with HsMultilineString, then adding raw strings would require adding both HsRawString and HsMultilineRawString.
Here are two possible solutions for reusing HsString:
(2.1) Add a HsStringType parameter to HsString
  * HsStringType would define the format of the FastString stored in HsString: Normal => processed, Multiline => stores raw string, needs post-processing
  * Post processing could occur in desugaring, with or without OverloadedStrings
  * Pro: Shows the parsed multiline string before processing in -ddump-parsed
  * Con: HsString containing Multiline strings would not contain the normalized representation mentioned in Note [Literal source text]
  * Con: Breaking change in the GHC API
(2.2) Post-process multiline strings in lexer
  * Lexer would do all the post processing (for example, in conjunction with solution 1.4) and just return a normal HsString
  * Pro: Multiline string is immediately desugared and behaves as expected for OverloadedStrings (and any other behaviors of string literals, existing or future) for free
  * Pro: HsString would still always contain the normalized representation
  * Con: No way of inspecting the raw multiline parse output before processing, e.g. via -ddump-parsed
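The difference between 2.1 and 2.2 can be sketched with a toy literal type. `StringLit`, `lexMultiline21`, `lexMultiline22`, and `desugarLit` are hypothetical names, a plain `String` stands in for FastString, and `postProcess` is again a placeholder.

```haskell
-- Toy stand-ins; not the real GHC definitions.
data HsStringType = Normal | Multiline
  deriving (Eq, Show)

data StringLit = StringLit HsStringType String
  deriving (Eq, Show)

-- Placeholder for the multiline post-processing.
postProcess :: String -> String
postProcess = dropWhile (== '\n')

-- Option 2.1: the lexer stores the raw text and tags it; the payload is
-- not normalized until desugaring.
lexMultiline21 :: String -> StringLit
lexMultiline21 raw = StringLit Multiline raw

desugarLit :: StringLit -> String
desugarLit (StringLit Normal s) = s
desugarLit (StringLit Multiline raw) = postProcess raw

-- Option 2.2: the lexer post-processes immediately, so later phases only
-- ever see a normalized, ordinary literal.
lexMultiline22 :: String -> StringLit
lexMultiline22 raw = StringLit Normal (postProcess raw)
```

Both routes end at the same string after desugaring; they differ in what an intermediate dump would show and in whether every consumer of the literal has to know about the tag.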
I'm leaning towards solution 2.1, but curious what people's thoughts are.
===== Closing remarks =====
Again, sorry for the long email. My head is spinning trying to figure out this feature. Any help would be greatly appreciated.
As an aside, I last worked on GHC back in 2020 or 2021, and my goodness. The Hadrian build is so much smoother (and faster!? Not sure if it's just my new laptop though) than what it was last time I touched the codebase. Huge thanks to the maintainers, both for the tooling and the docs in the wiki. This is a much more enjoyable experience.
Thanks, Brandon
_______________________________________________ ghc-devs mailing list ghc-devs@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

That's a good point. You want distinguishable lexemes and then one CST constructor per string type. I.e., maintain the separation for longer, until desugaring.
participants (4): Brandon Chinn, Matthew Pickering, Sebastian Graf, Viktor Dukhovni