[RFC] Support Unicode characters in instance Show String

Hi all,

Two weeks ago, I proposed “Support Unicode characters in instance Show String” [0] in the GHC issue tracker, and chessai asked me to post it here for wider feedback. The proposal posted here has been edited to reflect the new ideas and insights accumulated since then:

1. (Proposal) The proposal itself is now modeled after Python.
2. (Alternative Options) Alternative 2 is the original proposal.
3. (Downsides) New. About breakage.
4. (Prior Art) New.
5. (Unresolved Problems) New. Included for discussion.

Even though I wanted to summarize everything here, some insightful comments may have been left out or misunderstood. The original comments can be found at the original feature request.

[0] https://gitlab.haskell.org/ghc/ghc/-/issues/20027

Motivation
==========

Unicode has been widely adopted, and people around the world rely on it to write in their native languages. Haskell, however, has been stuck in ASCII: the Show instance of String escapes all non-ASCII characters, despite the fact that each element of a String is typically a Unicode code point and putStrLn works as expected. Consider the following examples:

    ghci> print "Hello, 世界"
    "Hello, \19990\30028"

    ghci> print "Hello, мир"
    "Hello, \1084\1080\1088"

    ghci> print "Hello, κόσμος"
    "Hello, \954\972\963\956\959\962"

    ghci> "Hello, 世界" -- ghci calls `show`, so string literals are also escaped
    "Hello, \19990\30028"

    ghci> "😀" -- Not only human scripts, but also emojis!
    "\128512"

This status quo is unsatisfactory for a number of reasons:

1. Even though it's a small thing, it creates an unwelcoming atmosphere for native speakers of languages whose scripts are not representable in ASCII.
2. It is an actual annoyance when debugging localized software, or strings with emojis.
3. Following 1, Haskell teachers are forced to use other languages instead of the students' mother tongues, or to rely on I/O functions like putStrLn, creating a rather unnecessary burden.
4. Other string types, like Text [1], rely on this Show instance.

Moreover, `read` can already handle Unicode strings today, so relaxing constraints on `show` doesn't affect `read . show == id`.

Proposal
========

It's proposed here to change the Show instance of String, to achieve the following output:

    ghci> print "Hello, 世界"
    "Hello, 世界"

    ghci> print "Hello, мир"
    "Hello, мир"

    ghci> print "Hello, κόσμος"
    "Hello, κόσμος"

    ghci> "Hello, 世界"
    "Hello, 世界"

    ghci> "😀"
    "😀"

More concretely, it means:

1. Modify a few guards in GHC.Show.showLitChar to not escape _readable_ Unicode characters out of the range of ASCII.
2. Provide a function showEscaped or newtype Escaped = Escaped String to obtain the current escaping behavior, in case anyone wants the current behavior back.

This proposal isn't about unescaping everything, but only readable Unicode characters. u_iswprint (GHC.Unicode.isPrint) seems to do the job, and indeed, there was a similar proposal before [2]. In summary, the behavior is similar to what Python's `repr` does.

Alternative Options
===================

1. Always use putStrLn.

   This is viable today but unsatisfactory as it requires stdout. In some cases, stdout is not accessible, e.g. Telegram or Discord bots.

2. Don't escape anything.

   `show` itself refrains from escaping most of the characters, and lets ghci do the job instead.

3. Customize ghci instead.

   ghci intercepts output strings and checks if they can be converted back to readable characters. This potentially allows for better compatibility with a variety of strangely behaving terminals, and finer-grained user control.

   Tom Ellis proposed `-interactive-print`-based solutions in the comment section.

4. A new language extension, e.g. ShowStringUnicode.

   Proposed by Julian Ospald. When enabled, readable Unicode characters are not escaped, and this is enabled by default by ghci. There are concerns about how this would affect cross-module behavior.
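Alternative 3 can be tried out today without touching base. What follows is only a minimal sketch, not the solution proposed in the issue: the names uprint, unescapeShown and dropEmptyEscape are made up for illustration. It post-processes the output of `show`, turning decimal escapes of printable non-ASCII characters back into the characters themselves and leaving every other escape alone.

```haskell
import Data.Char (chr, isAscii, isDigit, isPrint)

-- | Hypothetical helper: rewrite decimal escapes such as \19990 in the
-- output of 'show' back into the characters they denote, but only when
-- the character is printable and outside ASCII.
unescapeShown :: String -> String
-- An escaped backslash must stay as it is, even if digits follow it.
unescapeShown ('\\' : '\\' : rest) = '\\' : '\\' : unescapeShown rest
unescapeShown ('\\' : rest@(d : _))
  | isDigit d
  , (digits, rest') <- span isDigit rest
  , let n = read digits :: Integer
  , n <= 0x10FFFF
  , let c = chr (fromInteger n)
  , isPrint c && not (isAscii c)
  = c : unescapeShown (dropEmptyEscape rest')
unescapeShown (c : rest) = c : unescapeShown rest
unescapeShown [] = []

-- | 'show' separates a numeric escape from a following digit with the
-- empty escape \&, which is no longer needed once the escape is decoded.
dropEmptyEscape :: String -> String
dropEmptyEscape ('\\' : '&' : rest) = rest
dropEmptyEscape rest = rest

-- | Hypothetical replacement for 'print', intended for -interactive-print.
uprint :: Show a => a -> IO ()
uprint = putStrLn . unescapeShown . show
```

Saved as UnicodePrint.hs with a `module UnicodePrint where` line on top, this could be used as `ghci -interactive-print=UnicodePrint.uprint UnicodePrint`. Because it works on the text that `show` produces, it also affects strings nested inside other values, but it only changes what ghci displays, not what `show` returns.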
Downsides
=========

This is definitely a breaking change, but the breakage, to our current understanding, is limited.

First, use of `show` in production code is discouraged. Even if someone really does that, the breakage only happens when one tries to send the "serialized" data over the wire:

Suppose Machine A `show`s a string, saves it into a UTF-8-encoded file, and sends it to Machine B, which expects another encoding. This would be surprising for those who are used to the old behavior.

Second, though the breakage is not likely to be catastrophic for correct production code, test suites could be badly affected, as pointed out by Oleg Grenrus and vdukhovni in the comment section. Some test suites compare `show` results with expected results. vdukhovni further commented that Haskell escapes are not universally supported by non-Haskell tools, so the impact would be confined to Haskell.

Prior Art
=========

Python has supported Unicode natively since Python 3. Python's approach is intuitive and capable. Its `repr`, which is equivalent to Haskell's `show`, automatically escapes unreadable characters, but leaves readable characters unescaped. The criteria for "readable" can be found in CPython's code [3]. If we were to realize this proposal, Python could be a source of inspiration.

Unresolved Problems
===================

There are some currently unresolved (not discussed enough) issues.

+ Locales.

  What if the specified locale does not support Unicode? Hécate Moonlight pointed out that PEP-538 [4] could be a reference.

+ Unicode versions.

  Javran Cheng pointed out that u_iswprint is generated from a Unicode table, which is manually updated. This raises a concern that the definition of "printable" characters could change from version to version.

+ Definition of "readable".

  Unicode already defines "printability". It's good, but it is not necessarily what we want here.

  - Should we support RTL?
  - Should we design a Haskell-specific definition of readability, to avoid a new Unicode version silently introducing breakage?

(More?)

Some issues here perhaps require better answers to: What is our expectation of Show? Where should it be used? Should we expect it to break on every Unicode update?

[1] https://hackage.haskell.org/package/text-1.2.4.1/docs/src/Data.Text.Show.htm...
[2] https://mail.haskell.org/pipermail/haskell-cafe/2016-February/122874.html
[3] https://github.com/python/cpython/blob/bb3e0c240bc60fe08d332ff5955d54197f797...
[4] https://www.python.org/dev/peps/pep-0538/
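For readers who want to experiment with the proposed behavior without patching base, it can be approximated in user code. The sketch below is an illustration under assumptions, not the proposed change to GHC.Show: the names showUnicode and roundTrips are hypothetical, and "readable" is taken to mean Data.Char.isPrint, which is only one of the candidate definitions discussed above.

```haskell
import Data.Char (isAscii, isPrint, showLitChar)

-- | Hypothetical user-level counterpart of the proposed Show instance for
-- String: printable non-ASCII characters are kept, everything else is
-- escaped the way 'show' escapes it today.
showUnicode :: String -> String
showUnicode s = '"' : go s
  where
    go [] = "\""
    -- showLitChar does not escape the double quote, so do it here.
    go ('"' : cs) = "\\\"" ++ go cs
    -- "\SO" followed by 'H' would otherwise read back as the single
    -- character '\SOH', so an empty escape \& is inserted between them.
    go ('\SO' : cs@('H' : _)) = "\\SO\\&" ++ go cs
    go (c : cs)
      | isPrint c && not (isAscii c) = c : go cs
      -- Passing the rest of the output as the continuation lets
      -- showLitChar insert \& between a numeric escape and a digit.
      | otherwise = showLitChar c (go cs)

-- | The property the proposal relies on: 'read' already accepts
-- unescaped Unicode, so the output can still be read back.
roundTrips :: String -> Bool
roundTrips s = read (showUnicode s) == s
```

With these definitions, `putStrLn (showUnicode "Hello, 世界")` prints the string in quotes with the CJK characters intact, while control characters and unassigned code points are still escaped.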

Hi,
I think most seemed to agree on the motivation, but would it be a lot of work to ping a few large open source/industry projects about this and get a feel for what they think, or how much effort they would expect a migration to take? I'm afraid that we might take this too lightly and possibly cause a lot of engineering effort here. Our expectations of how, or how often, people use "show" might or might not be accurate.
I'm aware of e.g. the cardano wallet test suite (open source) and other cardano projects that are very large open source codebases and may be affected.
CCing duncan

Here is a simple patch, which I hope is close to what

    1. Modify a few guards in GHC.Show.showLitChar to not escape _readable_ Unicode characters out of the range of ASCII.

of the proposed change will look like:

diff --git a/libraries/base/GHC/Show.hs b/libraries/base/GHC/Show.hs
index 84077e473b..24569168d4 100644
--- a/libraries/base/GHC/Show.hs
+++ b/libraries/base/GHC/Show.hs
@@ -364,7 +364,10 @@ showCommaSpace = showString ", "
 -- > showLitChar '\n' s = "\\n" ++ s
 --
 showLitChar :: Char -> ShowS
-showLitChar c s | c > '\DEL' = showChar '\\' (protectEsc isDec (shows (ord c)) s)
+showLitChar c s | c > '\DEL' =
+    if isPrint c
+    then showChar c s
+    else showChar '\\' (protectEsc isDec (shows (ord c)) s)
 showLitChar '\DEL' s = showString "\\DEL" s
 showLitChar '\\' s = showString "\\\\" s
 showLitChar c s | c >= ' ' = showChar c s
@@ -380,6 +383,13 @@ showLitChar c s = showString ('\\' : asciiTab!!ord c) s
 -- I've done manual eta-expansion here, because otherwise it's
 -- impossible to stop (asciiTab!!ord) getting floated out as an MFE
 
+-- Local definition of isPrint to avoid fighting with cycles for now.
+isPrint :: Char -> Bool
+isPrint c = iswprint (ord c) /= 0
+
+foreign import ccall unsafe "u_iswprint"
+    iswprint :: Int -> Int
+
 showLitString :: String -> ShowS
 -- | Same as 'showLitChar', but for strings
 -- It converts the string to a string using Haskell escape conventions

I applied it to the ghc-8.10 branch:

    % _build/stage1/bin/ghc --interactive
    GHCi, version 8.10.5: https://www.haskell.org/ghc/  :? for help
    Prelude> "äiti"
    "äiti"
    Prelude> "мир"
    "мир"
    Prelude> print "мир"
    "мир"
    Prelude> "😀"
    "😀"

And then I ran the test suites of aeson, dhall and pandoc. The aeson test suite passed. The dhall test suites passed too. However, the pandoc test suite failed:

    78 out of 2819 tests failed (35.88s)

An example failure is:

    3587.md
      #1: FAIL (0.01s)
        --- test/command/3587.md
        +++ pandoc -f latex -t native
        + 1 [Para [Str "1 m",Space,Str "is",Space,Str "equal",Space,Str "to",Space,Str "1000 mm"]]
        - 1 [Para [Str "1\160m",Space,Str "is",Space,Str "equal",Space,Str "to",Space,Str "1000\160mm"]]

Str is a constructor of the Inline type, and takes Text:

    data Inline = Str Text | ...

As discussed on the GHC issue [1], the Text and ByteString Show instances piggyback on the String instance. Bodigrim said that Text will eventually migrate to do the same as the new Show String [2], so this issue will resurface.

Please explain the compatibility story. How should library writers write code (in test suites) which relies on Show String or Show Text, such that it supports GHC base (and/or text) versions on both sides of this breaking change?

I agree with Julian that the migration engineering effort required across (even just the open source) ecosystem is non-trivial. Having a good plan would hopefully make it easier to accept that cost.

The fact that it's a change which is not detectable at compile time makes me very anxious about this, even though I don't disagree with the motivation. I have very little idea if and where I depend on Show String behavior.

It would also be interesting to see the results of the test suites of all of Stackage, but I leave that for someone else to do.

- Oleg

[1]: https://gitlab.haskell.org/ghc/ghc/-/issues/20027
[2]: https://gitlab.haskell.org/ghc/ghc/-/issues/20027#note_363519
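One conceivable ingredient of such a compatibility story is for a test suite to stop depending on whatever escaping base happens to implement, and to pin the escaping itself. The sketch below is an assumption-laden illustration, not an existing API: it borrows the name showEscaped from the proposal, but the implementation is written here from scratch so that it always produces today's fully escaped form for characters above '\DEL', whichever way `show` behaves.

```haskell
import Data.Char (isDigit, ord, showLitChar)

-- | Hypothetical stand-in for the proposal's showEscaped: always escape
-- characters above '\DEL' numerically, as 'show' does at the time of
-- writing, so that golden files do not depend on the base version.
showEscaped :: String -> String
showEscaped s = '"' : go s
  where
    go [] = "\""
    go ('"' : cs) = "\\\"" ++ go cs
    -- Keep "\SO" followed by 'H' distinguishable from '\SOH'.
    go ('\SO' : cs@('H' : _)) = "\\SO\\&" ++ go cs
    go (c : cs)
      -- Handled explicitly instead of via showLitChar, because this is
      -- exactly the guard that the patch above changes.
      | c > '\DEL' = '\\' : protect (show (ord c)) (go cs)
      -- The ASCII range is not affected by the patch.
      | otherwise = showLitChar c (go cs)

    -- Insert the empty escape \& when a digit follows a numeric escape.
    protect digits rest@(d : _) | isDigit d = digits ++ "\\&" ++ rest
    protect digits rest = digits ++ rest
```

A golden test could then compare against `showEscaped x` for String values. This does not help with derived Show instances of larger structures such as pandoc's Inline, which call `show` on their fields internally, so it addresses only part of the question.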

On 08/07/2021 17.53, Oleg Grenrus wrote: [--snip--]
78 out of 2819 tests failed (35.88s)
The use of Show in 'golden' test suites is interesting because Show doesn't really guarantee any real form of stability in its output.

I guess Hyrum's Law applies here too.

Anyway... just an idle observation. Obviously, breaking loads of test suites is going to be hard to swallow.

Regards,

Yes, I think someone already mentioned that the Haskell 2010 Report says for Data.Char:

    showLitChar :: Char -> ShowS

    Convert a character to a string using only printable characters, using Haskell source-language escape conventions. For example:

        showLitChar '\n' s = "\\n" ++ s

with

    isPrint :: Char -> Bool

    Selects printable Unicode characters (letters, numbers, marks, punctuation, symbols and spaces).

being one obvious, yet unused, definition.

Also, if one looks at the history of GHC, showLitChar has been essentially unchanged since 2001, when it was introduced to the source tree.

So while technically it won't be a deviation from the report, I don't think "technically correct" is necessarily correct here.

- Oleg

I guess this is the perfect time to come up with a Render typeclass that targets end-users rather than satisfying 'read . show = id'.

chessai: Could the CLC appoint someone to start working on something? This would be pretty useful to unify the ecosystem (instead of having custom Outputable, Render, etc).
-- Hécate ✨ 🐦: @TechnoEmpress IRC: Hecate WWW: https://glitchbra.in RUN: BSD

On Thu, 8 Jul 2021, Hécate wrote:
I guess this is the perfect time to come up with a Render typeclass that targets end-users rather than satisfying 'read . show = id'.
chessai: Could the CLC appoint someone to start working on something? This would be pretty useful to unify the ecosystem (instead of having custom Outputable, Render, etc).
The question is, whether such a one-fits-all library actually serves the needs of the users. Actually there is already 'printf' which is intended for formatting end-user output. You can write your own PrintfArg instances. Today I am using custom Format classes per project.
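As an illustration of the printf route, here is a minimal sketch of a custom PrintfArg instance. The Greeting type is made up for this example; formatArg and formatString come from Text.Printf in base. Since formatString emits the characters as they are, non-ASCII text is not escaped.

```haskell
import Text.Printf (PrintfArg (..), formatString, printf)

-- | Hypothetical wrapper for text that is meant for end users.
newtype Greeting = Greeting String

-- Delegate to the String formatter, so that %s (and %v) work and
-- width and alignment flags are honoured.
instance PrintfArg Greeting where
  formatArg (Greeting s) = formatString s

-- | Example use: printf can return a String as well as run in IO.
hello :: Greeting -> String
hello g = printf "Hello, %s!" g
```

The same instance also works with the IO variant, e.g. `printf "Hello, %s!\n" (Greeting "мир")`.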

From what I can see from the various typeclasses that exist:

HaTeX's Render https://hackage.haskell.org/package/HaTeX-3.22.3.0/docs/Text-LaTeX-Base-Rend...:

    Class of values that can be transformed to Text https://hackage.haskell.org/package/HaTeX-3.22.3.0/docs/Text-LaTeX-Base-Rend....

core-text's Render https://hackage.haskell.org/package/core-text-0.3.0.0/docs/Core-Text-Utiliti...:

    Types which can be rendered "prettily", that is, formatted by a pretty printer and embossed with beautiful ANSI colours when printed to the terminal.

ttc's Render https://hackage.haskell.org/package/ttc-1.1.0.1/docs/Data-TTC.html#t:Render:

    The Render type class renders a data type as a textual data type.

When defining such classes, people don't seem to enforce much, and on the contrary seem to define the requirements loosely, allowing for ANSI colours for instance. Since we have a mechanism to produce our own instances through newtype wrappers and use Generalised Newtype Deriving or Deriving Via, I don't see much of a problem at this point (of course I only ask to be proven wrong).

One example in another language is Rust, which has the Debug https://doc.rust-lang.org/core/fmt/trait.Debug.html and Display https://doc.rust-lang.org/core/fmt/trait.Display.html traits, Debug being akin to our Show, and Display having the following property:

    Display is similar to Debug, but Display is for user-facing output, and so cannot be derived.

It is also interesting to note that these traits belong to a wider family of formatting traits https://doc.rust-lang.org/core/fmt/index.html used with their built-in formatting syntax (Binary, Debug, Lower & Upper Hex, Display, Octal, Write, etc).

–––

Henning, you are right, `printf` does exist, and it looks pretty rich in terms of features and customisation. I acknowledge that not using it is mostly a cultural matter, and that it can be fixed by improving our current pedagogical material.

If we decide as a community that a typeclass is actually not the proper tool to get away from Show instances, then I will welcome with open arms blog posts & tutorials to guide our community towards this solution, and promote them.
-- Hécate ✨ 🐦: @TechnoEmpress IRC: Hecate WWW: https://glitchbra.in RUN: BSD
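To make the Debug/Display split and the Deriving Via remark concrete, here is a minimal sketch of what such a class could look like. Everything in it is hypothetical: the class name Display, the wrapper ViaShow and the example types are invented for illustration and do not correspond to an existing package.

```haskell
{-# LANGUAGE DerivingVia #-}
{-# LANGUAGE FlexibleInstances #-}

-- | Hypothetical class for user-facing output, in the spirit of Rust's
-- Display, as opposed to Show, which plays the role of Debug.
class Display a where
  display :: a -> String

-- | Strings are displayed as they are: no quotes, no escaping.
instance Display String where
  display = id

-- | Wrapper for opting in to a Show-based instance through Deriving Via.
newtype ViaShow a = ViaShow a

instance Show a => Display (ViaShow a) where
  display (ViaShow x) = show x

-- | A type whose Show output is already fit for users reuses it.
data Colour = Red | Green
  deriving Show
  deriving Display via (ViaShow Colour)

-- | A type with a dedicated, hand-written user-facing rendering.
newtype Greeting = Greeting String

instance Display Greeting where
  display (Greeting name) = "Hello, " ++ name ++ "!"
```

In this design nothing is displayed by accident: every type has to opt in, either with a hand-written instance or by explicitly borrowing its Show instance.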

On Thu, Jul 08, 2021 at 09:02:29PM +0200, Hécate wrote:
From what I can see from the various typeclasses that exist:
Sadly, no new or extant alternative type classes can address the problem, simply because they're NOT `Show`, which is the default interface for inspecting the content of a term, and so is the only one at all likely to be available for the terms of interest.

Yes, we should also consider having more type classes for presentation in forms other than compiler-friendly source syntax.

Thus, e.g. for DNS data, one might have `Show` for debugging, but `Present` for textual presentations of DNS resource records specified in the multitude of DNS-related RFCs and used as the input syntax in DNS zone files. (Thus a DNS RR would have a binary wire form, an ASCII ByteString presentation form, a `Show` instance that exposes all the constructors, ..., and in the case of domain names also a U-label form for display of domain names to users more comfortable with the native script).

Multiple applicable serialisation formats are to be expected, and yet I think that `Show` should still avoid escaping Unicode printable text...

I don't see a plausible path of getting most libraries that implement `Show` to start implementing a second parallel class for friendly display.

Unless you're thinking overlapping instances:

    instance Show a => Display a where
        display = show

    instance Display String where
        display = displayString

--
    Viktor.
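For reference, the overlapping-instances idea can be made to compile roughly as follows. This is a sketch under assumptions: the class Display and the function displayString are not defined anywhere in the thread, so both are given hypothetical definitions here, and the extensions and overlap pragmas are what GHC appears to need for this instance shape.

```haskell
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE UndecidableInstances #-}

-- | Hypothetical class for friendly display.
class Display a where
  display :: a -> String

-- | Fallback: anything with a Show instance is displayed via 'show'.
instance {-# OVERLAPPABLE #-} Show a => Display a where
  display = show

-- | More specific instance that wins for String.
instance {-# OVERLAPPING #-} Display String where
  display = displayString

-- | Hypothetical definition: display a string without quotes or escapes.
displayString :: String -> String
displayString = id
```

Note that only a top-level String takes the specific instance. A value such as `Just "мир"` falls back to `show`, so strings nested inside other types are still escaped by whatever Show String does.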

I don't see a plausible path of getting most libraries that implement `Show` to start implementing a second parallel class for friendly display.
* Extensive publicity of this typeclass in the community (social media, etc.)
* Stock deriving (implemented by GHC & the CLC)
* First-class place in documentation (Documentation Task Force of the HF)
* Coordinated effort with popular libraries to implement its adoption (CLC + stakeholders)
* Maybe a code-modding script using retrie https://hackage.haskell.org/package/retrie.
We will have to work as a community on that one, but I am convinced that this is doable.
On 08/07/2021 at 21:18, Viktor Dukhovni wrote:
On Thu, Jul 08, 2021 at 09:02:29PM +0200, Hécate wrote:
From what I can see from the various typeclasses that exist:
Sadly, no new or extant alternative type classes can address the problem, simply because they're NOT `Show`, which is the default interface for inspecting the content of a term, and so is the only one at all likely to be available for the terms of interest.
Yes, we should also consider having more type classes for presentation in forms other than compiler-friendly source syntax.
Thus, e.g. for DNS data, one might have `Show` for debugging, but `Present` for textual presentations of DNS resource records specified in the multitude of DNS-related RFCs and used as the input syntax in DNS zone files. (Thus a DNS RR would have a binary wire form, an ASCII ByteString presentation form, a `Show` instance that exposes all the constructors, ..., and in the case of domain names also a U-label form for display of domain names to users more comfortable with the native script).
Multiple applicable serialisation formats are to be expected, and yet I think that `Show` should still avoid escaping unicode printable text...
I don't see a plausible path of getting most libraries that implement `Show` to start implementing a second parallel class for friendly display.
Unless you're thinking overlapping instances:
instance Show a => Display a where display = show
instance Display String where display = displayString
-- Hécate ✨ 🐦: @TechnoEmpress IRC: Hecate WWW: https://glitchbra.in RUN: BSD

On Thu, 8 Jul 2021, Hécate wrote:
I don't see a plausible path of getting most libraries that implement `Show` to start implementing a second parallel class for friendly display.
* Extensive publicity of this typeclass in community (social media, etc) * Stock deriving (implemented by GHC & the CLC)
That works for Show, because Show is expected to display Haskell code. But how would you automatically derive human readable text from an arbitrary algebraic data type?
* First-class place in documentation (Documentation Task Force of the HF) * Coordinated effort with popular libraries to implement its adoption (CLC + stakeholders) * Maybe a code-modding script using retrie.
We will have to work as a community on that one but I am convinced that this is doable.

You can literally piggyback on Show's machinery for generic deriving? I don't understand the issue here. If the output is not easily read by humans, then implement the instance yourself?
On 08/07/2021 at 22:01, Henning Thielemann wrote:
On Thu, 8 Jul 2021, Hécate wrote:
I don't see a plausible path of getting most libraries that implement `Show` to start implementing a second parallel class for friendly display.
* Extensive publicity of this typeclass in community (social media, etc) * Stock deriving (implemented by GHC & the CLC)
That works for Show, because Show is expected to display Haskell code. But how would you automatically derive human readable text from an arbitrary algebraic data type?
* First-class place in documentation (Documentation Task Force of the HF) * Coordinated effort with popular libraries to implement its adoption (CLC + stakeholders) * Maybe a code-modding script using retrie.
We will have to work as a community on that one but I am convinced that this is doable.
-- Hécate ✨ 🐦: @TechnoEmpress IRC: Hecate WWW: https://glitchbra.in RUN: BSD
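One way to read "piggyback on Show" is a class whose default method falls back to `show`, so that an empty instance is enough and hand-written instances are only needed where the derived output is unfriendly. The sketch below assumes the `DefaultSignatures` extension; `Colour` and `Celsius` are hypothetical example types, and nothing here is an agreed design.

```haskell
{-# LANGUAGE DefaultSignatures #-}

class Render a where
  render :: a -> String
  -- Fall back to the (possibly derived) Show instance when nothing better is given.
  default render :: Show a => a -> String
  render = show

-- Hypothetical types for illustration.
data Colour = Red | Green deriving Show

newtype Celsius = Celsius Double deriving Show

-- Empty instance: reuses the derived Show output.
instance Render Colour

-- Hand-written instance where the derived output would not be friendly.
instance Render Celsius where
  render (Celsius d) = show d ++ " °C"
```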

On Thu, Jul 08, 2021 at 11:07:36PM +0200, Hécate wrote:
You can literally piggyback on Show's machinery for generic deriving? I don't understand the issue here. If the output is not easily-read by humans, then implement the instance yourself?
Indeed a `Render` (modulo bikeshedding the name) class can be derived in much the same way as `Show`, but might not be as friendly in many cases, because generically it would emit constructor and field names that might be omitted in a more friendly rendition of the structure.
So `Render` would warrant a specialised instance in more cases.
One might perhaps `render` a date in ISO 8601 format, skipping the constructor names.
If, for example, it was decided that a DNS Resource Record should `render` to its RFC Presentation form, then one might want to render a list of resource records (an RRSet) by just joining newline-separated strings:
    > putStrLn $ render $
        [ { rname = "example.com."
          , rclass = IN
          , rdata = RData { T_A "192.0.2.1" } }
        , { rname = "example.com."
          , rclass = IN
          , rdata = RData { T_A "127.0.0.1" } }
        ]
    example.com. IN A 192.0.2.1
    example.com. IN A 127.0.0.1
    >
rather than as:
    ["example.com. IN A 192.0.2.1","example.com. IN A 127.0.0.1"]
taking advantage of the possibility of custom presentation of lists.
It may be worth noting that on a smaller, incremental scale, once `Render` is established, there would be incentives to fix some of the non-lawful `Show` instances, which would introduce transient breakage for the consumers of the affected packages.
-- Viktor.
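The custom list presentation described above can be sketched by giving the class a `renderList` method, analogous to `showList` in `Show`. All names here (`Render`, `renderList`, `RR`, `RClass`, `RData`) are hypothetical and simplified from the example in the message; this is not an existing library API.

```haskell
import Data.List (intercalate)

class Render a where
  render :: a -> String
  -- Like showList: lets an element type decide how lists of it are rendered.
  renderList :: [a] -> String
  renderList xs = "[" ++ intercalate "," (map render xs) ++ "]"

instance Render a => Render [a] where
  render = renderList

-- Uses the default bracketed list form.
instance Render Int where
  render = show

-- Simplified, hypothetical DNS types.
data RClass = IN deriving Show

newtype RData = T_A String

data RR = RR { rname :: String, rclass :: RClass, rdata :: RData }

instance Render RR where
  render (RR n c (T_A addr)) = unwords [n, show c, "A", addr]
  -- An RRSet is rendered one record per line rather than as a bracketed list.
  renderList = intercalate "\n" . map render
```

With these definitions, `putStrLn (render [RR "example.com." IN (T_A "192.0.2.1"), RR "example.com." IN (T_A "127.0.0.1")])` prints one record per line, while `render [1, 2 :: Int]` keeps the bracketed form.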

It may be worth noting that on a smaller, incremental scale, once `Render` is established, there would be incentives to fix some of the non-lawful `Show` instances, which would introduce transient breakage for the consumers of the affected packages.
GHC releases frequently fix situations where code that was previously accepted will be rejected, so it feels normal that we decide to fix unlawful Show instances while providing a more adapted solution for human-readable outputs. :)
On 08/07/2021 at 23:37, Viktor Dukhovni wrote:
On Thu, Jul 08, 2021 at 11:07:36PM +0200, Hécate wrote:
You can literally piggyback on Show's machinery for generic deriving? I don't understand the issue here. If the output is not easily-read by humans, then implement the instance yourself?
Indeed a `Render` (modulo bikeshedding the name) class can be derived in much the same way as `Show`, but might not be as friendly in many cases, because generically it would emit constructor and field names that might be omitted in a more friendly rendition of the structure.
So `Render` would warrant a specialised instance in more cases.
One might perhaps `render` a date in ISO 8601 format, skipping the constructor names.
If, for example, it was decided that a DNS Resource Record should `render` to its RFC Presentation form, then one might want to render a list of resource records (an RRSet) by just joining newline separated strings:
    > putStrLn $ render $
        [ { rname = "example.com."
          , rclass = IN
          , rdata = RData { T_A "192.0.2.1" } }
        , { rname = "example.com."
          , rclass = IN
          , rdata = RData { T_A "127.0.0.1" } }
        ]
    example.com. IN A 192.0.2.1
    example.com. IN A 127.0.0.1
    >
rather than as:
["example.com. IN A 192.0.2.1","example.com. IN A 127.0.0.1"]
taking advantage of the possibility of custom presentation of lists.
It may be worth noting that on a smaller, incremental scale, once `Render` is established, there would be incentives to fix some of the non-lawful `Show` instances, which would introduce transient breakage for the consumers of the affected packages.
-- Hécate ✨ 🐦: @TechnoEmpress IRC: Hecate WWW: https://glitchbra.in RUN: BSD

Let me second the idea of a Render class; it's something I've wanted repeatedly. There is a clear tension between Show's requirements and human-readable outputs: strings get less readable, we can't summarize/truncate large values, it can't handle functions...
At the same time, Show is special because it is ubiquitous. We can derive Show automatically, and it's one of the few classes we can expect almost every data type to implement. Project-specific formatting classes cannot fulfill the same role as Show; they require a non-trivial setup cost (which especially compromises the beginner experience!) and they cannot be used by tools and libraries that are not project-specific. What can a library author do to make their types more readable and usable to their users? Today the best lever is the Show instance, which is why I've seen substantially more "unlawful" Show instances in the wild compared to any other base typeclass.
Other languages like Python and Rust have a similar split. I've been doing a fair amount of data sciencey Python lately, and I can tell you that the experience at the normal interpreter (not even talking about Jupyter) is definitely better than in Haskell because things like dataframes are rendered in a human-readable way, formatted as a table and truncated.
Having a human-readable class is a non-trivial proposal on its own, and normally I wouldn't want to link it to a different change. In this case, however, there is a clear overlap: a new Render class lets us have two different to-string behaviors in parallel, which solves the problem in a similar way to the other alternatives proposed like a language extension or changing GHCi.
On Thu, Jul 8, 2021 at 11:27 AM Henning Thielemann <lemming@henning-thielemann.de> wrote:
On Thu, 8 Jul 2021, Hécate wrote:
I guess this is the perfect time to come up with a Render typeclass that targets end-users rather than satisfying 'read . show = id'.
chessai: Could the CLC appoint someone to start working on something? This would be pretty useful to unify the ecosystem (instead of having custom Outputable, Render, etc).
The question is, whether such a one-fits-all library actually serves the needs of the users. Actually there is already 'printf' which is intended for formatting end-user output. You can write your own PrintfArg instances. Today I am using custom Format classes per project.
_______________________________________________ Libraries mailing list Libraries@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries

GHC Manual: Using a custom interactive printing function [1]
    Prelude Data.Char> "foo"
    "foo"
    Prelude Data.Char> let myprint=putStrLn . map toUpper . show
    Prelude Data.Char> :set -interactive-print myprint
    Prelude Data.Char> "foo"
    "FOO"
I remember some people using e.g. pretty-show or pretty-simple packages with great success.
No need to change anything in base, just make users aware of a configuration possibility.
Also this allows people to actually experiment and come up with good design for a library specifically tailored for interactive printing. For example, Python's repr implementations are often configurable, as in pandas AFAIU, as one size doesn't fit all tastes. (Neat case for implicit params in Haskell?)
I'm not in support of a new class in base, without prior art. Changing stuff in base is difficult, better to get it (quite) right.
I'm undecided on the original proposal itself, I'll wait for OP to answer my questions first, and amend the proposal if needed.
[1]: https://downloads.haskell.org/~ghc/9.0.1/docs/html/users_guide/ghci.html#usi...
On 8.7.2021 22.21, Tikhon Jelvis wrote:
Let me second the idea of a Render class—it's something I've wanted repeatedly. There is a clear tension between Show's requirements and human-readable outputs: strings get less readable, we can't summarize/truncate large values, it can't handle functions...
At the same time, Show is special because it is ubiquitous. We can derive Show automatically, and it's one of the few classes we can expect almost every data type to implement. Project-specific formatting classes cannot fulfill the same role as Show; they require a non-trivial setup cost (which especially compromises the beginner experience!) and they cannot be used by tools and libraries that are not project-specific. What can a library author do to make their types more readable and usable to their users? Today the best lever is the Show instance, which is why I've seen substantially more "unlawful" Show instances in the wild compared to any other base typeclass.
Other languages like Python and Rust have a similar split. I've been doing a fair amount of data sciencey Python lately, and I can tell you that the experience at the normal interpreter (not even talking about Jupyter) is definitely better than in Haskell because things like dataframes are rendered in a human-readable way, formatted as a table and truncated.
Having a human-readable class is a non-trivial proposal on its own, and normally I wouldn't want to link it to a different change. In this case, however, there is a clear overlap: a new Render class lets us have two different to-string behaviors in parallel, which solves the problem in a similar way to the other alternatives proposed like a language extension or changing GHCi.
On Thu, Jul 8, 2021 at 11:27 AM Henning Thielemann <lemming@henning-thielemann.de> wrote:
On Thu, 8 Jul 2021, Hécate wrote:
I guess this is the perfect time to come up with a Render typeclass that targets end-users rather than satisfying 'read . show = id'.
chessai: Could the CLC appoint someone to start working on something? This would be pretty useful to unify the ecosystem (instead of having custom Outputable, Render, etc).
The question is, whether such a one-fits-all library actually serves the needs of the users. Actually there is already 'printf' which is intended for formatting end-user output. You can write your own PrintfArg instances. Today I am using custom Format classes per project.

On Thu, 8 Jul 2021, Oleg Grenrus wrote:
GHC Manual: Using a custom interactive printing function [1]
    Prelude Data.Char> "foo"
    "foo"
    Prelude Data.Char> let myprint=putStrLn . map toUpper . show
    Prelude Data.Char> :set -interactive-print myprint
    Prelude Data.Char> "foo"
    "FOO"
I remember some people using e.g. pretty-show or pretty-simple packages with great success.
That's great!
No need to change anything in base, just make users aware of a configuration possibility.
I missed that feature, too.

On Thu, Jul 08, 2021 at 10:48:28PM +0300, Oleg Grenrus wrote:
GHC Manual: Using a custom interactive printing function [1]
    Prelude Data.Char> "foo"
    "foo"
    Prelude Data.Char> let myprint=putStrLn . map toUpper . show
    Prelude Data.Char> :set -interactive-print myprint
    Prelude Data.Char> "foo"
    "FOO"
I remember some people using e.g. pretty-show or pretty-simple packages with great success.
No need to change anything in base, just make users aware of a configuration possibility.
This is NOT viable: not all escapes are valid to unescape after the fact in a print function. Which pieces of the `show` output are escaped Unicode characters, and which are not, depends on the original data type that got serialised to a given `String`. For example, in a `ByteString` the octet '\xFC' should display as '\252', while in a `String` or `Text` U+00FC should display as 'ü'. Thus, there is no correct unescape function for the string "M\\252ller": it could be "Müller", or not...
-- Viktor.
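The ambiguity can be reproduced in a few lines. The sketch below assumes the `bytestring` package that ships with GHC and the current (escaping) `Show` instances; the names `asString`, `asBytes` and `sameShown` are hypothetical.

```haskell
import qualified Data.ByteString.Char8 as B

-- A String whose second character is U+00FC (ü).
asString :: String
asString = "M\252ller"

-- A ByteString whose second octet is 0xFC.
asBytes :: B.ByteString
asBytes = B.pack "M\252ller"

-- With the current instances both values show identically, so a printer
-- that post-processes the output of show cannot tell which one it was given.
sameShown :: Bool
sameShown = show asString == show asBytes
```

A post-hoc unescaping function only sees the shown text, which is why it cannot decide whether `\252` stands for a character or for a raw octet.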

On Fri, Jul 9, 2021 at 3:18 AM Hécate wrote:
I guess this is the perfect time to come up with a Render typeclass that targets end-users rather than satisfying 'read . show = id'.
This is the primary motivation of the TTC (Textual Type Classes) library: https://hackage.haskell.org/package/ttc. It provides a Render type class that is analogous to Show and a Parse type class that is analogous to Read. No instances are declared so that users of the library can implement instances as required for each application, but default instances for core types are available. Travis

On 08-07-21 19:04, Travis Cardwell via Libraries wrote:
This is the primary motivation of the TTC (Textual Type Classes) library: https://hackage.haskell.org/package/ttc.
It provides a Render type class that is analogous to Show and a Parse type class that is analogous to Read. No instances are declared so that users of the library can implement instances as required for each application, but default instances for core types are available.
Not related to the main point of the thread, but we really need a way to signal certain libraries as the "go-to solution" for a problem without including them in "base". A note in the Show/Read class documentation saying that the best solution for "showing" human text is a certain library would be great.
-- Rubén -- pgp: 4EE9 28F7 932E F4AD

On Fri, Jul 09, 2021 at 08:04:40AM +0900, Travis Cardwell via Libraries wrote:
On Fri, Jul 9, 2021 at 3:18 AM Hécate wrote:
I guess this is the perfect time to come up with a Render typeclass that targets end-users rather than satisfying 'read . show = id'.
This is the primary motivation of the TTC (Textual Type Classes) library: https://hackage.haskell.org/package/ttc.
It provides a Render type class that is analogous to Show and a Parse type class that is analogous to Read. No instances are declared so that users of the library can implement instances as required for each application, but default instances for core types are available.
One promising feature of this is that the target type is not restricted to just `String`, but rather `Render` produces a `Textual` value, which can be a String, Text, ByteString, or a Text or Binary `builder`. I would prefer to also see a ByteString `Builder` as an option, but the high level design feels sound. -- Viktor.

It would also be good to have a summary of the previous discussion that the OP kindly linked [1], e.g. the comment by David Turner [2]:
One of the most visible uses of Show is that it's how values are shown in GHCi. As mentioned earlier in this thread, if you're teaching in a non-ASCII language then the user experience is pretty poor.
On the other hand, I see Show (like .ToString() in C# etc.) as a debugging tool: not for seriously robust serialisation but useful if you need to dump a value into a log message or email or similar. And in that situation it's very useful if it sticks to ASCII: non-ASCII content just isn't resilient enough to being passed around the network, truncated and generally mutilated on the way through.
These are definitely two different concerns and they pull in opposite directions in this discussion. It's a matter of opinion which you think is more important. Me, I think the latter, but then I do a lot of logging and speak a language that fits into ASCII. YMMV!
This proposal is motivated by the first point, but doesn't mention debugging other than
2. This is an actual annoyance during debugging localized software, or strings with emojis
which I don't agree with. For example, look at the failing pandoc test case in my previous message. \160 is a non-breaking space, which looks like a normal space when rendered. I have my share of bad experience with it. So, indeed YMMV.
- Oleg
[1]: https://mail.haskell.org/pipermail/haskell-cafe/2016-February/122874.html
[2]: https://mail.haskell.org/pipermail/haskell-cafe/2016-February/122899.html
On 8.7.2021 18.53, Oleg Grenrus wrote:
Here is a simple patch, which I hope is close to what item 1 of the proposed change ("Modify a few guards in GHC.Show.showLitChar to not escape _readable_ Unicode characters out of the range of ASCII") will look like:
diff --git a/libraries/base/GHC/Show.hs b/libraries/base/GHC/Show.hs
index 84077e473b..24569168d4 100644
--- a/libraries/base/GHC/Show.hs
+++ b/libraries/base/GHC/Show.hs
@@ -364,7 +364,10 @@ showCommaSpace = showString ", "
 -- > showLitChar '\n' s = "\\n" ++ s
 --
 showLitChar :: Char -> ShowS
-showLitChar c s | c > '\DEL' = showChar '\\' (protectEsc isDec (shows (ord c)) s)
+showLitChar c s | c > '\DEL' =
+    if isPrint c
+    then showChar c s
+    else showChar '\\' (protectEsc isDec (shows (ord c)) s)
 showLitChar '\DEL' s = showString "\\DEL" s
 showLitChar '\\' s = showString "\\\\" s
 showLitChar c s | c >= ' ' = showChar c s
@@ -380,6 +383,13 @@ showLitChar c s = showString ('\\' : asciiTab!!ord c) s
 -- I've done manual eta-expansion here, because otherwise it's
 -- impossible to stop (asciiTab!!ord) getting floated out as an MFE
 
+-- Local definition of isPrint to avoid fighting with cycles for now.
+isPrint :: Char -> Bool
+isPrint c = iswprint (ord c) /= 0
+
+foreign import ccall unsafe "u_iswprint"
+    iswprint :: Int -> Int
+
 showLitString :: String -> ShowS
 -- | Same as 'showLitChar', but for strings
 -- It converts the string to a string using Haskell escape conventions
I applied it to ghc-8.10 branch,
    % _build/stage1/bin/ghc --interactive
    GHCi, version 8.10.5: https://www.haskell.org/ghc/ :? for help
    Prelude> "äiti"
    "äiti"
    Prelude> "мир"
    "мир"
    Prelude> print "мир"
    "мир"
    Prelude> "😀"
    "😀"
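The effect of the patch can be approximated outside of base, without rebuilding GHC. The sketch below is not the patch itself: `showStringU` is a hypothetical name, and it uses `Data.Char.isPrint` and `Data.Char.showLitChar` from an unmodified base instead of the FFI import in the diff.

```haskell
import Data.Char (isPrint, showLitChar)

-- Approximation of the patched `show` for String: printable characters
-- above '\DEL' are kept as-is, everything else is escaped as today.
showStringU :: String -> String
showStringU str = '"' : foldr step "\"" str
  where
    -- The double quote needs its own case, because showLitChar leaves it alone.
    step '"' rest = "\\\"" ++ rest
    step c rest
      | c > '\DEL' && isPrint c = c : rest
      | otherwise               = showLitChar c rest
```

Because the rest of the escaping logic is delegated to `showLitChar`, the output should still be readable by `read`, which is the property the proposal relies on.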
I then ran the test suites of aeson, dhall and pandoc.
The aeson test suite passed. The dhall test suites passed too. However, the pandoc test suite failed:
78 out of 2819 tests failed (35.88s)
An example failure is:
    3587.md
      #1: FAIL (0.01s)
        --- test/command/3587.md
        +++ pandoc -f latex -t native
        + 1 [Para [Str "1 m",Space,Str "is",Space,Str "equal",Space,Str "to",Space,Str "1000 mm"]]
        - 1 [Para [Str "1\160m",Space,Str "is",Space,Str "equal",Space,Str "to",Space,Str "1000\160mm"]]
Str is a constructor of the Inline type, and takes Text:
    data Inline = Str Text | ...
As discussed on the GHC issue [1], the Text and ByteString Show instances piggyback on the String instance. Bodigrim said that Text will eventually migrate to do the same as the new Show String [2], so this issue will resurface.
Please explain the compatibility story. How should library writers write code (in test suites) that relies on Show String or Show Text, so that it supports base (and/or text) versions on both sides of this breaking change?
I agree with Julian that required migration engineering effort across (even just the open source) ecosystem is non-trivial. Having a good plan would hopefully make it easier to accept that cost.
The fact that this change is not detectable at compile time makes me very anxious about it, even though I don't disagree with the motivation. I have very little idea if and where I depend on Show String behavior.
It would also be interesting to see results of test-suites of all Stackage, but I leave it for someone else to do.
- Oleg
[1]: https://gitlab.haskell.org/ghc/ghc/-/issues/20027
[2]: https://gitlab.haskell.org/ghc/ghc/-/issues/20027#note_363519
On 8.7.2021 15.25, Julian Ospald wrote:
Hi,
I think most seemed to agree on the motivation, but would it be a lot of work to ping a few large opensource/industry projects about this and get a feel what they think or how much of an expected effort a migration would be? I'm afraid that we might take this too lightly and possibly cause a lot of engineering effort here. Our expectations how or how often people use "show" might or might not be accurate.
I'm aware of e.g. the cardano wallet test suite (open source) and other cardano projects that are very large open source codebases and may be affected.
CCing duncan
On July 8, 2021 10:11:28 AM UTC, Kai Ma wrote:
Hi all
Two weeks ago, I proposed “Support Unicode characters in instance Show String” [0] in the GHC issue tracker, and chessai asked me to post it here for wider feedback. The proposal posted here is edited to reflect new ideas proposed and insights accumulated over the days:
1. (Proposal) The proposal itself is now modeled after Python.
2. (Alternative Options) Alternative 2 is the original proposal.
3. (Downsides) New. About breakage.
4. (Prior Art) New.
5. (Unresolved Problems) New. Included for discussion.
Even though I wanted to summarize everything here, some insightful comments are perhaps not included or misunderstood. These original comments can be found at the original feature request.
[0] https://gitlab.haskell.org/ghc/ghc/-/issues/20027
Motivation
----------
Unicode has been widely adopted and people around the world rely on Unicode to write in their native languages. Haskell, however, has been stuck in ASCII, and escapes all non-ASCII characters in the String instance of the Show class, despite the fact that each element of a String is typically a Unicode code point, and putStrLn actually works as expected. Consider the following examples:
    ghci> print "Hello, 世界"
    "Hello, \19990\30028"
    ghci> print "Hello, мир"
    "Hello, \1084\1080\1088"
    ghci> print "Hello, κόσμος"
    "Hello, \954\972\963\956\959\962"
    ghci> "Hello, 世界" -- ghci calls `show`, so string literals are also escaped
    "Hello, \19990\30028"
    ghci> "😀" -- Not only human scripts, but also emojis!
    "\128512"
This status quo is unsatisfactory for a number of reasons:
1. Even though it's small, it somehow creates an unwelcoming atmosphere for native speakers of languages whose scripts are not representable in ASCII.
2. This is an actual annoyance during debugging localized software, or strings with emojis.
3. Following 1, Haskell teachers are forced to use other languages instead of the students' mother tongues, or relying on I/O functions like putStrLn, creating a rather unnecessary burden.
4. Other string types, like Text [1], rely on this Show instance.
Moreover, `read` already can handle Unicode strings today, so relaxing constraints on `show` doesn't affect `read . show == id`.
Proposal
--------
It's proposed here to change the Show instance of String, to achieve the following output:
    ghci> print "Hello, 世界"
    "Hello, 世界"
    ghci> print "Hello, мир"
    "Hello, мир"
    ghci> print "Hello, κόσμος"
    "Hello, κόσμος"
    ghci> "Hello, 世界"
    "Hello, 世界"
    ghci> "😀"
    "😀"
More concretely, it means:
1. Modify a few guards in GHC.Show.showLitChar to not escape _readable_ Unicode characters out of the range of ASCII.
2. Provide a function showEscaped or newtype Escaped = Escaped String to obtain the current escaping behavior, in case anyone wants the current behavior back.
This proposal isn't about unescaping everything, but only readable Unicode characters. u_iswprint (GHC.Unicode.isPrint) seems to do the job, and indeed, there was a similar proposal before [2]. In summary, the behavior is similar to what Python `repr` does.
Alternative Options
-------------------
1. Always use putStrLn.
This is viable today but unsatisfactory as it requires stdout. In some cases, stdout is not accessible, e.g. Telegram or Discord bots.
2. Don't escape anything.
`show` itself refrains from escaping most of the characters, and lets ghci do the job instead.
3. Customize ghci instead.
ghci intercepts output strings and checks if they can be converted back to readable characters. This potentially allows for better compatibility with a variety of strangely behaving terminals, and finer-grained user control.
Tom Ellis proposed `-interactive-print`-based solutions in the comment section.
4. A new language extension, e.g. ShowStringUnicode.
Proposed by Julian Ospald. When enabled, readable Unicode characters are not escaped, and this is enabled by default by ghci. There are concerns about how this would affect cross-module behavior.
Downsides
---------
This is definitely a breaking change, but the breakage, to our current understanding, is limited.
First, use of `show` in production code is discouraged. Even if someone really does that, the breakage only happens when one tries to send the "serialized" data over wire:
Suppose Machine A `show`-ed a string and saved it into a UTF-8-encoded file, and sends it to Machine B, which expects another encoding. This would be surprising for those who are used to the old behavior.
Second, though the breakage is not likely to be catastrophic for correct production code, test suites could be badly affected, as pointed out by Oleg Grenrus and vdukhovni in the comment section. Some test suites compare `show` results with expected results. vdukhovni further commented that Haskell escapes are not universally supported by non-Haskell tools, so the impact would be confined to Haskell.
Prior Art
---------
Python has supported Unicode natively since Python 3. Python's approach is intuitive and capable. Its `repr`, which is equivalent to Haskell's `show`, automatically escapes unreadable characters, but leaves readable characters unescaped. The criteria for "readable" can be found in CPython's code [3]. If we were to realize this proposal, Python could be a source of inspiration.
Unresolved Problems
-------------------
There are some currently unresolved (not discussed enough) issues.
+ Locales.
What if the specified locale does not support Unicode? Hécate Moonlight pointed out PEP-538 [4] could be a reference.
+ Unicode versions.
Javran Cheng pointed out u_iswprint is generated from a Unicode table, which is manually updated. This raises a concern that the definition of "printable" characters could change from version to version.
+ Definition of "readable".
Unicode already defined "printability". It's good, but it is not necessarily what we want here.
- Should we support RTL?
- Should we design a Haskell-specific definition of readability, to avoid new Unicode versions silently introducing breakage?
(More?)
Some issues here perhaps require better answers to: What is our expectation of Show? Where should it be used? Should we expect it to break on every Unicode update?
[1] https://hackage.haskell.org/package/text-1.2.4.1/docs/src/Data.Text.Show.htm...
[2] https://mail.haskell.org/pipermail/haskell-cafe/2016-February/122874.html
[3] https://github.com/python/cpython/blob/bb3e0c240bc60fe08d332ff5955d54197f797...
[4] https://www.python.org/dev/peps/pep-0538/

On Thu, Jul 08, 2021 at 06:53:38PM +0300, Oleg Grenrus wrote:
An example failure is:
3587.md #1: FAIL (0.01s) --- test/command/3587.md +++ pandoc -f latex -t native + 1 [Para [Str "1 m",Space,Str "is",Space,Str "equal",Space,Str "to",Space,Str "1000 mm"]] - 1 [Para [Str "1\160m",Space,Str "is",Space,Str "equal",Space,Str "to",Space,Str "1000\160mm"]]
Str is a constructor of Inline type, and takes Text: data Inline = Str Text | ... As discussed on the GHC issue [1], Text and ByteString Show Instances piggyback on String instance. Bodigrim said that Text will eventually migrate to do the same as new Show String [2], so this issue will resurface.
Please explain the compatibility story. How library writes should write their code (in test-suites) which rely on Show String or Show Text, such that they could support GHC base versions (and/or text) versions on the both sides of this breaking change.
One possible approach is to apply "show @String . read @String" to normalise expected literal string or text fragments. This is easiest when the expected result is *code* in the test, since the transformation can be applied to the code that encapsulates the expected value.

When there are test *files* that hold the "expected output" of `show` for particular inputs, clearly if `show` changes there can't be a single fixed file that determines the success of the test case. So for reproducible tests, one might have to generate the "expected output" file, by normalising appropriate fragments as above. Likely a QuasiQuoter can be defined to simplify the task.

-- Viktor.
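A minimal sketch of the normalisation idea above, for a single string fragment. The helper name `normaliseShown` is hypothetical, and the sketch assumes the stored fragment is a well-formed Haskell string literal (`read` throws otherwise). It only covers one literal; picking the string fragments out of a larger shown structure is left open here.

```haskell
{-# LANGUAGE TypeApplications #-}
module Main (main) where

-- | Re-render a shown String literal with whichever Show String instance
-- the installed base provides. 'normaliseShown' is a hypothetical name.
normaliseShown :: String -> String
normaliseShown = show @String . read @String

main :: IO ()
main = do
  -- Text as it might appear in a stored "expected output" file.
  let expected = "\"1\\160m\""
      -- What the code under test produces with the installed base.
      actual = show "1\160m"
  -- Compare after normalising the stored fragment.
  print (normaliseShown expected == actual)
```

With the current escaping instance this prints True. The comparison should also hold under a Unicode-preserving instance, provided `read` keeps accepting numeric escapes such as `\160`.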

On Thu, Jul 08, 2021 at 06:11:28PM +0800, Kai Ma wrote:
It's proposed here to change the Show instance of String, to achieve the following output:

ghci> print "Hello, 世界"
"Hello, 世界"
ghci> print "Hello, мир"
"Hello, мир"
ghci> print "Hello, κόσμος"
"Hello, κόσμος"
ghci> "Hello, 世界"
"Hello, 世界"
ghci> "😀"
"😀"

Another possibility is to extend the `Show` class with two new methods and their default implementations:

    class Show a where
        ...
        showsPrecUnicode :: Int -> a -> ShowS
        showsPrecUnicode = showsPrec

        showListUnicode :: [a] -> ShowS
        showListUnicode = showList

    showUnicode :: Show a => a -> String
    showUnicode x = showsPrecUnicode 0 x ""

at which point a small number of instances can override `showsPrecUnicode` and `showListUnicode`:

    instance Show a => Show [a] where
        showsPrec _ = showList
        showsPrecUnicode _ = showListUnicode

    instance Show Char where
        showsPrecUnicode = ... -- Unicode char
        showListUnicode = ... -- Unicode string

    instance Show Text where
        showsPrecUnicode = ... -- Unicode text

Once these are implemented, "ghci" can be modified to use `showUnicode` instead of `show`, with no new incompatibilities elsewhere.

We can also introduce `uprint = putStrLn . showUnicode`, ...

This would still require explicit opt-in to use the Unicode show, but it would be available for all `Show` instances, and used by default in "ghci".

It would still be a good idea to implement `Render`, which is a related but separate concern.

-- Viktor.
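To make the suggestion concrete, here is a rough, runnable sketch. Since `Show` in base cannot be extended from user code, it uses a separate, hypothetical class `ShowU` (with `showsPrecU`, `showListU`, `showU`) as a stand-in for the proposed methods. The choice to keep every character above `'\DEL'` that satisfies `isPrint` is an assumption of this sketch, not something settled in the thread.

```haskell
module Main (main) where

import Data.Char (isPrint, showLitChar)
import System.IO (hSetEncoding, stdout, utf8)

-- Hypothetical stand-in for the proposed extra Show methods.
class Show a => ShowU a where
  showsPrecU :: Int -> a -> ShowS
  showsPrecU = showsPrec          -- default: fall back to the escaping show

  showListU :: [a] -> ShowS
  showListU = showList

showU :: ShowU a => a -> String
showU x = showsPrecU 0 x ""

-- Keep printable non-ASCII characters; escape everything else as show does.
-- The first argument is the quote character that needs a backslash.
litCharU :: Char -> Char -> ShowS
litCharU quote c
  | c == quote              = showChar '\\' . showChar c
  | c > '\DEL' && isPrint c = showChar c
  | otherwise               = showLitChar c

instance ShowU Char where
  showsPrecU _ c = showChar '\'' . litCharU '\'' c . showChar '\''
  showListU cs   = showChar '"' . foldr ((.) . litCharU '"') id cs . showChar '"'

instance ShowU a => ShowU [a] where
  showsPrecU _ = showListU

-- No overrides needed: the defaults reuse the ordinary Show Int.
instance ShowU Int

main :: IO ()
main = do
  hSetEncoding stdout utf8
  putStrLn (show  "Hello, 世界")      -- escaped, as today
  putStrLn (showU "Hello, 世界")      -- Unicode preserved
  putStrLn (showU [1, 2, 3 :: Int])  -- unchanged for non-string data
```

Note that, as in the sketch in the message, the list instance only overrides the `showsPrec` counterpart, so a list of strings would still fall back to the escaping `showList`.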

This is a tricky design. Any instance in the composition which doesn't define showsPrecUnicode will ruin formatting of inner Strings. Maybe it's not as bad as in aeson* as most Show instances are derived. (But not Show1 etc.)

* The problem with aeson is that the ToJSON class has toJSON (the old method) and toEncoding, which is fast but whose default implementation uses the slower toJSON. Thus a derived `instance ToJSON MyType` generally silently ruins the performance.

It might be better not to define a default implementation for showsPrecUnicode (because most instances are derived). Although that will break explicitly written instances, fixing them is straightforward. (GHC developers who use head.hackage will hate this option though.)

I'm not sure this is any better than a separate class, or whether two type classes (either actually two or two-in-one) is a good idea.

- Oleg

On 11.7.2021 7.01, Viktor Dukhovni wrote:
On Thu, Jul 08, 2021 at 06:11:28PM +0800, Kai Ma wrote:
It's proposed here to change the Show instance of String, to achieve the following output:
ghci> print "Hello, 世界"
"Hello, 世界"
ghci> print "Hello, мир"
"Hello, мир"
ghci> print "Hello, κόσμος"
"Hello, κόσμος"
ghci> "Hello, 世界"
"Hello, 世界"
ghci> "😀"
"😀"

Another possibility is to extend the `Show` class with two new methods and their default implementations:

    class Show a where
        ...
        showsPrecUnicode :: Int -> a -> ShowS
        showsPrecUnicode = showsPrec

        showListUnicode :: [a] -> ShowS
        showListUnicode = showList

    showUnicode :: Show a => a -> String
    showUnicode x = showsPrecUnicode 0 x ""

at which point a small number of instances can override `showsPrecUnicode` and `showListUnicode`:

    instance Show a => Show [a] where
        showsPrec _ = showList
        showsPrecUnicode _ = showListUnicode

    instance Show Char where
        showsPrecUnicode = ... -- Unicode char
        showListUnicode = ... -- Unicode string

    instance Show Text where
        showsPrecUnicode = ... -- Unicode text

Once these are implemented, "ghci" can be modified to use `showUnicode` instead of `show`, with no new incompatibilities elsewhere.
We can also introduce `uprint = putStrLn . showUnicode`, ...
This would still require explicit opt-in to use the Unicode show, but it would be available for all `Show` instances, and used by default in "ghci".
It would still be a good idea to implement `Render`, which is a related but separate concern.

On Sun, Jul 11, 2021 at 03:11:02PM +0300, Oleg Grenrus wrote:
This is a tricky design. Any instance in the composition which doesn't define showsPrecUnicode will ruin formatting of inner Strings. Maybe it's not as bad as in aeson* as most Show instances are derived. (But not Show1 etc.)
Yes, if a manually defined instance uses `show` on a substructure that contains strings, the strings will be rendered with Unicode characters escaped. As you note, the ability to retain Unicode representations of nested structures would depend on either derived instances or tweaks to manual instances to also define the Unicode-preserving methods.

This would not be perfect, but would avoid disruption, and so should have a much easier path to deployment. Over time the instances that fail to preserve Unicode strings can be refined.

If tests in various packages are incrementally changed to use `showUnicode` to verify expected output, eventually we may be able to make `show` also Unicode-preserving, with much less disruption.

This can provide a migration path over O(10yr) from the status quo to a Unicode-friendly `show`.

-- Viktor.
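The fallback behaviour discussed in this exchange can be illustrated with a small sketch. All names here (`ShowU`, `showsPrecU`, `showU`, `showStringU`, `Legacy`, `Updated`) are hypothetical, and the class is a cut-down stand-in for the proposed methods rather than a change to `Show` itself. `Legacy` models a hand-written instance that predates the new method; `Updated` models the same instance after the tweak.

```haskell
module Main (main) where

import Data.Char (isPrint, showLitChar)
import System.IO (hSetEncoding, stdout, utf8)

-- Cut-down, hypothetical version of the proposed class: one extra method
-- whose default falls back to the escaping showsPrec.
class Show a => ShowU a where
  showsPrecU :: Int -> a -> ShowS
  showsPrecU = showsPrec

showU :: ShowU a => a -> String
showU x = showsPrecU 0 x ""

-- Unicode-preserving rendering of a string literal (illustrative only).
showStringU :: String -> ShowS
showStringU s = showChar '"' . foldr ((.) . lit) id s . showChar '"'
  where
    lit c
      | c == '"'                = showString "\\\""
      | c > '\DEL' && isPrint c = showChar c
      | otherwise               = showLitChar c

-- A hand-written instance that predates the new method.
newtype Legacy = Legacy String
instance Show Legacy where
  showsPrec d (Legacy s) =
    showParen (d > 10) (showString "Legacy " . showsPrec 11 s)
instance ShowU Legacy  -- no override: silently falls back to showsPrec

-- The same shape, updated to define the Unicode-preserving method too.
newtype Updated = Updated String
instance Show Updated where
  showsPrec d (Updated s) =
    showParen (d > 10) (showString "Updated " . showsPrec 11 s)
instance ShowU Updated where
  showsPrecU d (Updated s) =
    showParen (d > 10) (showString "Updated " . showStringU s)

main :: IO ()
main = do
  hSetEncoding stdout utf8
  putStrLn (showU (Legacy "мир"))   -- inner string is still escaped
  putStrLn (showU (Updated "мир"))  -- inner string keeps its Unicode
```

Nothing warns about the `Legacy` case; it compiles and quietly produces escaped output, which is the silent-default hazard raised above and the reason instances would have to be refined over time.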
participants (10)
- Bardur Arantsson
- Henning Thielemann
- Hécate
- Julian Ospald
- Kai Ma
- Oleg Grenrus
- Ruben Astudillo
- Tikhon Jelvis
- Travis Cardwell
- Viktor Dukhovni