[GHC] #10762: On Windows, out-of-codepage characters can cause GHC build to fail

#10762: On Windows, out-of-codepage characters can cause GHC build to fail
-----------------------------------------+---------------------------------
Reporter: snoyberg | Owner:
Type: bug | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 7.10.2
Keywords: | Operating System: Windows
Architecture: x86_64 (amd64) | Type of failure: None/Unknown
Test Case: | Blocked By:
Blocking: | Related Tickets:
Differential Revisions: |
-----------------------------------------+---------------------------------
You can see where this hit us recently on stack with issues [https://github.com/commercialhaskell/stack/issues/738 738] and [https://github.com/commercialhaskell/stack/issues/734 734].

To demonstrate, I'm attaching a UTF-8 encoded Haskell program with some Hebrew characters, and some warnings. The contents of that file are:

{{{#!hs
module Main
    ( main
    , שלום
    ) where

main :: IO ()
main = putStrLn שלום

שלום = "shalom"
}}}

If I first set my codepage to 65001 (UTF-8), everything works as expected:

{{{
C:\Users\Michael\Desktop>chcp 65001
Active code page: 65001

C:\Users\Michael\Desktop>ghc -fforce-recomp -Wall -ddump-hi -ddump-to-file shalom.hs
[1 of 1] Compiling Main ( shalom.hs, shalom.o )

shalom.hs:9:1: Warning:
    Top-level binding with no type signature: שלום :: [Char]
Linking shalom.exe ...
}}}

However, if I set my codepage to 437 (US), both the warnings sent to the console, and the .hi dump file, cause GHC to exit prematurely:

{{{
C:\Users\Michael\Desktop>chcp 437
Active code page: 437

C:\Users\Michael\Desktop>ghc -fforce-recomp -Wall shalom.hs
[1 of 1] Compiling Main ( shalom.hs, shalom.o )

shalom.hs:9:1: Warning:
    Top-level binding with no type signature: <stderr>: commitBuffer: invalid argument (invalid character)
}}}

{{{
C:\Users\Michael\Desktop>chcp 437
Active code page: 437

C:\Users\Michael\Desktop>ghc -fforce-recomp -ddump-hi -ddump-to-file shalom.hs
[1 of 1] Compiling Main ( shalom.hs, shalom.o )
shalom.dump-hi: commitBuffer: invalid argument (invalid character)
}}}

At the very least, I would argue that -ddump-to-file should always dump to the output files as UTF-8, as this is the most useful for tooling. Beyond that, there are a few options here:

 * Have all output- including to the console- go out as UTF-8. This may not play terribly nicely with consoles without setting the output codepage.
 * Provide a command line option or environment variable to specify "output as UTF-8."
 * More radical: change the default way that all Handles work so that UTF-8 is the default, instead of paying attention to code pages and environment variables. Honestly, this is my preference, but it's a bigger discussion than this one bug.

The workaround we've implemented in stack for now is setting the codepage to 65001 for the console while running stack. This is not ideal, since this is essentially a global setting for the entire console.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10762
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler
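The failure and the proposed fix for dump files can be sketched with the standard `System.IO` encoding API. This is only an illustration, not code from GHC or from any patch in this ticket: `writeWithEncoding` is a hypothetical helper, and `latin1` is used as a portable stand-in for a restricted code page such as 437, on the assumption that it rejects the Hebrew identifier in the same way.

```haskell
import Control.Exception (IOException, try)
import System.IO

-- | Write a string to a file using an explicitly chosen text encoding,
-- returning any I/O error instead of letting it escape.
-- (Hypothetical helper for illustration; not part of GHC.)
writeWithEncoding :: TextEncoding -> FilePath -> String -> IO (Either IOException ())
writeWithEncoding enc path contents =
  try $ withFile path WriteMode $ \h -> do
    hSetEncoding h enc   -- override the locale / code page default
    hPutStr h contents

-- | The identifier from shalom.hs; none of its characters exist in Latin-1.
shalom :: String
shalom = "שלום"
```

With these definitions, `writeWithEncoding latin1 "out.txt" shalom` should come back as a `Left` carrying an "invalid character" style error, while `writeWithEncoding utf8 "out.txt" shalom` should succeed whatever the active code page is. That is the behaviour being requested for `-ddump-to-file` output.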

Changes (by snoyberg):

 * Attachment "shalom.hs" added.

Comment (by rwbarton):

To what extent do you consider your suggestions Windows-specific? Since GHC assumes that input Haskell files are always UTF-8, one could argue that `-ddump-to-file` output should always be UTF-8 as well, on every system. However producing UTF-8-encoded console output when the locale specifies a different encoding makes no sense to me, so I hope that suggestion at least is for Windows only.

> The workaround we've implemented in stack for now is setting the codepage to 65001 for the console while running stack. This is not ideal, since this is essentially a global setting for the entire console.

It must be my lack of experience with Windows, but I don't understand why this is considered a "workaround" rather than a basic necessity for having any kind of working system.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10762#comment:1

Comment (by snoyberg):

I think that all file output from GHC should be UTF-8, regardless of OS, and regardless of locale settings. I won't make too many arguments around console output, except that even on Windows, there should be some environment variable or command line switch to force UTF-8 output. As it stands, the only way to capture output reliably from GHC on Windows is to change the codepage for the entire console.

And speaking of which: I won't profess to be a Windows expert myself, but from my rather painful research this morning: a console codepage is not something which is merely inherited by subprocesses (the way an environment variable is), but by other processes in the same console. Unless I'm misunderstanding something, the workaround we're using in stack now could have negative consequences for things like running stack/GHC from inside a text editor, where the text editor may suddenly have its code page changed on it.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10762#comment:2

Comment (by rwbarton):

(To expand on the last point: If your terminal is not actually configured to interpret console output as UTF-8 then you will not be able to read the warnings if GHC insisted on outputting UTF-8. And if it is configured to do so then why is the code page not set to 65001?)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10762#comment:3

Comment (by snoyberg):

> (To expand on the last point: If your terminal is not actually configured to interpret console output as UTF-8 then you will not be able to read the warnings if GHC insisted on outputting UTF-8. And if it is configured to do so then why is the code page not set to 65001?)

The wonders of the Windows world still mystify me, so hopefully someone more informed can chime in. However, I was able to get both the standard Windows console and ConEmu to display non-ASCII characters (limited only by my font choice) by switching codepages.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10762#comment:4

Comment (by rwbarton):

Well, I recognize that this is a pain point on Windows, so as long as nothing changes on non-Windows except, possibly, writing `-ddump-to-file` output in UTF-8 always, then whatever happens on Windows is fine with me...

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10762#comment:5

Changes (by rwbarton):

 * related: => #6037

Comment:

See also #6037.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10762#comment:6

Comment (by snoyberg):

After this discussion, here's the proposal I'd make, which is a libraries change, not just a GHC-internal change:

 * The default filesystem encoding will always be UTF-8, regardless of environment variables or codepage
 * We modify the encoding logic on Windows to also respect an environment variable (perhaps LANG, not sure) to override the codepage for stdin/stdout/stderr character encoding

One I'm on the fence about:

 * Modify the character encoding codepaths to not throw an exception on an unhandled character when using the default encoding for stdin/stdout/stderr. This would be nice in that GHC wouldn't die on printing a warning if the codepage/LANG variable is set to something unusual, but does have the property of just swallowing problematic behavior.

I'm fairly certain that will solve the major problems we have with GHC, and will fit the stack use case nicely. Does anyone else reading see a problem with this approach before I bring it up with the core libraries committee?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10762#comment:7
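The "don't throw on an unhandled character" idea can be approximated today with the `//TRANSLIT` suffix that `mkTextEncoding` in `System.IO` accepts. The following is a minimal sketch under that assumption, not the change proposed to the core libraries committee; `translitName` and `hSetTranslit` are names chosen here for illustration.

```haskell
import System.IO

-- | Derive the transliterating variant of an encoding name, e.g.
-- "CP437" becomes "CP437//TRANSLIT". Names that already carry a
-- "//..." suffix are left alone (Nothing means "no change needed").
translitName :: String -> Maybe String
translitName name
  | '/' `elem` name = Nothing
  | otherwise       = Just (name ++ "//TRANSLIT")

-- | Switch a handle's current encoding to its transliterating variant,
-- so characters the encoding cannot represent are replaced rather than
-- raising an exception. Binary handles (no encoding) are left untouched.
hSetTranslit :: Handle -> IO ()
hSetTranslit h = do
  menc <- hGetEncoding h
  case menc >>= translitName . show of
    Just name -> mkTextEncoding name >>= hSetEncoding h
    Nothing   -> pure ()
```

Calling `hSetTranslit stderr` early in `main` would let a warning that mentions `שלום` print with replacement characters under code page 437 instead of aborting the build. The trade-off is the one noted above: the encoding problem is silently swallowed.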

Changes (by snoyberg):

 * owner: => snoyberg

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10762#comment:8

Comment (by snoyberg):

Patch sent to force dump files to be UTF-8 encoded. https://phabricator.haskell.org/D1151

I'm looking into warnings now

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10762#comment:9

Comment (by snoyberg):

https://phabricator.haskell.org/D1153 also sent for the warnings, but we can follow up on that in #6037.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10762#comment:10

Comment (by Ben Gamari

Comment (by Ben Gamari

Changes (by bgamari):

 * status: new => closed
 * resolution: => fixed

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10762#comment:13

Comment (by snoyberg):

Thanks Ben for the merges. I had plans for one more potential change: setting up some environment variable that is respected by GHC on Windows to override the code page for stdout and stderr. Use case: capturing output from GHC in a build tool. My questions are:

 1. Any objection to this in principle?
 2. Do you want a new ticket opened for that?
 3. Any thoughts on the environment variable? I was thinking about just reusing LANG and checking for a value ending in .UTF-8 so that the same code could be used on Windows and non-Windows for setting the environment variable.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10762#comment:14
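The LANG-based override described in question 3 might look roughly like the sketch below. It assumes only the behaviour stated in the comment (a `LANG` value ending in `.UTF-8` forces UTF-8 on stdout and stderr); the function names are hypothetical, and this is not the code from any diff attached to this ticket.

```haskell
import Control.Monad (when)
import Data.List (isSuffixOf)
import System.Environment (lookupEnv)
import System.IO

-- | Decide whether a LANG value (if set) asks for UTF-8 output,
-- i.e. whether it ends in ".UTF-8" as in "en_US.UTF-8".
langRequestsUtf8 :: Maybe String -> Bool
langRequestsUtf8 = maybe False (".UTF-8" `isSuffixOf`)

-- | If LANG requests UTF-8, override the encoding of stdout and stderr,
-- ignoring whatever the console code page would otherwise select.
forceUtf8FromLang :: IO ()
forceUtf8FromLang = do
  lang <- lookupEnv "LANG"
  when (langRequestsUtf8 lang) $
    mapM_ (\h -> hSetEncoding h utf8) [stdout, stderr]
```

A build tool could then set `LANG=en_US.UTF-8` in the environment of the child process only, and capture its output as UTF-8 without touching the code page of the shared console.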

Comment (by snoyberg):

I've sent one final diff for this issue, implementing the environment override logic I mentioned here: https://phabricator.haskell.org/D1167

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10762#comment:15

Comment (by Ben Gamari

#10762: On Windows, out-of-codepage characters can cause GHC build to fail
----------------------------------+--------------------------------------
Reporter: snoyberg | Owner: snoyberg
Type: bug | Status: closed
Priority: normal | Milestone: 8.0.1
Component: Compiler | Version: 7.10.2
Resolution: fixed | Keywords:
Operating System: Windows | Architecture: x86_64 (amd64)
Type of failure: None/Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: #6037, #15021 | Differential Rev(s):
Wiki Page: |
----------------------------------+--------------------------------------
Changes (by nh2):

 * related: #6037 => #6037, #15021

Comment:

Related ticket for `ghc-pkg list` that also breaks build tools: #15021

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10762#comment:18

Comment (by Ben Gamari