
On Tue, 2008-02-26 at 14:18 +0000, Simon Marlow wrote:
Simon Marlow wrote:
Duncan Coutts wrote:
Let's call this one proposal 0:
* Haskell98 file IO should always use UTF-8.
* Haskell98 IO to terminals should use the current locale encoding.
and the others:
1. all text I/O is in the locale encoding (what C and Hugs do)
2. stdin/stdout/stderr and terminals are always in the locale encoding, everything else is UTF-8
3. everything is UTF-8
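
To make the four proposals concrete, here is a minimal sketch of how each one would choose an encoding for a given handle. It assumes a GHC whose System.IO exports hSetEncoding, localeEncoding, utf8 and hIsTerminalDevice. The Proposal type and the encodingFor function are hypothetical names made up for illustration, not part of any library, and "terminal" is taken to mean whatever hIsTerminalDevice reports.

```haskell
import System.IO

-- Hypothetical names for proposals 0-3 as numbered in this thread.
data Proposal = Proposal0 | Proposal1 | Proposal2 | Proposal3
  deriving (Eq, Show)

-- Is this one of the three standard handles?
isStdHandle :: Handle -> Bool
isStdHandle h = h `elem` [stdin, stdout, stderr]

-- Pick the text encoding a handle would get under each proposal.
encodingFor :: Proposal -> Handle -> IO TextEncoding
-- 0: terminals use the locale encoding, everything else is UTF-8
encodingFor Proposal0 h = do
  tty <- hIsTerminalDevice h
  return (if tty then localeEncoding else utf8)
-- 1: all text I/O is in the locale encoding
encodingFor Proposal1 _ = return localeEncoding
-- 2: std handles and terminals use the locale, everything else is UTF-8
encodingFor Proposal2 h = do
  tty <- hIsTerminalDevice h
  return (if tty || isStdHandle h then localeEncoding else utf8)
-- 3: everything is UTF-8
encodingFor Proposal3 _ = return utf8

main :: IO ()
main = do
  enc <- encodingFor Proposal0 stdout
  hSetEncoding stdout enc
  -- Report the choice on stderr so it does not mix with the output.
  hPutStrLn stderr ("stdout encoding: " ++ show enc)
  putStrLn "αβγδεζηθικλ"
```

Running the compiled program directly and then piping it through cat shows how proposal 0 picks a different encoding for the same stdout depending on what it is attached to.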
So it's clear that all these solutions have some downsides. We have to decide what is more important.

Let me try and summarise: basically we can be consistent with the OS environment, or consistent with other Haskell systems in other environments, or try to get some mixture of the two. It is pretty clear, however, that trying to get a mixture still leads to some inconsistency with the OS environment.

* "status quo" (what ghc/hugs do now)

  This gives consistency with the OS environment with hugs and jhc but not ghc, nhc or yhc. It gives consistency between Haskell programs (using the same Haskell implementation) on different platforms for ghc and nhc but not for hugs or jhc. There is no consistency between Haskell implementations.

* "always locale" (solution 1 above)

  This gives us consistency with the OS environment. All of the shell snippets people have posted work with this. The main disadvantage is that files moved between systems may be interpreted differently.

* "always utf8" (solution 3 above)

  This gives consistency between Haskell programs across platforms. The main disadvantage is that it is very unhelpful if the locale is not UTF8. It fails the "putStr" test of printing string literals to the terminal.

* "mixture A" (solution 0 above)

  The input/output format changes depending on the device. prog | cat prints junk in non-UTF8 locales.

* "mixture B" (solution 2 above)

  The output format changes depending on the device. prog in behaves differently to prog < in.

And some examples people have noted:

* putStr "αβγδεζηθικλ"

  That is just printing a string literal to the console/terminal. Now that major implementations support Unicode .hs source files it's kind of nice if this works.

  This works with "always locale", "mixture A" and "mixture B" above. This fails for "status quo" with ghc (but works for hugs) and fails for "always utf8" unless the locale happens to be utf8.

* ./prog vs ./prog | cat

  That is, piping the output of a Haskell program through cat and printing the result to a terminal produces the same output as displaying the program output directly.

  This works with "always locale" and "mixture B" and fails with "mixture A". With "always utf8" and with "status quo" it has the property that it consistently produces the same junk on the terminal, which some people see as a bonus (when not in a utf8 or latin1 locale respectively).

* ./prog vs ./prog >file; cat file

  This is another variation on the above and it has the same failures.

* ./prog in vs ./prog < in

  That is, reading a file given as a command line arg via readFile gives the same result as reading stdin that has been redirected from the same file.

  This works with "always locale" and "mixture A" and fails with "mixture B". This is the dual of the previous two examples. This fails with "always utf8" and with "status quo" when the file was produced by another text processing program from the same environment (eg a generic text editor).

* ./foo vs ./foo | hexdump -C

  The output bytes sent to the terminal are exactly the same as what we see when the output is piped to a program to examine those bytes.

  This fails for "mixture A" and works for all the others. It works in the strict sense that the bytes are the same, not in the sense that the text output is readable.

So the problem with the mixture approaches is that the terminal, files and pipes are all really interchangeable, so we can find surprising inconsistencies within the same OS environment.

The problem with "always utf8" is that it's never right unless the locale is set to utf8.

As a data point, Java and Python use "always locale" as the default if you don't specify an encoding when opening a text stream.

I think personally I'm coming round to the "always locale" point of view. We already have no cross-platform consistency for text files because of the lf vs cr/lf issue, and we have no cross-implementation consistency.

Duncan