Re: H98 Text IO

27 Feb 2008

      On Tue, 2008-02-26 at 14:18 +0000, Simon Marlow wrote:
...
Simon Marlow wrote:
...
Duncan Coutts wrote:
Let's call this one proposal 0:
...
...
      * Haskell98 file IO should always use UTF-8.
      * Haskell98 IO to terminals should use the current locale
        encoding.
and the others:
...
1. all text I/O is in the locale encoding (what C and Hugs do)
2. stdin/stdout/stderr and terminals are always in the locale
     encoding, everything else is UTF-8
3. everything is UTF-8
So it's clear that all these solutions have some downsides. We have to
decide what is more important.

Let me try and summarise:

Basically, we can be consistent with the OS environment, or consistent
with other Haskell systems in other environments, or try to get some
mixture of the two. It is pretty clear, however, that trying to get a
mixture still leads to some inconsistency with the OS environment.

      * "status quo" (what ghc/hugs do now)
        This gives consistency with the OS environment with hugs and jhc
        but not ghc, nhc or yhc. It gives consistency between Haskell
        programs (using the same Haskell implementation) on different
        platforms for ghc and nhc but not for hugs or jhc. There is no
        consistency between Haskell implementations.

      * "always locale" (solution 1 above)
        This gives us consistency with the OS environment. All of the
        shell snippets people have posted work with this. The main
        disadvantage is that files moved between systems may be
        interpreted differently.

      * "always utf8" (solution 3 above)
        This gives consistency between Haskell programs across
        platforms. The main disadvantage is that it is very unhelpful if
        the locale is not UTF8. It fails the "putStr" test of printing
        string literals to the terminal.

      * "mixture A" (solution 0 above)
        The input/output encoding changes depending on the device.
        ./prog | cat prints junk in non-UTF8 locales.

      * "mixture B"  (solution 2 above)
        The output format changes depending on the device. ./prog in
        behaves differently to ./prog < in.
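
As a rough illustration of what "depending on the device" means for
mixture A, here is a minimal sketch. It assumes GHC's System.IO
extensions hIsTerminalDevice, hSetEncoding, localeEncoding and utf8,
none of which is part of Haskell98. The names pickMixtureA and
mixtureAEncoding are hypothetical, made up for this example.

```haskell
import System.IO

-- Hypothetical helper: the encoding "mixture A" would pick, given
-- whether the handle is attached to a terminal.
pickMixtureA :: Bool -> TextEncoding
pickMixtureA isTerminal = if isTerminal then localeEncoding else utf8

-- Hypothetical helper: ask the OS about the handle, then apply the rule.
mixtureAEncoding :: Handle -> IO TextEncoding
mixtureAEncoding h = fmap pickMixtureA (hIsTerminalDevice h)

main :: IO ()
main = do
  enc <- mixtureAEncoding stdout
  hSetEncoding stdout enc
  -- On a terminal this is written in the locale encoding; when piped
  -- or redirected it is written as UTF-8.
  putStrLn "αβγδεζηθικλ"
```

Running this directly and then as ./prog | cat in a non-UTF8 locale
sends different bytes to the same terminal, which is exactly the
inconsistency noted for mixture A.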

And some examples people have noted:

      * putStr "αβγδεζηθικλ"
        That is just printing a string literal to the console/terminal.
        Now that major implementations support Unicode .hs source files
        it's kind of nice if this works.

        This works with "always locale" and "mixture A" and "mixture B"
        above. This fails for "status quo" with ghc (but works for hugs)
        and fails for "always utf8" unless the locale happens to be
        utf8.

      * ./prog  vs  ./prog | cat
        That is, piping the output of a Haskell program through cat and
        printing the result to a terminal produces the same output as
        displaying the program output directly.

        This works with "always locale" and "mixture B" and fails with
        "mixture A". With "always utf8" and with "status quo" (when not
        in a utf8 or latin1 locale respectively) it consistently
        produces the same junk on the terminal, which some people see
        as a bonus.

      * ./prog  vs  ./prog >file; cat file
        This is another variation on the above and it has the same
        failures.

      * ./prog in  vs  ./prog < in
        That is, reading a file given as a command line arg via readFile
        gives the same result as reading stdin that has been redirected
        from the same file.

        This works with "always locale" and "mixture A" and fails with
        "mixture B". This is the dual of the previous two examples. This
        fails with "always utf8" and with "status quo" when the file was
        produced by another text processing program from the same
        environment (eg a generic text editor).

      * ./foo  vs  ./foo | hexdump -C
        The output bytes sent to the terminal are exactly the same as
        those we see when piping to a program that examines the bytes.

        This fails for "mixture A" and works for all the others. It
        works in the strict sense that the bytes are the same, not in
        the sense that the text output is readable.

So the problem with the mixture approaches is that the terminal, files
and pipes are all really interchangeable, so we can find surprising
inconsistencies within the same OS environment.
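
The pipe and redirect examples above can be checked against a small
model of the four proposals. This is only a sketch: the types Policy,
Device, Origin and Encoding and the functions below are hypothetical
names for this illustration, and "status quo" is left out because it
differs per implementation.

```haskell
-- Hypothetical model of proposals 0-3; not taken from any library.
data Policy = AlwaysLocale | AlwaysUtf8 | MixtureA | MixtureB
  deriving (Show, Eq, Enum, Bounded)

-- What the handle is attached to.
data Device = Terminal | Pipe | RegularFile
  deriving (Show, Eq)

-- How the program got the handle: stdin/stdout/stderr, or opened by name.
data Origin = StdStream | OpenedByName
  deriving (Show, Eq)

data Encoding = Locale | Utf8
  deriving (Show, Eq)

encodingFor :: Policy -> Origin -> Device -> Encoding
encodingFor AlwaysLocale _ _               = Locale
encodingFor AlwaysUtf8   _ _               = Utf8
-- Mixture A (proposal 0): terminals use the locale, all else UTF-8.
encodingFor MixtureA     _ Terminal        = Locale
encodingFor MixtureA     _ _               = Utf8
-- Mixture B (proposal 2): std streams and terminals use the locale.
encodingFor MixtureB     StdStream _       = Locale
encodingFor MixtureB     _         Terminal = Locale
encodingFor MixtureB     _         _       = Utf8

-- ./prog  vs  ./prog | cat : stdout on a terminal vs stdout on a pipe.
pipeTestOk :: Policy -> Bool
pipeTestOk p =
  encodingFor p StdStream Terminal == encodingFor p StdStream Pipe

-- ./prog in  vs  ./prog < in : readFile vs stdin redirected from a file.
redirectTestOk :: Policy -> Bool
redirectTestOk p =
  encodingFor p OpenedByName RegularFile == encodingFor p StdStream RegularFile

main :: IO ()
main = mapM_ report [minBound .. maxBound]
  where
    report p = putStrLn (show p ++ ": pipe=" ++ show (pipeTestOk p)
                                ++ ", redirect=" ++ show (redirectTestOk p))
```

In this model pipeTestOk is False only for MixtureA and redirectTestOk
is False only for MixtureB. Note that the model checks only internal
consistency, so AlwaysUtf8 passes both even though its bytes may still
be wrong for the locale.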

The problem with "always utf8" is that it's never right unless the
locale is set to utf8.
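
To make the byte-level mismatch concrete, here is a small hand-written
UTF-8 encoder. It is a sketch for illustration only (utf8Encode is a
hypothetical name, and it does not reject surrogate code points); it
just shows which bytes "always utf8" would emit for the string literal
example regardless of the locale.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)
import Numeric (showHex)

-- Hypothetical helper: encode one character as UTF-8 bytes.
utf8Encode :: Char -> [Word8]
utf8Encode c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6,  cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading bits of the code point, placed in the first byte.
    hi :: Int -> Word8
    hi s = fromIntegral (n `shiftR` s)
    -- A continuation byte: 10xxxxxx carrying six bits of the code point.
    cont :: Int -> Word8
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

main :: IO ()
main = putStrLn (unwords [showHex b "" | b <- concatMap utf8Encode "αβγ"])
```

Each Greek letter becomes two bytes (α is 0xCE 0xB1), so a terminal
expecting a single-byte locale encoding will show two unrelated
characters per letter.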

As a data point, Java and Python use "always locale" as the default if
you don't specify an encoding when opening a text stream.

I think personally I'm coming round to the "always locale" point of
view. We already have no cross-platform consistency for text files
because of the lf vs cr/lf issue, and we have no cross-implementation
consistency either.

Duncan