H98 Text IO

26 Feb 2008

...
From the H98 report:
        All I/O functions defined here are character oriented. [...]
        These functions cannot be used portably for binary I/O.

        In the following, recall that String is a synonym for [Char]
        (Section 6.1.2).

So ordinary text Handles are for text, not binary. Char is of course a
Unicode code point.

The crucial question of course is what encoding of text to use. For the
H98 IO functions we cannot set it as a parameter, so we have to pick a
sensible default. Currently different implementations disagree on that
default. Hugs has for some time used the current locale on POSIX systems
(and I'm guessing the current code page on Windows). GHC has always used
the Latin-1 encoding.

These days, most operating systems use a locale/codepage encoding that
covers the full Unicode range. So on Hugs we get the benefit of that but
on GHC we do not.

This is endlessly surprising for beginners. They do
putStrLn "αβγδεζηθικλ"
and it comes out on their terminal as junk.
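
To make the failure concrete, here is a minimal sketch (the helper name
fitsLatin1 is made up for illustration) that checks which characters of
that string Latin-1 could represent at all. Latin-1 covers only code
points U+0000 to U+00FF, so none of the Greek letters fit, whatever the
terminal itself is capable of displaying.

```haskell
import Data.Char (ord)

-- Hypothetical helper: Latin-1 covers exactly the code points
-- U+0000 to U+00FF, one byte per character.
fitsLatin1 :: Char -> Bool
fitsLatin1 c = ord c <= 0xFF

main :: IO ()
main = do
  let greek = "αβγδεζηθικλ"
  -- Print the code points as numbers so that the output is plain ASCII
  -- and does not itself depend on the handle's encoding.
  mapM_ (print . ord) greek
  -- Report whether any character of the string is representable in Latin-1.
  print (any fitsLatin1 greek)
```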

It also causes problems for serious programs, see for example the recent
hand-wringing on cabal-devel.

So here is a concrete proposal:

      * Haskell98 file IO should always use UTF-8.
      * Haskell98 IO to terminals should use the current locale
        encoding.

The main controversial point I think is whether to always use UTF-8 or
always use the current locale or some split as I've suggested. C chose
to always go with the current locale. Some people think that was a
mistake because the interpretation changes from user to user.

For terminals it is more clear cut that the locale is the right choice
because that is what the terminal is capable of displaying. Using
anything else will produce junk. We can detect if a handle is a terminal
when we open it using hIsTerminalDevice. This should be done
automatically (and ghc would get it for free because it already does
that check to determine default buffering modes).

Sockets and pipes would be treated the same as files when opened in the
default text mode. The only special case is terminals.
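
The rule is small enough to sketch in code. This is only an
illustration of the proposal, not how any implementation does it: the
type DefaultEncoding and the functions proposedDefault and
applyProposedDefault are hypothetical names, and the sketch assumes a
System.IO that provides hSetEncoding, localeEncoding and utf8 for
setting the encoding on a handle by hand.

```haskell
import System.IO

-- Hypothetical type naming the two defaults in the proposal.
data DefaultEncoding = UseLocale | UseUTF8
  deriving (Eq, Show)

-- The proposed rule as a pure function of "is this handle a terminal?".
-- Files, sockets and pipes all take the non-terminal branch.
proposedDefault :: Bool -> DefaultEncoding
proposedDefault isTerminal
  | isTerminal = UseLocale
  | otherwise  = UseUTF8

-- Assumption: hSetEncoding, localeEncoding and utf8 are available
-- from System.IO.
toTextEncoding :: DefaultEncoding -> TextEncoding
toTextEncoding UseLocale = localeEncoding
toTextEncoding UseUTF8   = utf8

-- Apply the rule to an already opened text handle and report the choice.
applyProposedDefault :: Handle -> IO DefaultEncoding
applyProposedDefault h = do
  isTerm <- hIsTerminalDevice h
  let choice = proposedDefault isTerm
  hSetEncoding h (toTextEncoding choice)
  return choice

main :: IO ()
main = do
  choice <- applyProposedDefault stdout
  hPutStrLn stderr ("stdout would use: " ++ show choice)
  putStrLn "αβγδεζηθικλ"
```

Running it once directly in a terminal and once with stdout redirected
to a file should exercise both branches.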

The major problem is with code that assumes GHC's Handles are
essentially Word8 and layers its own UTF-8 or other decoding over the
top. The utf8-string package has this problem for example. Such code
should be using openBinaryFile because it is reading/writing binary
data, not String text.
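
As a rough sketch of what that looks like, the following does its own
UTF-8 encoding and writes the resulting bytes through a handle opened
with openBinaryFile. The names encodeChar and writeUtf8File and the
output file name are made up for illustration. It relies on the
assumption that a binary handle writes each Char below 256 as a single
byte, and it makes no attempt to reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)
import System.IO

-- Hypothetical helper: encode one code point as UTF-8 bytes.
-- Surrogates (U+D800 to U+DFFF) are not rejected in this sketch.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [byte n]
  | n < 0x800   = [0xC0 .|. byte (n `shiftR` 6), cont 0]
  | n < 0x10000 = [0xE0 .|. byte (n `shiftR` 12), cont 6, cont 0]
  | otherwise   = [0xF0 .|. byte (n `shiftR` 18), cont 12, cont 6, cont 0]
  where
    n = ord c
    byte :: Int -> Word8
    byte = fromIntegral
    -- Continuation byte: 10xxxxxx carrying six bits taken at shift s.
    cont :: Int -> Word8
    cont s = 0x80 .|. byte ((n `shiftR` s) .&. 0x3F)

-- Write a String as UTF-8 through a binary handle. Every Char passed
-- to hPutStr here is below 256 and stands for exactly one byte.
writeUtf8File :: FilePath -> String -> IO ()
writeUtf8File path s = do
  h <- openBinaryFile path WriteMode
  hPutStr h (map (chr . fromIntegral) (concatMap encodeChar s))
  hClose h

main :: IO ()
main = writeUtf8File "h98-text-io-demo.txt" "αβγδεζηθικλ\n"
```

Because the handle is binary, changing the default encoding of text
handles would not affect code written this way.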

Note that many programs that really need to work with binary files
already use openBinaryFile. Those that do not are already broken on
Windows, which does cr/lf conversion on text files, and that breaks many
binary formats (though not UTF-8).
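
A small model of that conversion shows why. The function lfToCrLf below
is a hypothetical stand-in for what text mode does on output on Windows,
not the actual implementation: it expands every LF byte to CR LF. A
binary length field that happens to contain the byte 10 is changed,
while UTF-8 text is not damaged in the same way, because in UTF-8 the
byte 0x0A only ever occurs as the newline character itself.

```haskell
import Data.Word (Word8)

-- Hypothetical model of text-mode output on Windows:
-- every LF (0x0A) byte becomes CR LF (0x0D 0x0A).
lfToCrLf :: [Word8] -> [Word8]
lfToCrLf = concatMap expand
  where
    expand 0x0A = [0x0D, 0x0A]
    expand b    = [b]

main :: IO ()
main = do
  -- A 32-bit big-endian length field holding the value 10:
  -- four bytes go in, five bytes come out.
  print (lfToCrLf [0, 0, 0, 10])
  -- UTF-8 bytes for "α" followed by a newline: only the real
  -- newline is expanded, the encoded character is untouched.
  print (lfToCrLf [0xCE, 0xB1, 0x0A])
```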

So we have to decide which is more painful, keeping a limited text IO
system in GHC or breaking some existing programs which assume GHC's
current behaviour.

Opinions?

Please can we keep this discussion to the interpretation of the H98 IO
functions and not get into the separate discussion of how we could
extend or redesign the whole IO system. This is a question of what the
right defaults are.

Duncan

Duncan Coutts