
Duncan Coutts wrote:
From the H98 report:
All I/O functions defined here are character oriented. [...] These functions cannot be used portably for binary I/O.
In the following, recall that String is a synonym for [Char] (Section 6.1.2).
So ordinary text Handles are for text, not binary. Char is of course a Unicode code point.
The crucial question of course is what encoding of text to use. For the H98 IO functions we cannot set it as a parameter, so we have to pick a sensible default. Currently different implementations disagree on that default. Hugs has for some time used the current locale on POSIX systems (and I'm guessing the current code page on Windows). GHC has always used the Latin-1 encoding.
These days, most operating systems use a locale/codepage encoding that covers the full Unicode range. So on Hugs we get the benefit of that, but on GHC we do not.
This is endlessly surprising for beginners. They do putStrLn "αβγδεζηθικλ" and it comes out on their terminal as junk.
It also causes problems for serious programs, see for example the recent hand-wringing on cabal-devel.
So here is a concrete proposal:
* Haskell98 file IO should always use UTF-8.
* Haskell98 IO to terminals should use the current locale encoding.
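To make the two rules concrete, here is a minimal sketch of the default they describe. It assumes a present-day GHC, whose System.IO exports hSetEncoding, hIsTerminalDevice, utf8 and localeEncoding; none of these are part of Haskell 98 or of the proposal itself, and proposedEncoding is a hypothetical helper name used only for illustration.

```haskell
import System.IO

-- Hypothetical helper (not a library function): choose the encoding the
-- proposal would assign to a Handle by default.
--   * terminals      -> current locale encoding
--   * everything else -> UTF-8
proposedEncoding :: Handle -> IO TextEncoding
proposedEncoding h = do
  isTty <- hIsTerminalDevice h
  return (if isTty then localeEncoding else utf8)

main :: IO ()
main = do
  -- Apply the proposed default to stdout by hand, then print some Greek.
  enc <- proposedEncoding stdout
  hSetEncoding stdout enc
  putStrLn "αβγδεζηθικλ"
```

Run directly in a terminal, this writes in the locale encoding; with stdout redirected to a file or a pipe, it writes UTF-8.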
While I support Duncan's proposal (we discussed it on IRC), I thought I should point out some of the ramifications of this, and the alternatives.

If everything that is not a terminal uses UTF-8 by default, then shell commands may behave in an unexpected way. For example, for a Haskell program "prog", "prog | cat" will output in UTF-8, and if your locale encoding is something other than UTF-8 you'll see junk. Similarly, "prog >file; cat file" will give the same (wrong) result.

So some alternatives that fix this are

1. all text I/O is in the locale encoding (what C and Hugs do)
2. stdin/stdout/stderr and terminals are always in the locale encoding, everything else is UTF-8
3. everything is UTF-8

(1) has the advantage of being easy to understand, but causes problems when you want to move a file created on one system to another system, or share files between users. The programmer in this case has to anticipate the problem and set an encoding (and we're not proposing to provide a way to specify encodings, yet, so openBinaryFile and a separate UTF-8 step would be required).

(2) has a sort of "do what I want" feel, and will almost certainly cause confusion in some cases, simply because it's an arbitrary choice.

(3) is easy to understand, but does the wrong thing for people who have a locale encoding other than UTF-8.

Duncan's proposal occupies a useful point: text that we know to be ephemeral, because it is being sent to a terminal, is definitely sent in the user's default encoding. Text that might be persistent or might be crossing a locale boundary is always written in UTF-8, which is good for interchange and portability. The catch is that sometimes we identify a Handle as persistent when it is really ephemeral.

Note that sensible people who set their locale to UTF-8 are not affected by any of this - and that includes most new installations of Linux these days, I believe.

Cheers,
Simon
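For reference, the "openBinaryFile and a separate UTF-8 step" workaround mentioned under alternative (1) might look roughly like the sketch below. The names encodeChar, encodeUtf8 and writeFileUtf8 are hypothetical, the encoder is written by hand rather than taken from a library, and it does not reject surrogate code points. It relies on a binary Handle writing each Char in the range 0..255 as a single byte.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import System.IO

-- Encode one code point as a list of UTF-8 byte values (0..255).
encodeChar :: Char -> [Int]
encodeChar c
  | n < 0x80    = [n]
  | n < 0x800   = [0xC0 .|. (n `shiftR` 6), cont n]
  | n < 0x10000 = [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  | otherwise   = [ 0xF0 .|. (n `shiftR` 18), cont (n `shiftR` 12)
                  , cont (n `shiftR` 6), cont n ]
  where
    n = ord c
    -- Continuation byte: 10xxxxxx carrying the low six bits of x.
    cont x = 0x80 .|. (x .&. 0x3F)

-- The separate UTF-8 step: String -> byte values.
encodeUtf8 :: String -> [Int]
encodeUtf8 = concatMap encodeChar

-- Write a String as UTF-8 regardless of the locale, via a binary Handle.
writeFileUtf8 :: FilePath -> String -> IO ()
writeFileUtf8 path s = do
  h <- openBinaryFile path WriteMode
  hPutStr h (map chr (encodeUtf8 s))
  hClose h
```

A call such as writeFileUtf8 "greek.txt" "αβγ" (the file name is only an example) would then produce the same bytes on every system, which is the portability that alternative (1) otherwise leaves to the programmer.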