
From the H98 report:
    All I/O functions defined here are character oriented. [...] These functions cannot be used portably for binary I/O. In the following, recall that String is a synonym for [Char] (Section 6.1.2).

So ordinary text Handles are for text, not binary. Char is of course a Unicode code point.

The crucial question of course is what encoding of text to use. For the H98 IO functions we cannot set it as a parameter, so we have to pick a sensible default.

Currently different implementations disagree on that default. Hugs has for some time used the current locale on POSIX systems (and I'm guessing the current code page on Windows). GHC has always used the Latin-1 encoding.

These days, most operating systems use a locale/codepage encoding that covers the full Unicode range. So on Hugs we get the benefit of that but on GHC we do not. This is endlessly surprising for beginners. They do putStrLn "αβγδεζηθικλ" and it comes out on their terminal as junk. It also causes problems for serious programs; see for example the recent hand-wringing on cabal-devel.

So here is a concrete proposal:

 * Haskell98 file IO should always use UTF-8.
 * Haskell98 IO to terminals should use the current locale encoding.

The main controversial point I think is whether to always use UTF-8, or always use the current locale, or some split as I've suggested.

C chose to always go with the current locale. Some people think that was a mistake because the interpretation of a file then changes from user to user.

For terminals it is more clear cut that the locale is the right choice, because that is what the terminal is capable of displaying. Using anything else will produce junk.

We can detect whether a handle is a terminal when we open it, using hIsTerminalDevice. This should be done automatically (and GHC would get it for free because it already does that check to determine default buffering modes).

Sockets and pipes would be treated the same as files when opened in the default text mode. The only special case is terminals.
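To make the proposed split concrete, here is a rough sketch of the policy written as ordinary library code. It assumes a GHC whose System.IO exports hSetEncoding, utf8 and localeEncoding; none of these are part of H98. The names proposedEncoding, applyProposedEncoding and openTextFile are made up for illustration. In the actual proposal the choice would be made inside openFile itself rather than by a wrapper.

```haskell
import System.IO

-- Hypothetical helper: the encoding the proposal would give a handle.
-- Terminals get the current locale encoding; files, sockets and pipes
-- get UTF-8.
proposedEncoding :: Handle -> IO TextEncoding
proposedEncoding h = do
  isTerm <- hIsTerminalDevice h
  return (if isTerm then localeEncoding else utf8)

-- Hypothetical helper: apply that default to an already-open handle.
applyProposedEncoding :: Handle -> IO ()
applyProposedEncoding h = proposedEncoding h >>= hSetEncoding h

-- Hypothetical wrapper standing in for what openFile would do
-- automatically under the proposal.
openTextFile :: FilePath -> IOMode -> IO Handle
openTextFile path mode = do
  h <- openFile path mode
  applyProposedEncoding h
  return h
```

With something like this, applyProposedEncoding stdout followed by putStrLn "αβγδεζηθικλ" would go out in whatever encoding the terminal's locale specifies, while a handle from openTextFile on a regular file would always be written as UTF-8 regardless of the user's locale.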
The major problem is with code that assumes GHC's Handles are essentially Word8 and layers its own UTF-8 or other decoding over the top. The utf8-string package has this problem, for example. Such code should be using openBinaryFile because it is reading/writing binary data, not String text.

Note that many programs that really need to work with binary files already use openBinaryFile; those that do not are already broken on Windows, which does CR/LF conversion on text files, and that breaks many binary formats (though not UTF-8).

So we have to decide which is more painful: keeping a limited text IO system in GHC, or breaking some existing programs which assume GHC's current behaviour.

Opinions?

Please can we keep this discussion to the interpretation of the H98 IO functions and not get into the separate discussion of how we could extend or redesign the whole IO system. This is a question of what the right defaults are.

Duncan