
Sven Panne
Hmmm, the Unicode tables start with ISO-Latin-1, so what exactly would break if we stipulated that the standard encoding for string I/O in Haskell is ISO-Latin-1? Additional encodings could be specified e.g. via a new "open" variant.
What would break is that most file contents are not encoded in ISO-Latin-1 in practice. The locale mechanism specifies a default encoding. That default also applies to other things: filenames (on Unix), program invocation arguments, environment variables, etc. Some other places have an encoding hardwired (e.g. Gtk+ uses UTF-8 and Qt uses UTF-16), and yet others have it specified as part of the protocol (email, Usenet, WWW).

Unfortunately, changing a Haskell implementation to actually convert between the external encodings and Unicode must be done in all those places at once. Otherwise there will be mismatches, and e.g. printing program invocation arguments to a file will produce the wrong result.

Most Haskell programs currently work because they misuse Chars to represent characters in the implicit default encoding. This works as long as they don't use isAlpha or toUpper on non-ASCII characters, and as long as they don't try to support several encodings at once.

These two paradigms:

A. Represent strings using their original encoding.
B. Use Unicode internally, convert it at the boundaries.

should not be mixed in one string type, or confusion will arise.

For at least some of these places, e.g. file contents or socket data, a program must have a way to specify a different encoding, and also a way to manipulate raw bytes without recoding. But the default encoding should come from the locale instead of being ISO-8859-1.

A Char value should always mean a Unicode code point and not e.g. an ISO-8859-2-coded value. This is the B paradigm, and it must be applied consistently.

I did this for my language http://kokogut.sourceforge.net/ and it works. Only some things are hard, e.g. reading a file whose encoding is specified inside it. Trying to apply the default encoding first might fail because of buffering, even if the text before the encoding name is all ASCII: the decoder may already have consumed bytes past the declaration. It's possible but needs care.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
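
To make paradigm B from the message above concrete, here is a minimal Haskell sketch. It assumes a GHC whose System.IO provides hSetEncoding, utf8, latin1 and localeEncoding; the helper names writeFileWith and readFileWith and the file name encoding-demo.txt are hypothetical, chosen only for illustration. It shows a String holding Unicode code points internally, with conversion happening only at the file boundary, and what goes wrong when the boundary encoding is mismatched.

```haskell
import Control.Exception (evaluate)
import System.IO

-- Hypothetical helper: write a String, encoding it explicitly at the boundary.
writeFileWith :: TextEncoding -> FilePath -> String -> IO ()
writeFileWith enc path s = withFile path WriteMode $ \h -> do
  hSetEncoding h enc
  hPutStr h s

-- Hypothetical helper: read a whole file, decoding it with the given encoding.
readFileWith :: TextEncoding -> FilePath -> IO String
readFileWith enc path = withFile path ReadMode $ \h -> do
  hSetEncoding h enc
  s <- hGetContents h
  _ <- evaluate (length s)  -- force the lazy read before the handle is closed
  return s

main :: IO ()
main = do
  let path = "encoding-demo.txt"   -- assumed scratch file in the current directory
      text = "\321\243d\378"       -- four code points: U+0141 U+00F3 U+0064 U+017A

  writeFileWith utf8 path text

  -- Same encoding on both sides: the code points survive the round trip.
  decoded <- readFileWith utf8 path
  print (map fromEnum decoded)     -- [321,243,100,378]

  -- Mismatched encoding: every UTF-8 byte becomes its own Char.
  misread <- readFileWith latin1 path
  print (length misread)           -- 7

  -- The default that handles would use; the value depends on the environment.
  print localeEncoding
```

Reading with latin1 maps each byte to the Char with the same number, so it also serves as a crude way to look at raw bytes; for real byte-level work, hSetBinaryMode or a dedicated byte-string type is the more usual choice.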