
Glynn Clements
But the default encoding should come from the locale instead of being ISO-8859-1.
The problem with that is that, if the locale's encoding is UTF-8, a lot of stuff is going to break (i.e. anything in ISO-8859-* which isn't limited to the 7-bit ASCII subset).
What about this transition path:

1. An API for manipulating byte sequences in I/O (without representing them in the String type).
2. An API for conversion between explicitly specified encodings and byte sequences, including attaching converters to Handles. There is also a way to obtain the locale encoding.
3. The default encoding is settable from Haskell, and defaults to ISO-8859-1.
4. Libraries are reviewed to ensure that they work with various encoding settings.
5. The default encoding is still settable from Haskell, but now defaults to the locale encoding.

Points 1-3 don't change the behavior of existing programs, but they make it possible to start writing libraries and programs which manipulate something other than text in the default encoding and which will keep working in the future. Once the relevant libraries work with the default encoding changed, programs which use them can begin their main function by setting the default encoding to the locale encoding. Finally, when the libraries and programs which break under this setting are considered obsolete, the default is changed.
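For concreteness, the Handle-level pieces of points 2, 3 and 5 can be sketched with the names GHC eventually shipped in System.IO (hSetEncoding, latin1 and localeEncoding are real functions there; this is an illustration of the proposed shape, not the proposal itself):

```haskell
import System.IO

main :: IO ()
main = do
  -- Point 2: attach an explicitly chosen converter to a Handle,
  -- here forcing the byte-transparent ISO-8859-1 encoding.
  hSetEncoding stdout latin1
  -- Point 2: the locale's encoding is obtainable as a first-class value.
  print localeEncoding
  -- Point 5, per program: opt in to the locale encoding explicitly.
  hSetEncoding stdout localeEncoding
```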
The advantage of assuming ISO-8859-* is that the decoder can't fail; every possible stream of bytes is valid.
Only by assumption, and UTF-8 is not ISO-8859-*. When I someday change most of my files and filenames from ISO-8859-2 to UTF-8, and change my locale accordingly, the assumption will be wrong. I can't make that change now, because too many programs would break. The current ISO-8859-1 assumption is already wrong today: a program written in Haskell which sorts strings breaks for non-ASCII letters even now, when they are in ISO-8859-2, unless the encoding is specified otherwise.
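The sorting breakage is easy to see in a one-liner (example mine): in ISO-8859-2 the byte 0xB3 is the Polish letter that collates near 'l', but sorting by code point places it after 'z'.

```haskell
import Data.List (sort)

main :: IO ()
main = do
  -- "\179aka" holds the ISO-8859-2 bytes B3 61 6B 61.
  -- Code point 179 > 'z' (122), so code-point order puts it after
  -- "zebra", which is wrong under Polish collation.
  print (sort ["zebra", "\179aka"])
```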
The key problem with using the locale is that you frequently encounter files which aren't in the locale's encoding, and for which the encoding can't easily be deduced.
Programs should either explicitly set the encoding for I/O on these files to ISO-8859-1, or manipulate them as binary data. The problem is that the API for doing that is not even designed yet, so programs can't be written today in a way that will keep working after the default encoding changes.
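Both workarounds can be sketched with the API GHC later provided (hSetEncoding and latin1 are real System.IO functions; the file name is hypothetical). Because ISO-8859-1 maps each Char in 0..255 to exactly one byte, forcing it on a Handle round-trips arbitrary bytes losslessly:

```haskell
import System.IO

main :: IO ()
main = do
  -- Write a file containing a non-ASCII byte (0xB3) via ISO-8859-1.
  withFile "sample.txt" WriteMode $ \h -> do
    hSetEncoding h latin1
    hPutStr h "\179aka"
  -- Reading it back with ISO-8859-1 recovers the bytes unchanged.
  -- (Alternatively, hSetBinaryMode h True bypasses text decoding
  -- entirely and treats the contents as raw binary data.)
  withFile "sample.txt" ReadMode $ \h -> do
    hSetEncoding h latin1
    s <- hGetContents h
    putStrLn (if s == "\179aka" then "round-trip ok" else "mismatch")
```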
OTOH, if you assume UTF-8 (e.g. because that happens to be the locale's encoding), the decoder is likely to abort shortly after the first non-ASCII character it finds (either that, or it will just silently drop characters).
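This failure mode is easy to reproduce with GHC's strict utf8 decoder (a real System.IO TextEncoding; the file name is hypothetical): the stray 0xB3 byte, valid in ISO-8859-2, is an illegal UTF-8 sequence.

```haskell
import Control.Exception (SomeException, evaluate, try)
import System.IO

main :: IO ()
main = do
  -- Write raw ISO-8859-2 bytes: B3 61 6B 61.
  withFile "legacy.txt" WriteMode $ \h -> do
    hSetEncoding h latin1
    hPutStr h "\179aka"
  -- Read them back as UTF-8: 0xB3 is a lone continuation byte,
  -- so the strict decoder raises an "invalid byte sequence" error.
  r <- try $ withFile "legacy.txt" ReadMode $ \h -> do
         hSetEncoding h utf8
         s <- hGetContents h
         evaluate (length s)
  case (r :: Either SomeException Int) of
    Left _  -> putStrLn "decoder aborted on invalid UTF-8"
    Right _ -> putStrLn "decoded without error"
```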
Detectable errors should not be automatically silenced, so it would fail. So the change to the default encoding must come some time after it becomes possible to write programs which would not fail.

-- 
Marcin Kowalczyk <qrczak@knm.org.pl>
http://qrnik.knm.org.pl/~qrczak/