
Marcin 'Qrczak' Kowalczyk wrote:
But the default encoding should come from the locale instead of being ISO-8859-1.
The problem with that is that, if the locale's encoding is UTF-8, a lot of stuff is going to break (i.e. anything in ISO-8859-* which isn't limited to the 7-bit ASCII subset).
What about this transition path:
1. API for manipulating byte sequences in I/O (without representing them in String type).
Note that this needs to include all of the core I/O functions, not just reading/writing streams. E.g. FilePath is currently an alias for String, but (on Unix, at least) filenames are strings of bytes, not characters. Ditto for argv, environment variables, possibly other cases which I've overlooked.
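Such an API might look roughly like the following sketch. All the names here are purely illustrative (nothing like this exists in the standard libraries); the point is only that filenames and argv become byte sequences rather than Strings:

```haskell
import Data.Word (Word8)

-- A raw filename is a sequence of bytes, not a String of characters.
type RawFilePath = [Word8]

-- Byte-level counterparts of the core I/O functions would be needed;
-- the bodies here are stubs, since this is a sketch of a proposed API.
rawGetDirectoryContents :: RawFilePath -> IO [RawFilePath]
rawGetDirectoryContents _ = ioError (userError "sketch only")

rawGetArgs :: IO [RawFilePath]
rawGetArgs = ioError (userError "sketch only")
```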
2. API for conversion between explicitly specified encodings and byte sequences, including attaching converters to Handles. There is also a way to obtain the locale encoding.
3. The default encoding is settable from Haskell, defaults to ISO-8859-1.
Agreed.
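(For what it's worth, recent GHC versions do provide something of this shape in System.IO: a TextEncoding can be attached to a Handle explicitly, independent of the locale. A minimal sketch:

```haskell
import System.IO

-- Write and then read a file through Handles with an explicitly
-- attached UTF-8 converter, regardless of the locale's default.
main :: IO ()
main = do
  h <- openFile "enc-test.txt" WriteMode
  hSetEncoding h utf8            -- attach a converter to the Handle
  hPutStr h "z\x142oty"          -- "zloty" with U+0142
  hClose h
  h' <- openFile "enc-test.txt" ReadMode
  hSetEncoding h' utf8
  s <- hGetContents h'
  putStrLn s
  hClose h'
```

System.IO also exposes latin1, localeEncoding and mkTextEncoding for choosing other converters.)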
4. Libraries are reviewed to ensure that they work with various encoding settings.
There are limits to the extent to which this can be achieved. E.g. what happens if you set the encoding to UTF-8, then call getDirectoryContents for a directory which contains filenames which aren't valid UTF-8 strings?
5. The default encoding is settable from Haskell, defaults to the locale encoding.
I feel that the default encoding should be one whose decoder cannot fail, e.g. ISO-8859-1. You should have to explicitly request the use of the locale's encoding (analogous to calling setlocale(LC_CTYPE, "") at the start of a C program; there's a good reason why C doesn't do this without being explicitly told to).

Actually, the more I think about it, the more I think that "simple, stupid programs" probably shouldn't be using Unicode at all. I.e. Char, String, string literals, and the I/O functions in Prelude, IO etc should all be using bytes, with a distinct wide-character API available for people who want to make the (substantial) effort involved in writing (genuinely) internationalised programs.

Right now, the attempt at providing I18N "for free", by defining Char to mean Unicode, has essentially backfired, IMHO. Anything that isn't ISO-8859-1 just doesn't work for the most part, and anyone who wants to provide real I18N first has to work around the pseudo-I18N that's already there (e.g. convert Chars back into Word8s so that they can decode them into real Chars). Oh, and because bytes are being stored in Chars, the type system won't help if you neglect to decode a string, or if you decode it twice.
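The work-around described above (pulling the bytes back out of a latin-1-decoded String and re-decoding them) can be sketched with today's bytestring and text packages; the function names here are illustrative, not standard:

```haskell
import Data.Char (ord)
import Data.Word (Word8)
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- A String read under the ISO-8859-1 assumption carries one byte per
-- Char; recover those bytes, then decode them as the encoding that was
-- really meant.
stringToBytes :: String -> [Word8]
stringToBytes = map (fromIntegral . ord)

-- Note: decodeUtf8 throws an exception on invalid input.
reinterpretAsUtf8 :: String -> T.Text
reinterpretAsUtf8 = TE.decodeUtf8 . B.pack . stringToBytes
```

Note that nothing in the types stops you from applying this to a String that was already properly decoded, which is exactly the double-decoding hazard mentioned above.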
The advantage of assuming ISO-8859-* is that the decoder can't fail; every possible stream of bytes is valid.
Only under that assumption, and UTF-8 is not ISO-8859-*. When I someday change most of my files and filenames from ISO-8859-2 to UTF-8, and change the locale to match, the assumption will be wrong. I can't make that change now, because too many programs would break.
The current ISO-8859-1 assumption is also wrong. A program written in Haskell which sorts strings gives the wrong order for non-ASCII letters even now, when the strings are actually ISO-8859-2, unless the encoding is specified otherwise.
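A concrete instance of that breakage: in ISO-8859-2 the byte 0xB6 is U+015B (LATIN SMALL LETTER S WITH ACUTE), but read under the latin-1 assumption it becomes U+00B6 (PILCROW SIGN), so character classification, and hence collation, goes wrong:

```haskell
import Data.Char (isAlpha)

misread, intended :: Bool
misread  = isAlpha '\xB6'   -- False: the letter has become punctuation
intended = isAlpha '\x15B'  -- True: the character the byte actually meant
```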
1. In that situation, you can't avoid the encoding issues. It doesn't matter what the default is, because you're going to have to set the encoding anyhow.
2. If you assume ISO-8859-1, you can always convert back to Word8 then re-decode as UTF-8. If you assume UTF-8, anything which is neither UTF-8 nor ASCII will fail far more severely than just getting the collation order wrong.
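The asymmetry can be seen with the text package's decoders (the byte values below are ISO-8859-2, chosen so that they are not valid UTF-8):

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (UnicodeException)

-- 0xEA is a letter in ISO-8859-2, but as UTF-8 it announces a
-- three-byte sequence that never arrives, so it is invalid UTF-8.
latin2Bytes :: B.ByteString
latin2Bytes = B.pack [0x77, 0xEA]

-- Every byte sequence is valid latin-1, so this decoder is total:
asLatin1 :: T.Text
asLatin1 = TE.decodeLatin1 latin2Bytes

-- The UTF-8 decoder has to be able to report failure:
asUtf8 :: Either UnicodeException T.Text
asUtf8 = TE.decodeUtf8' latin2Bytes   -- a Left: decoding fails
```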
The key problem with using the locale is that you frequently encounter files which aren't in the locale's encoding, and for which the encoding can't easily be deduced.
Programs should either explicitly set the encoding for I/O on these files to ISO-8859-1, or manipulate them as binary data.
Well, my view is essentially that files should be treated as containing bytes unless you explicitly choose to decode them, at which point you have to specify the encoding.
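That view can be sketched as follows; readTextFile is an illustrative name, not an existing function:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Files are read as bytes; decoding is a separate, explicit step in
-- which the caller names the encoding.
readTextFile :: (B.ByteString -> T.Text) -> FilePath -> IO T.Text
readTextFile decode path = fmap decode (B.readFile path)
```

e.g. readTextFile TE.decodeUtf8 "notes.txt" when the caller asserts the file is UTF-8, or readTextFile TE.decodeLatin1 "legacy.txt" when any byte sequence must be accepted.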
The problem is that the API for this hasn't even been designed yet, so programs can't be written today in a way that will keep working once the default encoding changes.
Personally, I would take the C approach: redefine Char to mean a byte
(i.e. CChar), treat string literals as bytes, keep the existing type
signatures on all of the existing Haskell98 functions, and provide a
completely new wide-character API for those who wish to use it.
That gets the failed attempt at I18N out of everyone's way with a
minimum of effort and with maximum backwards compatibility for
existing code.
Given the frequency with which this issue crops up, and the associated
lack of action to date, I'd rather not have to wait until someone
finally gets around to designing the new, improved,
genuinely-I18N-ised API before we can read/write arbitrary files
without too much effort.
My main concern is that someone will get sick of waiting and make the
wrong "fix", i.e. keep the existing API but default to the locale's
encoding, so that every simple program then has to explicitly set it
back to ISO-8859-1 to get reasonable worst-case behaviour.
--
Glynn Clements