
Glynn Clements wrote:
1. API for manipulating byte sequences in I/O (without representing them in String type).
Note that this needs to include all of the core I/O functions, not just reading/writing streams. E.g. FilePath is currently an alias for String, but (on Unix, at least) filenames are strings of bytes, not characters. Ditto for argv, environment variables, possibly other cases which I've overlooked.
They don't hold binary data; they hold data intended to be interpreted as text. If the encoding of the text doesn't agree with the locale, the environment setup is broken and 'ls' and 'env' misbehave on a UTF-8 terminal. A program can explicitly set the default encoding to ISO-8859-1 if it wishes to do something in such a broken environment.
4. Libraries are reviewed to ensure that they work with various encoding settings.
There are limits to the extent to which this can be achieved. E.g. what happens if you set the encoding to UTF-8, then call getDirectoryContents for a directory which contains filenames which aren't valid UTF-8 strings?
The library fails. Don't do that: such an environment is internally inconsistent.
I feel that the default encoding should be one whose decoder cannot fail, e.g. ISO-8859-1.
But filenames on my filesystem and most file contents are *not* encoded in ISO-8859-1. Assuming that they are ISO-8859-1 is plainly wrong.
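The disagreement can be made concrete: a Latin-1 decoder is total (every byte sequence decodes), while any UTF-8 decoder is necessarily partial. The sketch below is illustrative only, not a library API, and the UTF-8 decoder is simplified (it does not fully reject overlong sequences):

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- Latin-1 decoding is total: every byte maps to a code point.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- A (simplified) UTF-8 decoder is necessarily partial:
-- some byte sequences have no decoding at all.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just []
decodeUtf8 (b:bs)
  | b < 0x80              = (chr (fromIntegral b) :) <$> decodeUtf8 bs
  | b >= 0xC2 && b < 0xE0 = cont 1 (fromIntegral b - 0xC0) bs
  | b >= 0xE0 && b < 0xF0 = cont 2 (fromIntegral b - 0xE0) bs
  | b >= 0xF0 && b < 0xF5 = cont 3 (fromIntegral b - 0xF0) bs
  | otherwise             = Nothing   -- invalid lead byte
  where
    cont :: Int -> Int -> [Word8] -> Maybe String
    cont 0 acc rest = (chr acc :) <$> decodeUtf8 rest
    cont n acc (c:rest)
      | c >= 0x80 && c < 0xC0 = cont (n - 1) (acc * 64 + fromIntegral c - 0x80) rest
    cont _ _ _ = Nothing              -- truncated or invalid continuation
```

So `decodeLatin1` never fails on a filename, while `decodeUtf8` must return `Nothing` for a byte like 0xFF that can appear in a filename.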
You should have to explicitly request the use of the locale's encoding (analogous to calling setlocale(LC_CTYPE, "") at the start of a C program; there's a good reason why C doesn't do this without being explicitly told to).
C usually uses the paradigm of representing texts in their original 8-bit encodings. This is why getting C programs to work in a UTF-8 locale is such a pain; only some programs use wchar_t internally. Java and C# use the paradigm of representing text as Unicode internally, recoding it at the boundaries with the external world.

The second paradigm has a cost: you must be aware of which encodings are used in the texts you manipulate. The locale gives a reasonable default for simple programs which aren't supposed to work with multiple encodings, and it specifies the encoding of texts which don't have an encoding specified elsewhere (terminal I/O, filenames, environment variables). It also has benefits:
1. It's easier to work with multiple encodings, because the internal representation can represent text decoded from any of them and is the same in all parts of the program.
2. It's much easier to work in a UTF-8 environment, and to work with libraries which use Unicode internally (e.g. Gtk+ or Qt).
3. isAlpha, toUpper etc. are true pure functions. (The Haskell API is broken in a different way here: toUpper should be defined in terms of strings, not characters.)
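Point 3 can be seen directly in Haskell's Data.Char: with Unicode Chars, character classification needs no locale lookup, so these functions give the same answer everywhere.

```haskell
import Data.Char (isAlpha, toUpper)

-- Pure, locale-independent classification of a non-ASCII character.
accentedIsAlpha :: Bool
accentedIsAlpha = isAlpha '\xE9'   -- 'é' is alphabetic

accentedUpper :: Char
accentedUpper = toUpper '\xE9'     -- '\xC9', i.e. 'É'
```

Compare C, where isalpha(0xE9) depends on the current LC_CTYPE setting.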
Actually, the more I think about it, the more I think that "simple, stupid programs" probably shouldn't be using Unicode at all.
This attitude causes them to break in a UTF-8 environment, which is why I can't use it as a default yet. The ncurses wide-character API is still broken: I reported bugs, the author acknowledged them, but hasn't fixed them. (Attributes are ignored on add_wch; get_wch is wrong for non-ASCII keys if the locale is neither ISO-8859-1 nor UTF-8.) It seems people don't use that API yet, because C traditionally uses the model of representing texts as byte sequences. But the narrow-character API of ncurses is unusable with UTF-8 - this is not an implementation problem but an inherent limitation of the interface.
I.e. Char, String, string literals, and the I/O functions in Prelude, IO etc should all be using bytes, with a distinct wide-character API available for people who want to make the (substantial) effort involved in writing (genuinely) internationalised programs.
This would cause excessive duplication of APIs. Look, Java and C# don't do that. Only file contents handling needs a byte API, because many files don't contain text. This would imply isAlpha :: Char -> IO Bool.
Right now, the attempt at providing I18N "for free", by defining Char to mean Unicode, has essentially backfired, IMHO.
Because it needs to be accompanied by character recoders, both invoked explicitly (also lazily) and attached to file handles, and by a way to obtain recoders for various encodings. Assuming that the same encoding is used everywhere, and that programs can just copy bytes without interpreting them, no longer works today. A mail client is expected to respect the encoding set in the headers.
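For illustration: GHC's System.IO later gained exactly this kind of recoder support (hSetEncoding, latin1, mkTextEncoding). A self-contained sketch, with a hypothetical file name, that writes raw Latin-1 bytes and then reads them back through an explicitly attached recoder:

```haskell
import System.IO

-- Sketch of a recoder attached to a handle.  The file name
-- "demo-latin1.txt" is hypothetical.
latin1RoundTrip :: IO Bool
latin1RoundTrip = do
  -- write raw bytes: "caf" followed by 0xE9 ('é' in ISO-8859-1)
  wh <- openBinaryFile "demo-latin1.txt" WriteMode
  hPutStr wh "caf\xE9"
  hClose wh
  -- read them back, explicitly attaching a Latin-1 recoder
  rh <- openFile "demo-latin1.txt" ReadMode
  hSetEncoding rh latin1
  s <- hGetLine rh
  hClose rh
  return (s == "caf\xE9")
```

For an encoding named at run time (e.g. taken from a mail header), mkTextEncoding "ISO-8859-2" produces the recoder instead of the fixed latin1 value.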
Oh, and because bytes are being stored in Chars, the type system won't help if you neglect to decode a string, or if you decode it twice.
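One way the type system could help, sketched with hypothetical newtype wrappers (the names are invented for illustration): if undecoded data had its own type, a forgotten or doubled decode would be a compile-time error rather than silent mojibake.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical distinct types for undecoded and decoded data.
newtype RawBytes   = RawBytes [Word8]
newtype UnicodeStr = UnicodeStr String deriving (Eq, Show)

decodeLatin1 :: RawBytes -> UnicodeStr
decodeLatin1 (RawBytes bs) = UnicodeStr (map (chr . fromIntegral) bs)

-- decodeLatin1 (decodeLatin1 (RawBytes []))  -- rejected: type error
```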
This is why I said "1. API for manipulating byte sequences in I/O (without representing them in String type)".
2. If you assume ISO-8859-1, you can always convert back to Word8 then re-decode as UTF-8. If you assume UTF-8, anything which is neither UTF-8 nor ASCII will fail far more severely than just getting the collation order wrong.
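The trick in point 2 works because ISO-8859-1 maps bytes 0x00..0xFF one-to-one onto code points U+0000..U+00FF, so decoding is invertible. A minimal sketch:

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Latin-1 decoding is a bijection between bytes and U+0000..U+00FF,
-- so a String read under the ISO-8859-1 assumption can always be
-- turned back into the original bytes and re-decoded later.
latin1ToBytes :: String -> [Word8]
latin1ToBytes = map (fromIntegral . ord)

bytesToLatin1 :: [Word8] -> String
bytesToLatin1 = map (chr . fromIntegral)
```

bytesToLatin1 (latin1ToBytes s) == s holds for any s produced by Latin-1 decoding; no such inverse exists once a UTF-8 decode has failed.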
If I know the encoding, I should set the I/O handle to that encoding in the first place instead of reinterpreting characters which have been read using the default.
Well, my view is essentially that files should be treated as containing bytes unless you explicitly choose to decode them, at which point you have to specify the encoding.
No problem, you can use the byte I/O API for text files if you wish. But it will not work vice versa.
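The byte-oriented file API being asked for here can be sketched with Data.ByteString, which later filled this role in the Haskell libraries (the file names below are hypothetical):

```haskell
import qualified Data.ByteString as B

-- Copy arbitrary binary data with no decoding step anywhere:
-- bytes in, bytes out, regardless of locale or encoding.
copyBinary :: FilePath -> FilePath -> IO ()
copyBinary src dst = B.readFile src >>= B.writeFile dst
```

Text handling can be layered on top of such an API by decoding explicitly; the reverse, recovering bytes from an already-decoded handle, is what "will not work vice versa" refers to.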
Personally, I would take the C approach: redefine Char to mean a byte (i.e. CChar), treat string literals as bytes, keep the existing type signatures on all of the existing Haskell98 functions, and provide a completely new wide-character API for those who wish to use it.
Well, this is the paradigm which has problems in all the areas above: it often breaks in a UTF-8 locale, it needs isAlpha :: Char -> IO Bool, and it makes supporting multiple encodings painful. Char is *the* new API. What is missing is a byte API in areas which work with arbitrary binary data (mostly file contents).
My main concern is that someone will get sick of waiting and make the wrong "fix", i.e. keep the existing API but default to the locale's encoding, so that every simple program then has to explicitly set it back to ISO-8859-1 to get reasonable worst-case behaviour.
Supporting byte I/O and supporting character recoding need to be done before this.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/