
Glynn Clements wrote:
Unless you are the sole user of a system, you have no control over what filenames may occur on it (and even if you are the sole user, you may wish to use packages which don't conform to your rules).
For these occasions you may set the encoding to ISO-8859-1. But then you can't sensibly show the filenames to the user in a GUI, nor in ncurses using the wide character API, nor can you sensibly store them in a file which must always be encoded in UTF-8 (e.g. an XML file, where you can't put raw bytes without knowing their encoding). There are two paradigms: manipulating bytes without knowing their encoding, and manipulating characters explicitly encoded in various encodings (possibly UTF-8). The world is slowly migrating from the first to the second.
There are limits to the extent to which this can be achieved. E.g. what happens if you set the encoding to UTF-8, then call getDirectoryContents for a directory which contains filenames which aren't valid UTF-8 strings?
The library fails. Don't do that: such an environment is internally inconsistent.
Call it what you like, it's a reality, and one which programs need to deal with.
The reality is that filenames are encoded differently depending on the system: sometimes ISO-8859-1, sometimes ISO-8859-2, sometimes UTF-8. We should not ignore the possibility of UTF-8-encoded filenames. In CLisp the corresponding call fails silently (undecodable filenames are simply skipped), which is bad. It should fail loudly.
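To make "fail loudly" concrete, here is a minimal sketch of a strict UTF-8 decoder in Haskell. The name decodeUtf8 is invented for illustration, and the validation is simplified: overlong forms and surrogate code points are not rejected.

    import Data.Word (Word8)
    import Data.Char (chr)

    -- Hypothetical strict decoder: returns Nothing on ill-formed input
    -- instead of silently skipping it.  (Simplified: overlong forms and
    -- surrogates are not rejected.)
    decodeUtf8 :: [Word8] -> Maybe String
    decodeUtf8 [] = Just []
    decodeUtf8 (b:bs)
      | b < 0x80              = (chr (fromIntegral b) :) <$> decodeUtf8 bs
      | b >= 0xC2 && b < 0xE0 = multi 1 (fromIntegral b - 0xC0) bs
      | b >= 0xE0 && b < 0xF0 = multi 2 (fromIntegral b - 0xE0) bs
      | b >= 0xF0 && b < 0xF5 = multi 3 (fromIntegral b - 0xF0) bs
      | otherwise             = Nothing  -- invalid leading byte: fail loudly
      where
        multi :: Int -> Int -> [Word8] -> Maybe String
        multi 0 acc rest
          | acc <= 0x10FFFF = (chr acc :) <$> decodeUtf8 rest
          | otherwise       = Nothing    -- beyond the Unicode range
        multi n acc (c:rest)
          | c >= 0x80 && c < 0xC0 =
              multi (n - 1) (acc * 64 + fromIntegral c - 0x80) rest
        multi _ _ _ = Nothing            -- truncated or malformed sequence

A directory listing obtained as raw bytes could then be decoded name by name: an ill-formed name yields Nothing instead of disappearing from the result.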
Most programs don't care whether any filenames which they deal with are valid in the locale's encoding (or any other encoding). They just receive lists (i.e. NUL-terminated arrays) of bytes and pass them directly to the OS or to libraries.
And this is why I can't switch my home environment to UTF-8 yet. Too many programs are broken: almost all terminal programs that use more than stdin and stdout in their default modes, i.e. that do line editing or work in full screen. How would you display a filename in a full-screen text editor such that it works in a UTF-8 environment?
If the assumed encoding is ISO-8859-*, this program will work regardless of the filenames which it is passed or the contents of the file (modulo the EOL translation on Windows). OTOH, if it were to use UTF-8 (e.g. because that was the locale's encoding), it wouldn't work correctly if either filename or the file's contents weren't valid UTF-8.
A program is not supposed to encounter filenames which are not representable in the locale's encoding. In your setting it's impossible to display a filename in any way other than writing its raw bytes to stdout.
More accurately, it specifies which encoding to assume when you *need* to know the encoding (i.e. for ctype.h etc.) and can't obtain that information from a more reliable source.
In the case of filenames there is no more reliable source.
My central point is that the existing API forces the encoding to be an issue when it shouldn't be.
It is an unavoidable issue because not every interface in a given computer system uses the same encoding. Gtk+ uses UTF-8; you must convert text to UTF-8 in order to display it, and in order to convert you must know its encoding.
Well, to an extent it is an implementation issue. Historically, curses never cared about encodings. A character is a byte, you draw bytes on the screen, curses sends them directly to the terminal.
This is the old API. But the newer ncurses API is prepared even for combining accents: a character is coded as a sequence of wchar_t values, where all but the first are combining characters.
Furthermore, the curses model relies upon monospaced fonts, and falls down once you encounter CJK text (where a "monospaced" font means one whose glyphs are an integer multiple of the cell size, not necessarily a single cell).
It doesn't fall down. Characters may span several columns: there is wcwidth(), and the X/Open curses specification says how it should behave for wide CJK characters. I haven't tested it, but I believe ncurses supports them.
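For illustration, wcwidth(3) can be reached from Haskell through the FFI. This sketch assumes a platform where wchar_t holds a Unicode code point (e.g. glibc) and that the locale has already been set up with setlocale:

    {-# LANGUAGE ForeignFunctionInterface #-}
    import Foreign.C.Types (CInt, CWchar)
    import Data.Char (ord)

    -- wcwidth reports how many terminal columns a character occupies:
    -- 1 for most characters, 2 for wide CJK characters, 0 for combining
    -- marks, -1 for non-printable characters.  The result depends on the
    -- current locale, hence the IO import.
    foreign import ccall unsafe "wchar.h wcwidth"
      c_wcwidth :: CWchar -> IO CInt

    charColumns :: Char -> IO Int
    charColumns c = fromIntegral <$> c_wcwidth (fromIntegral (ord c))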
Extending something like curses to handle encoding issues is far from trivial; which is probably why it hasn't been finished yet.
It's almost finished. The API specification was ready in 1997. It works in ncurses modulo unfixed bugs. But programs can't use it unless they use Unicode internally.
Although, if you're going to have implicit String -> [Word8] converters, there's no reason why you can't do the reverse and have isAlpha :: Word8 -> IO Bool. Though, like ctype.h, this will only work for single-byte encodings.
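A minimal sketch of what such a byte-level classifier could look like, binding directly to ctype.h. The name isAlphaByte is invented; the IO result reflects the fact that the answer depends on the current locale:

    {-# LANGUAGE ForeignFunctionInterface #-}
    import Foreign.C.Types (CInt)
    import Data.Word (Word8)

    -- isalpha from ctype.h classifies a single byte according to the
    -- current locale, which is global mutable state -- hence IO.
    foreign import ccall unsafe "ctype.h isalpha"
      c_isalpha :: CInt -> IO CInt

    isAlphaByte :: Word8 -> IO Bool
    isAlphaByte b = (/= 0) <$> c_isalpha (fromIntegral b)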
We should not ignore multibyte encodings like UTF-8, which means that Haskell should have a Unicode character type. And it's already specified in Haskell 98 that Char is such a type! What is missing is an API for manipulating binary files, and for converting between byte streams and character streams using particular text encodings.
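Both missing pieces did eventually appear: modern GHC lets a handle be switched to binary mode or to an explicit text encoding. A small sketch using today's System.IO (this thread predates that API):

    import System.IO

    -- Recode a file from ISO-8859-1 to UTF-8 by giving each handle an
    -- explicit text encoding instead of relying on the locale default.
    recodeLatin1ToUtf8 :: FilePath -> FilePath -> IO ()
    recodeLatin1ToUtf8 from to =
      withFile from ReadMode $ \hIn ->
        withFile to WriteMode $ \hOut -> do
          hSetEncoding hIn  latin1   -- bytes 0-255 map directly to Chars
          hSetEncoding hOut utf8     -- Chars are written out as UTF-8
          hGetContents hIn >>= hPutStr hOut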
A mail client is expected to respect the encoding set in headers.
A client typically needs to know the encoding in order to display the text.
This is easier to handle when the String type means Unicode.
As a counter-example, a mail *server* can do its job without paying any attention to the encodings used. It can also handle non-MIME email (which doesn't specify any encoding at all), regardless of the encoding actually used.
So it should push bytes, not characters.
This is why I said "1. API for manipulating byte sequences in I/O (without representing them in String type)".
Yes. But that API also needs to include functions such as those in the Directory and System modules.
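What such byte-level counterparts might look like, sketched here as signatures with invented names and stub bodies (the unix package later grew ByteString-based variants along these lines):

    import Data.Word (Word8)
    import System.IO (Handle, IOMode)

    -- Invented names, for illustration only: byte-level counterparts to
    -- the Directory/System functions, taking filenames as raw bytes.
    type RawFilePath = [Word8]

    getDirectoryContentsRaw :: RawFilePath -> IO [RawFilePath]
    getDirectoryContentsRaw = undefined   -- would call readdir(3) directly

    openFileRaw :: RawFilePath -> IOMode -> IO Handle
    openFileRaw = undefined               -- would call open(2) directly

    removeFileRaw :: RawFilePath -> IO ()
    removeFileRaw = undefined             -- would call unlink(2)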
If deemed really necessary, I will not fight against them.
It isn't just about reading and writing streams. Most of the Unix API (kernel, libc, and many standard libraries) is byte-oriented rather than character-oriented.
Because they are primarily used from C, which uses the older paradigm of handling text: represent it in an unspecified external encoding rather than in Unicode. OTOH, newer Windows APIs use Unicode. Haskell aims at being portable. It's easier to emulate the traditional C paradigm on top of the Unicode paradigm than vice versa, and Haskell already tries to specify that it uses Unicode internally.
2. If you assume ISO-8859-1, you can always convert back to Word8 and then re-decode as UTF-8. If you assume UTF-8, anything which is neither UTF-8 nor ASCII will fail far more severely than just getting the collation order wrong.
If I know the encoding, I should set the I/O handle to that encoding in the first place instead of reinterpreting characters which have been read using the default.
And if you don't know the encoding?
Then it's not possible to recode it to something else. But when it is possible because the encoding is known, it's easier to use a single internal encoding everywhere than to determine two encodings on each transition.
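A sketch of the round trip described in point 2 above: characters read under the ISO-8859-1 assumption preserve the original bytes exactly, so they can always be recovered and re-decoded later, whereas the opposite direction is lossy:

    import Data.Char (ord, chr)
    import Data.Word (Word8)

    -- Under ISO-8859-1, each Char is exactly one byte, so decoding is
    -- invertible and the raw bytes can always be recovered:
    latin1Decode :: [Word8] -> String
    latin1Decode = map (chr . fromIntegral)

    latin1Bytes :: String -> [Word8]
    latin1Bytes = map (fromIntegral . ord)

    -- latin1Bytes (latin1Decode bs) == bs for every bs; the recovered
    -- bytes can then be re-decoded, e.g. as UTF-8.  Reading as UTF-8 in
    -- the first place would have destroyed any ill-formed sequences.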
Agreed. But writing programs which support I18N, multi-byte encodings, wide character sets (>256 codepoints) and the like on an OS whose core API is byte-oriented involves work.
It's not that hard if you are willing to sacrifice support for every broken configuration. I did it myself, albeit without serious testing in real-world situations and without trying to interface with too many libraries.
And it can't all be hidden within a library. Some of the work falls on the application programmers, who have to deal with determining the correct encoding in each situation, converting between encodings, handling encoding and decoding failures (e.g. when you encounter a Unicode filename but the terminal only has Latin1), and so on.
Indeed.
My view is that, right now, we have the worst of both worlds, and taking a short step backwards (i.e. narrow the Char type and leave the rest alone) is a lot simpler (and more feasible) than the long journey towards real I18N.
It would bury any hope of supporting a UTF-8 environment. I've heard that RedHat tried to impose UTF-8 by default. It was mostly a failure because it was too early: too many programs are not ready for it. I guess the RedHat move helped to identify some of them. But UTF-8 will inevitably become usable in the future. It would be great if Haskell programs were in the group which can support it, instead of being abandoned for lack of Unicode support in the language they are written in.

--
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/