
Glynn Clements
[Actually, regarding on-screen display, this is also an issue for Unicode. How many people actually have all of the Unicode glyphs? I certainly don't.]
If I don't have a particular character in my fonts, I will not create files with it in their names. Actually I only use 9 Polish letters in addition to ASCII, and even those rarely; usually it's only a subset of ASCII.

Some programs use UTF-8 in filenames no matter what the locale is. For example the Evolution mail program, which stores mail folders as files under names the user entered in a GUI. I had to rename some of these files in order to import them into Gnus, as it choked on filenames with strange characters, never mind that it didn't display them correctly (maybe because it tried to map them to virtual newsgroup names, or maybe because they are control characters in ISO-8859-x). If all programs consistently used the locale encoding for filenames, this would have worked.

When I switch my environment to UTF-8, which may happen in a few years, I will convert filenames to UTF-8 and set up mount options to translate vfat filenames to/from UTF-8 instead of to ISO-8859-2. I expect good programs to understand that and display filenames correctly no matter what technique they use for the display. For example the Epiphany web browser, when I open the file:/home/users/qrczak URL, displays ISO-8859-2-encoded filenames correctly. The virtual HTML file it creates from the directory listing has &#x105; in its <title> where the directory name had 0xB1 in ISO-8859-2. When I run Epiphany with the locale set to pl_PL.UTF-8, it displays UTF-8 filenames correctly and ISO-8859-2 filenames are not shown at all. It's fine with me that it doesn't deal with wrongly encoded filenames, because this allows it to treat well-encoded filenames correctly. For a web page rendered on the screen it makes no sense to display raw bytes; Epiphany treats filenames as sequences of characters encoded according to the locale.
And even to the extent that it can be done, it will take a long time. Outside of the Free Software ghetto, long-term backward compatibility still means a lot.
Windows has already switched most of its internals to Unicode, and it did it faster than Linux.
In CLisp it fails silently (undecodable filenames are skipped), which is bad. It should fail loudly.
No, it shouldn't fail at all.
Since it uses Unicode as its string representation, accepting filenames not encoded in the locale encoding would imply making garbage out of filenames which *are* correctly encoded in the locale encoding. In a UTF-8 environment the character U+00E1 in a filename means the bytes 0xC3 0xA1 on an ext2 filesystem (and the 16-bit unit 0x00E1 on a vfat filesystem), so it can't at the same time mean the single byte 0xE1 on ext2.
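As an aside (my illustration, not part of the original exchange), the byte values above can be checked with the Haskell bytestring and text packages:

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    main :: IO ()
    main = do
      -- U+00E1 (a with acute) encoded as UTF-8 is the two bytes 0xC3 0xA1:
      print (B.unpack (TE.encodeUtf8 (T.singleton '\x00E1')))  -- [195,161]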
And this is why I can't switch my home environment to UTF-8 yet. Too many programs are broken: almost all terminal programs which use more than stdin and stdout in their default modes, i.e. which use line editing or work full-screen. How would you display a filename in a full-screen text editor, such that it works in a UTF-8 environment?
So, what are you suggesting? That the whole world switches to UTF-8?
No, each computer system decides for itself, and announces it in the locale setting. I'm suggesting that programs should respect that and correctly handle all correctly encoded texts, including filenames. Better programs may offer to choose the encoding explicitly when it makes sense (e.g. text file editors for opening a file), but if they don't, they should at least accept the locale encoding.
Or that every program should pass everything through iconv() (and handle the failures)?
If it uses Unicode as its internal string representation, yes (because the OS API on Unix generally uses byte encodings rather than Unicode). This should be done transparently in the libraries of the respective languages instead of in each program independently.
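A minimal sketch of what such a transparent library layer looks like, using the encoding hooks that GHC eventually grew in GHC.IO.Encoding (an assumption beyond this thread): the program sees filenames only as Unicode Strings, and the runtime converts them with an encoding derived from the locale.

    import GHC.IO.Encoding (getFileSystemEncoding)
    import System.Directory (getDirectoryContents)

    main :: IO ()
    main = do
      enc <- getFileSystemEncoding   -- normally the locale encoding
      print enc                      -- e.g. UTF-8 under a UTF-8 locale
      -- filenames arrive as already-decoded Unicode Strings:
      getDirectoryContents "." >>= mapM_ putStrLn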
A program is not supposed to encounter filenames which are not representable in the locale's encoding.
Huh? What does "supposed to" mean in this context? That everything would be simpler if reality weren't the way it is?
It means that if it encounters a filename encoded differently, it's usually not the fault of the program but of whoever caused the mismatch in the first place.
In your setting it's impossible to display a filename in a way other than printing to stdout.
Writing to stdout doesn't amount to "displaying" anything; stdout doesn't have to be a terminal.
I know; that's not the point. The point is that display channels other than a stdout connected to a terminal often work in terms of characters rather than bytes in some implicit encoding. For example various GUI frameworks, and wide-character ncurses.
Sure; but that doesn't automatically mean that the locale's encoding is correct for any given filename. The point is that you often don't need to know the encoding.
What if I do need to know the encoding? I must assume something.
Converting a byte string to a character string when you're just going to be converting it back to the original byte string is pointless.
It's necessary if the channel through which the filename is transferred uses Unicode text, or bytes in some explicitly chosen encoding, rather than raw bytes in some unspecified encoding. The channel might be:

- a GUI API (e.g. UTF-8 for Gtk+ or UTF-16 for Qt)
- an X selection copied & pasted between programs, if it uses UTF-8
- email contents, if encoded differently than the filename
- copy & paste involving an MS-DOS emulation window, which definitely uses a different encoding
- a database field which uses e.g. UTF-16 internally
- an XML file encoded in UTF-8

Overall it's better to use Unicode internally, because then only the places which are inherently incapable of expressing characters outside some encoding cause loss of those characters. They should not block these characters when they are merely moved between sources which can express them. I would be upset if a web browser refused to show Cyrillic web pages on a graphical display only because my locale doesn't include Cyrillic letters. Since the web page and the fonts may use different encodings, Unicode is a natural mediator. The design of a web browser is simpler if all its texts are kept in the same encoding, converted only at I/O, rather than if every text has an explicit encoding attached.
And it introduces unnecessary errors. If the only difference between (encode . decode) and the identity function is that the former sometimes fails, what's the point?
The point is in not having to remember the encoding of strings manipulated by the program. Encodings matter only for input and output, not for processing.
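A small demonstration of that round trip, using the Haskell bytestring and text packages (my illustration; the byte values are arbitrary examples): treating bytes as ISO-8859-1 makes (encode . decode) the identity, while decoding the same bytes as UTF-8 can fail.

    import qualified Data.ByteString as B
    import qualified Data.ByteString.Char8 as C
    import qualified Data.Text.Encoding as TE

    main :: IO ()
    main = do
      let bytes = B.pack [0xB1, 0x65]  -- 0xB1: valid Latin-1, invalid UTF-8
      -- the Latin-1-style round trip is the identity on any byte string:
      print (C.pack (C.unpack bytes) == bytes)   -- True
      -- a strict UTF-8 decode of the same bytes fails instead:
      print (TE.decodeUtf8' bytes)               -- Left <decode error>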
It frequently *is* an avoidable issue, because not every interface uses *any* encoding. Most of the standard Unix utilities work fine without even considering encodings.
Many of them broke because they did not consider encodings. But today 'sort' works in UTF-8 too. Those which don't have to consider encodings typically manipulate byte streams rather than text streams.
I'm not suggesting that we ignore them. I'm suggesting that we:
1. Don't provide a broken API which makes it impossible to write programs which work reliably in the real world, rather than in some fantasy world where inconveniences, like filenames which don't match the locale's encoding, never happen.
It is possible in my setting. Just set the default encoding of the program to ISO-8859-1 (it should only *default* to the locale encoding, but should be overridable from the program). But then you had better not try to show filenames to the user, unless your interface is just stdin/stdout.
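A sketch of that override, again using GHC's later encoding hooks purely as an illustration: with the filename encoding forced to ISO-8859-1, every possible byte string decodes successfully and round-trips unchanged.

    import GHC.IO.Encoding (setFileSystemEncoding, latin1)
    import System.Directory (getDirectoryContents)

    main :: IO ()
    main = do
      -- bytes 0x00-0xFF map one-to-one onto U+0000-U+00FF, so no filename
      -- is ever rejected (non-Latin-1 names just display wrongly):
      setFileSystemEncoding latin1
      getDirectoryContents "." >>= mapM_ print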
2. Don't force everyone to deal with all of the complexities involved in character encoding even when they shouldn't have to.
I don't see how to have this property and at the same time make writing programs which do handle various encodings reasonably easy. With my choices all Haskell APIs use Unicode, so once libraries which interface with the world are written, the program passes strings between them without recoding. With your choices the API for filenames uses a different encoding than the API for GUI, so the conversion logic must be put in each program separately.
And, given that Unicode isn't a simple "one code, one character" system (what with composing characters), it isn't actually all that much simpler than dealing with multi-byte strings.
Composing characters are not relevant for recoding for I/O and for putting email contents on the wire. And for GUIs they are handled by already written libraries rather than by each program (e.g. Pango in Gnome on Linux).
The main advantage of Unicode for display is that there's only one encoding. Unfortunately, given that most of the existing Unicode fonts are a bit short on actual glyphs, you typically just end up converting the Unicode back into pseudo-ISO-2022 anyhow.
Again, that's a problem for the GUI libraries. And TTF fonts have their character set expressed in Unicode, AFAIK.
So it should push bytes, not characters.
And so should a lot of software. But it helps if languages and libraries don't go to great lengths to try to coerce everything into characters.
It's as bad to manipulate everything in terms of bytes. Programs should generally have a choice.
OTOH newer Windows APIs use Unicode.
Haskell aims at being portable. It's easier to emulate the traditional C paradigm in the Unicode paradigm than vice versa.
I'm not entirely sure what you mean by that, but I think that I disagree. The C/Unix approach is more general; it isn't tied to any specific encoding.
If filenames were expressed as bytes in the Haskell program, how would you map them to WinAPI? If you use the current Windows code page, the set of valid characters is limited without a good reason.
It's not that hard if you're willing to sacrifice support for broken configurations. I did it myself, albeit without serious testing in real-world situations and without trying to interface with too many libraries.
I take it that, by "broken", you mean any string of bytes (file, string, network stream, etc) which neither explicitly specifies its encoding(s) nor uses your locale's encoding?
No - you can treat file contents as a sequence of bytes rather than a sequence of characters, and not recode them at all. In fact you have to do that anyway to avoid mangling bytes 13 and 10 (CR and LF). Distinguishing text from binary data is not a new requirement.
If they tried a decade hence, it would still be too early. The single-byte encodings (ISO-8859-*, koi-8, win-12xx) aren't likely to disappear any time soon, nor is ISO-2022 (UTF-8 has quite spectacularly failed to make inroads in CJK-land; there are probably more UTF-8 users in the US than there are there).
Which is a pity. ISO-2022 is brain-damaged because of enormous complexity, and ISO-8859-x have small repertoires. I would not *force* UTF-8, but it should work for those who voluntarily choose to use it as their locale encoding. Including filenames.
Look, C has all of the functionality that we're talking about: wide characters, wide versions of string.h and ctype.h, and conversion between byte-streams and wide characters.
ctype.h is useless for UTF-8. There is no way to attach automatic recoders of explicitly chosen encodings to file handles. wchar_t is not very portable: on some systems it's UTF-32, on others it's UTF-16, and the C standard doesn't guarantee that it has anything to do with Unicode at all (I'm sure it was not Unicode on FreeBSD; I don't know whether that has changed). The iconv API is inconvenient for converting whole strings, because the user has to allocate the output buffer and keep resizing it if it was too small; it's also not available everywhere, and sometimes an extra library needs to be installed and linked.

Different C libraries use different string encodings:

- some use sequences of chars without an explicit encoding (perhaps the locale encoding should be assumed),
- some use UTF-8 (Gtk+),
- some use their own character type for UTF-16 (Qt, ICU), or for UTF-16 / UTF-32 depending on how they have been built (Python),
- some use wchar_t (curses), etc.

No, the C language doesn't make these issues easy, and it has lots of historic baggage.
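For contrast, the per-handle recoders missing from C do exist in Haskell's IO library (hSetEncoding in System.IO, a facility from later GHC versions; the filename input.txt is made up for the example):

    import System.IO

    main :: IO ()
    main = do
      h <- openFile "input.txt" ReadMode
      hSetEncoding h latin1      -- read the file as ISO-8859-1 text
      s <- hGetContents h
      hSetEncoding stdout utf8   -- and emit it as UTF-8
      putStr s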
But it did it without getting in the way of writing programs which don't care about encodings,
It does get in the way of writing programs which do care, because they must do the whole recoding themselves and remember which API has which character-set limitations.

--
   __("<         Marcin Kowalczyk
   \__/        qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/