
Wolfgang Thaller wrote:
> Of course, it's quite possible that the only test cases will be people using UTF-8-only (or even ASCII-only) systems, in which case you won't see any problems.
>
> I'm kind of hoping that we can just ignore a problem that is so rare that a large and well-known project like GTK2 can get away with ignoring it.
1. The filename issues in GTK-2 are likely to be a major problem in CJK locales, where filenames which don't match the locale (which is seldom UTF-8) are common.

2. GTK's filename handling only really applies to file selector dialogs. Most other uses of filenames in a GTK-based application don't involve GTK; they use the OS API functions, which just deal with byte strings.

3. GTK is a GUI library. Most of the text which it deals with is going to be rendered, so it *has* to be interpreted as characters. Treating it as blobs of data won't work.

IOW, on the question of whether or not to interpret byte strings as character strings, GTK is at the far end of the scale.
> Also, IIRC, Java strings are supposed to be unicode, too - how do they deal with the problem?
Files are represented by instances of the File class:

	http://java.sun.com/j2se/1.5.0/docs/api/java/io/File.html

	"An abstract representation of file and directory pathnames."

You can construct Files from Strings, and convert Files to Strings. The File class includes two sets of directory enumeration methods: list() returns an array of Strings, while listFiles() returns an array of Files.

The documentation for the File class doesn't mention encoding issues at all. However, with an interface which returns File objects directly, it would at least be possible to enumerate and open filenames which cannot be decoded.
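To illustrate the shape of that interface, here is a minimal sketch of list() versus listFiles() (the class name ListDemo and the scratch-directory name are invented for the example; which bytes-to-String decoding list() applies is left unspecified by the documentation):

```java
import java.io.File;
import java.io.IOException;

public class ListDemo {
    public static void main(String[] args) {
        try {
            // A scratch directory under the system temp dir.
            File dir = new File(System.getProperty("java.io.tmpdir"), "listdemo");
            dir.mkdirs();
            new File(dir, "example.txt").createNewFile();

            // list() yields entry names as Strings -- the on-disk bytes
            // have already been decoded to characters at this point.
            String[] names = dir.list();

            // listFiles() yields File objects instead of bare Strings.
            File[] files = dir.listFiles();

            // Round-trip: a File can be reconstructed from a String name.
            File again = new File(dir, names[0]);
            System.out.println(names[0] + " exists: " + again.exists());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```

Note that a File still stores its pathname internally as a String, so the decode step happens either way; the difference is only in which type the API hands back.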
> So we can't do Unicode-based I18N because there exist a few unix systems with messed-up file systems?
Declaring such systems to be "messed up" won't make the problems go away. If a design doesn't work in reality, it's the fault of the design, not of reality.
> In general, yes. But we're not talking about all of reality here, we're talking about one small part of reality - the question is, can the part of reality where the design doesn't work be ignored?
Sure, you *can* ignore it; K&R C ignored everything other than ASCII. If you limit yourself to locales which use the Roman alphabet (i.e. ISO-8859-N for N=1/2/3/4/9/15), you can get away with a lot. Most such users avoid encoding issues altogether by dropping the accents and sticking to ASCII, at least when dealing with files which might leave their system.

To get a better idea, you would need to consult users whose language doesn't use the Roman alphabet, e.g. CJK or Cyrillic. Unfortunately, you don't usually find many of them on lists such as this one. The only OSS project I'm familiar with which has a sizeable CJK user base is XEmacs (whose I18N revolves around ISO-2022, and much of whose documentation is in Japanese). Even there, there are separate mailing lists for English and Japanese, and the two seldom communicate.
> I think that if we wait long enough, the filename encoding problems will become irrelevant and we will live in an ideal world where unicode actually works. Maybe next year, maybe only in ten years.
Maybe not even then. If Unicode really solved encoding problems, you'd
expect the CJK world to be the first adopters, but they're actually
the least eager; you are more likely to find UTF-8 in an
English-language HTML page or email message than a Japanese one.
--
Glynn Clements