
Wolfgang Thaller wrote:
If you try to pretend that I18N comes down to shoe-horning everything into Unicode, you will turn the language into a joke.
How common will those problems you are describing be by the time this has been implemented? How common are they even now?
Right now, GHC assumes ISO-8859-1 whenever it has to automatically convert between String and CString. Conversions to and from ISO-8859-1 cannot fail, and encoding and decoding are exact inverses. OK, so the intermediate string will be nonsense if ISO-8859-1 isn't the correct encoding, but that doesn't actually matter a lot of the time; frequently, you're just grabbing a "blob" of data from one function and passing it to another. The problems will only appear once you start dealing with fallible or non-reversible encodings such as UTF-8 or ISO-2022. If and when that happens, I guess we'll find out how common the problems are. Of course, it's quite possible that the only test cases will be people using UTF-8-only (or even ASCII-only) systems, in which case you won't see any problems.
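The difference can be sketched in a few lines of Haskell. Note that `decodeLatin1`, `encodeLatin1`, and `decodeUtf8` here are toy illustrations, not GHC's actual conversion code, and the UTF-8 decoder deliberately handles only 1- and 2-byte sequences:

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- ISO-8859-1 maps each byte straight to the Unicode code point with
-- the same value, so decoding is total and encoding undoes it exactly.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

encodeLatin1 :: String -> [Word8]
encodeLatin1 = map (fromIntegral . ord)   -- only safe for chars < U+0100

-- A UTF-8 decoder, by contrast, is partial: some byte sequences mean
-- nothing. This toy version handles only 1- and 2-byte sequences and
-- rejects everything else, which is enough to show the failure mode.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just ""
decodeUtf8 (b : bs)
  | b < 0x80 = (chr (fromIntegral b) :) <$> decodeUtf8 bs
  | b >= 0xC2 && b < 0xE0 =
      case bs of
        c : rest
          | c >= 0x80 && c < 0xC0 ->
              let cp = (fromIntegral b - 0xC0) * 64
                     + (fromIntegral c - 0x80)
              in (chr cp :) <$> decodeUtf8 rest
        _ -> Nothing
  | otherwise = Nothing   -- longer sequences omitted in this sketch
```

Every possible byte string round-trips through the Latin-1 pair, whereas a stray byte such as 0x80 makes the UTF-8 decoder fail outright.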
I haven't yet encountered a unix box where the file names were not in the system locale encoding. On all reasonably up-to-date Linux boxes that I've seen recently, they were in UTF-8 (and the system locale agreed).
I've encountered boxes where multiple encodings were used, primarily web and FTP servers shared amongst multiple clients; each client used whichever encoding(s) they felt like. IIRC, the most common non-ASCII encoding was MS-DOS codepage 850 (the clients were mostly using Windows 3.1 at that time). I haven't done sysadmin work for a while, so I don't know the current situation, but I don't think the world has switched to UTF-8 in the meantime. [Most of the non-ASCII filenames I've seen recently have been either ISO-8859-1 or Win-12XX; I haven't seen much UTF-8.]
On both Windows and Mac OS X, filenames are stored in Unicode, so it is always possible to convert them to Unicode. So we can't do Unicode-based I18N because there exist a few unix systems with messed-up file systems?
Declaring such systems to be "messed up" won't make the problems go away. If a design doesn't work in reality, it's the fault of the design, not of reality.
Haskell's Unicode support is a joke because the API designers tried to avoid the issues related to encoding with wishful thinking (i.e. you open a file and you magically get Unicode characters out of it).
OK, that part is purely wishful thinking, but assuming that filenames are text that can be represented in Unicode is wishful thinking that corresponds to 99% of reality. So why can't the remaining 1% of reality be fixed instead?
The issue isn't whether the data can be represented as Unicode text, but whether you can convert it to and from Unicode without problems. To do this, you need to know the encoding, you need to store the encoding so that you can convert the wide string back to a byte string, and the encoding needs to be reversible.
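To illustrate the reversibility point with a toy (`decodeLossy` is a hypothetical decoder, not any real library's API): once a lossy decoder collapses undecodable bytes to the replacement character U+FFFD, two distinct filenames decode to the same String, and no encoder can recover both originals.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical lossy decoder: every non-ASCII byte collapses to the
-- Unicode replacement character U+FFFD. Decoding never fails, but
-- information is destroyed in the process.
decodeLossy :: [Word8] -> String
decodeLossy = map toChar
  where
    toChar b
      | b < 0x80  = chr (fromIntegral b)
      | otherwise = '\xFFFD'

-- Two different on-disk filenames, [0x66, 0xE9] and [0x66, 0xFF],
-- both decode to "f\xFFFD"; whichever bytes an encoder emits for
-- that String, at least one of the original filenames is lost.
```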
--
Glynn Clements