
Marcin 'Qrczak' Kowalczyk wrote:
When I switch my environment to UTF-8, which may happen in a few years, I will convert filenames to UTF-8 and set up mount options to translate vfat filenames to/from UTF-8 instead of to ISO-8859-2.
But what about files which were created by other people who don't use UTF-8?
All people sharing a filesystem should use the same encoding.
Again, this is just "hand waving" the issues away.
BTW, when ftping files between Windows and Unix, a good ftp client should convert filenames to keep the same characters rather than bytes, so CP-1250-encoded names don't come out as garbage in the encoding used on Unix, which is definitely different (ISO-8859-2 or UTF-8), or vice versa.
Which is fine if the FTP client can figure out which encoding is used on the remote end. In practice, you have to tell it, i.e. have a list of which servers (or even which directories on which servers) use which encoding.
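To make that concrete, the sort of table such a client ends up carrying looks something like this (a sketch in Haskell; the host names are hypothetical):

    -- Hypothetical per-server filename-encoding table. The FTP protocol
    -- gives the client no way to discover this, so it has to be configured.
    serverEncoding :: [(String, String)]
    serverEncoding =
      [ ("ftp.example.pl",  "ISO-8859-2")
      , ("ftp.example.jp",  "ISO-2022-JP")
      , ("ftp.example.com", "UTF-8")
      ]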
I expect good programs to understand that and display them correctly no matter what technique they are using for the display.
When it comes to display, you have to deal with encoding issues one way or another. But not all programs deal with display.
So you advocate using multiple encodings internally. This is in general more complicated than what I advocate: using only Unicode internally, limiting other encodings to the I/O boundary.
How do you draw that conclusion from what I wrote here? There are cases where it's advantageous to use multiple encodings, but I wasn't suggesting that in the above. What I'm suggesting in the above is to sidestep the encoding issue by keeping filenames as byte strings wherever possible.
The core OS and network server applications essentially remain encoding-agnostic.
Which is a problem when they generate an email, e.g. to send the non-empty output of a cron job, or to report unauthorized use of sudo. If the data involved is not pure ASCII, it will often be mangled.
It only gets mangled if you feed it to a program which is making assumptions about the encoding. Non-MIME messages neither specify nor imply an encoding. MIME messages can use either "text/plain; charset=x-unknown" or application/octet-stream if they don't understand the encoding. And program-generated email notifications frequently include text with no known encoding (i.e. binary data). Or are you going to demand that anyone who tries to hack into your system only sends it UTF-8 data so that the alert messages are displayed correctly in your mail program?
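To be concrete about the MIME case: a notification mailer that doesn't know the encoding of the data it is quoting can label it honestly along these lines (a sketch; the name unknownCharsetHeaders is just illustrative):

    -- MIME headers for a text body whose encoding is unknown, built as a
    -- plain list of header lines.
    unknownCharsetHeaders :: [String]
    unknownCharsetHeaders =
      [ "MIME-Version: 1.0"
      , "Content-Type: text/plain; charset=x-unknown"
      , "Content-Transfer-Encoding: 8bit"
      ]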
It's rarely a problem in practice because filenames, command arguments, error messages, user full names etc. are usually pure ASCII. But this is slowly changing.
To the extent that non-ASCII filenames are used, I've encountered far more filenames in both Latin1 and ISO-2022 than in UTF-8. Japanese FTP sites typically use ISO-2022 for everything; even ASCII names may have "\e(B" prepended to them.
But, as I keep pointing out, filenames are byte strings, not character strings. You shouldn't be converting them to character strings unless you have to.
Processing data in their original byte encodings makes supporting multiple languages harder. Filenames which are inexpressible as character strings get in the way of clean APIs. When considering only filenames, using bytes would be sufficient, but overall it's more convenient to Unicodize them like other strings.
It also harms reliability. Depending upon the encoding, two distinct byte strings may have the same Unicode representation. E.g. if you are interfacing to a server which uses ISO-2022 for filenames, you have to get the escapes correct even when they are no-ops in terms of the string representation. If you obtain a directory listing, receive the filename "\e(Bfoo.txt", and convert it to Unicode, you get "foo.txt". If you then convert it back without the leading escape, the server is going to say "file not found".
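A toy sketch of just the "\e(B" case (this is not a real ISO-2022 codec) shows how the round trip loses the escape:

    -- "\ESC(B" designates ASCII in ISO-2022 and is a no-op at the character
    -- level, so decoding drops it; re-encoding plain ASCII emits no escapes,
    -- so the byte string the server sees has changed.
    decode :: String -> String
    decode ('\ESC':'(':'B':rest) = decode rest
    decode (c:rest)              = c : decode rest
    decode []                    = []

    encode :: String -> String
    encode = id   -- plain ASCII needs no designation escapes

    main :: IO ()
    main = print (encode (decode "\ESC(Bfoo.txt") == "\ESC(Bfoo.txt")   -- False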
The term "mismatch" implies that there have to be at least two things. If they don't match, which one is at fault? If I make a tar file available for you to download, and it contains non-UTF-8 filenames, is that my fault or yours?
Such tarballs are not portable across systems using different encodings.
Well, programs which treat filenames as byte strings to be read from argv[] and passed directly to open() won't have any problems with this. It's only a problem if you make it a problem.
If I tar a subdirectory stored on an ext2 partition, and you untar it on a vfat partition, whose fault is it that files which differ only in case are conflated?
Arguably, it's Microsoft's fault for not considering the problems caused by multiple encodings when they decided that filenames were going to be case-folded.
In any case, if a program refuses to deal with a file because it cannot convert the filename to characters, even when it doesn't have to, it's the program which is at fault.
Only if it's a low-level utility, to be used in an unfriendly environment.
A Haskell program in my world can do that too. Just set the encoding to Latin1.
But programs should handle this by default, IMHO. Filenames are, for the most part, just "tokens" to be passed around. You get a value from argv[], and pass it to open() or whatever. It doesn't need to have any meaning.
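In Haskell terms, the pattern I mean is simply this (a sketch; under the ISO-8859-1 convention the name survives byte-for-byte):

    import System.Environment (getArgs)
    import System.IO (IOMode(ReadMode), openFile, hClose)

    -- The filename is an opaque token: taken from the command line and
    -- handed straight to openFile, with no attempt to interpret it.
    main :: IO ()
    main = do
      [name] <- getArgs
      h <- openFile name ReadMode
      hClose h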
My specific point is that the Haskell98 API has a very big problem due to the assumption that the encoding is always known. Existing implementations work around the problem by assuming that the encoding is always ISO-8859-1.
The API is incomplete and needs to be enhanced. Programs written using the current API will be limited to using the locale encoding.
That just adds unnecessary failure modes.
Just as readFile is limited to text files because of line endings. What do you prefer: to provide a non-Haskell98 API for binary files, or to "fix" the current API by forcing programs to handle "\r\n" on Windows and "\n" on Unix manually?
That's a harder case. There is a good reason for auto-converting EOL, as most programs actually process file contents. Most programs don't "process" filenames; they just pass them around.
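For reference, the contrast looks like this in code; openBinaryFile comes from the hierarchical System.IO, outside Haskell98:

    import System.IO (IOMode(ReadMode), openBinaryFile, hGetContents)

    -- Haskell98 text I/O: "\r\n" is converted to "\n" on Windows.
    readText :: FilePath -> IO String
    readText = readFile

    -- Non-Haskell98 binary I/O: the bytes come back as stored in the file.
    readBinary :: FilePath -> IO String
    readBinary path = openBinaryFile path ReadMode >>= hGetContents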
If filenames were expressed as bytes in the Haskell program, how would you map them to WinAPI? If you use the current Windows code page, the set of valid characters is limited without a good reason.
Windows filenames are arguably characters rather than bytes. However, if you want to present a common API, you can just use a fixed encoding on Windows (either UTF-8 or UTF-16).
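By "fixed encoding" I mean one lossless mapping from characters to bytes that doesn't depend on the current code page, e.g. the standard UTF-8 encoding scheme, sketched below (the name encodeUtf8 is just illustrative, and no validation is done):

    import Data.Bits ((.&.), (.|.), shiftR)
    import Data.Char (ord)
    import Data.Word (Word8)

    -- Standard UTF-8 encoding of a Haskell string, independent of the
    -- Windows code page. A sketch: surrogates and other oddities are not
    -- checked for.
    encodeUtf8 :: String -> [Word8]
    encodeUtf8 = concatMap enc
      where
        enc c
          | n < 0x80    = [fromIntegral n]
          | n < 0x800   = [0xC0 .|. top 6,  cont 0]
          | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
          | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
          where
            n      = ord c
            top s  = fromIntegral (n `shiftR` s)
            cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)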
This encoding would be incompatible with most other texts seen by the program. In particular reading a filename from a file would not work without manual recoding.
We already have that problem; you can't read non-Latin1 strings from files. In some regards, the problem is worse on Windows, because of the prevalence of non-ASCII text (Windows 12xx and "smart" quotes), so using UTF-8 for file contents on Windows is even harder.
Which is a pity. ISO-2022 is brain-damaged because of enormous complexity,
Or, depending upon one's perspective, Unicode is brain-damaged because, for the sake of simplicity, it over-simplifies the situation. The over-simplification is one reason for its lack of adoption in the CJK world.
It's necessary to simplify things in order to make them usable by ordinary programs. People reject overly complicated designs even if they are in some respects more general.
ISO-2022 didn't catch on; about the only program I've seen which tries to fully support it is Emacs.
And X. Compound text is ISO-2022. For commercial X software, Motif (which uses compound text) is still the most widely-used toolkit. But, then, the fact that you haven't seen many ISO-2022 programs is probably because you're used to using programs developed by and for Westerners. In the far East, ISO-2022 is by far the most popular encoding. There, you could realistically ignore all other encodings. BTW, that's why Emacs (and XEmacs) support ISO-2022 much better than they do UTF-8. Because MuLE was written by Japanese developers.
Multi-lingual text consists of distinct sections written in distinct languages with distinct "alphabets". It isn't actually one big chunk in a single global language with a single massive alphabet.
Multi-lingual text is almost context-insensitive. You can copy a part of it into another text, even written in another language, and it will retain its alphabet - this is much harder with stateful ISO-2022.
ISO-2022 is wrong not by distinguishing alphabets but by being stateful.
Sure, the statefulness adds complexity (which is one of the reasons so many people prefer to work with UTF-8), but it has the benefit of providing distinct markers to indicate where the character set is being switched (that isn't a compelling advantage; you could reconstruct the markers if you could uniquely determine the character set for each character). OTOH, Unicode is wrong by not distinguishing character sets. This is a significant reason why it hasn't been adopted in the far East (specifically, Han unification).
and ISO-8859-x have small repertoires.
Which is one of the reasons why they are likely to persist for longer than UTF-8 "true believers" might like.
My I/O design doesn't force UTF-8, it works with ISO-8859-x as well.
But I was specifically addressing Unicode versus multiple encodings internally. The size of the Unicode "alphabet" effectively prohibits using codepoints as indices.
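To put a number on it: a dense table indexed directly by code point is fine for an 8-bit repertoire but not for all of Unicode. A sketch:

    import Data.Array (Array, listArray)

    -- 256 entries: perfectly reasonable for Latin1.
    latin1Table :: Array Int Int
    latin1Table = listArray (0x00, 0xFF) (repeat 0)

    -- The same idea indexed by Unicode code points would need 1,114,112
    -- entries, so in practice you reach for a sparse structure instead:
    -- unicodeTable = listArray (0x00, 0x10FFFF) (repeat 0)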
--
Glynn Clements