
Glynn Clements
But this seems to be assuming a closed world. I.e. the only files which the program will ever see are those which were created by you, or by others who are compatible with your conventions.
Yes, unless you set the default encoding to Latin1.
Some programs use UTF-8 in filenames no matter what the locale is. For example, the Evolution mail program stores mail folders as files, under names the user entered in a GUI.
This is entirely reasonable for a file which a program creates. If a filename is just a string of bytes, a program can use whatever encoding it wants.
But then they display wrong in any other program.
If it had just treated them as bytes, rather than trying to interpret them as characters, there wouldn't have been any problems.
I suspect it treats some characters in these synthesized newsgroup names, like dots, specially, so it won't work unless it was designed differently.
When I switch my environment to UTF-8, which may happen in a few years, I will convert filenames to UTF-8 and set up mount options to translate vfat filenames to/from UTF-8 instead of to ISO-8859-2.
But what about files which were created by other people, who don't use UTF-8?
All people sharing a filesystem should use the same encoding. BTW, when ftping files between Windows and Unix, a good ftp client should convert filenames so as to keep the same characters rather than the same bytes, so that CP-1250-encoded names don't arrive as garbage in the encoding used on Unix, which is certainly different (ISO-8859-2 or UTF-8), and vice versa.
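For illustration, a minimal sketch of such character-preserving recoding in Haskell (the helper name is mine, and the modern text package's decodeLatin1 stands in for a CP-1250 decoder, which in practice would go through iconv):

    import qualified Data.ByteString as B
    import qualified Data.Text.Encoding as TE

    -- Recode a filename so the characters, not the bytes, survive:
    -- decode in the sender's encoding, re-encode in the receiver's.
    recodeName :: B.ByteString -> B.ByteString
    recodeName = TE.encodeUtf8 . TE.decodeLatin1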
I expect good programs to understand that and display them correctly no matter what technique they are using for the display.
When it comes to display, you have to deal with encoding issues one way or another. But not all programs deal with display.
So you advocate using multiple encodings internally. This is in general more complicated than what I advocate: using only Unicode internally, limiting other encodings to the I/O boundary.
Assuming that everything is UTF-8 allows a lot of potential problems to be ignored.
I don't assume UTF-8 when the locale doesn't say so.
The core OS and network server applications essentially remain encoding-agnostic.
Which is a problem when they generate an email, e.g. to send the non-empty output of a cron job, or to report unauthorized use of sudo. If the data involved is not pure ASCII, it will often be mangled. It's rarely a problem in practice because filenames, command arguments, error messages, user full names etc. are usually pure ASCII. But this is slowly changing.
But, as I keep pointing out, filenames are byte strings, not character strings. You shouldn't be converting them to character strings unless you have to.
Processing data in its original byte encodings makes supporting multiple languages harder. Filenames which are inexpressible as character strings get in the way of clean APIs. When considering only filenames, using bytes would be sufficient, but overall it's more convenient to Unicodize them like other strings.
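As a sketch of the byte-level alternative: the unix package (which postdates this discussion) exposes directory entries as raw bytes, so names that are not valid in any character encoding still pass through untouched:

    import System.Posix.Directory.ByteString
        (openDirStream, readDirStream, closeDirStream)
    import qualified Data.ByteString as B

    -- List directory entries as uninterpreted bytes; readDirStream
    -- returns an empty string when the stream is exhausted.
    listRaw :: B.ByteString -> IO [B.ByteString]
    listRaw dir = do
        ds <- openDirStream dir
        let loop acc = do
                name <- readDirStream ds
                if B.null name
                    then return (reverse acc)
                    else loop (name : acc)
        names <- loop []
        closeDirStream ds
        return names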
1. Actually, each user decides which locale they wish to use. Nothing forces two users of a system to use the same locale.
Locales may be different, but they should use the same encoding when they share files. This applies to file contents too - various formats don't have a fixed encoding and don't specify the encoding explicitly, so these files are assumed to be in the locale encoding.
2. Even if the locale was constant for all users on a system, there's still the (not exactly minor) issue of networking.
Depends on the networking protocols. They might insist that filenames are represented in UTF-8 for example.
Or that every program should pass everything through iconv() (and handle the failures)?
If it uses Unicode as internal string representation, yes (because the OS API on Unix generally uses byte encodings rather than Unicode).
The problem with that is that you need to *know* the source and destination encodings. The program gets to choose one of them, but it may not even know the other one.
If it can't know the encoding, it should process the data as a sequence of bytes, and can output it only to another channel which accepts raw bytes. But usually it's either known or can be assumed to be the locale encoding.
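A byte-transparent filter in Haskell is exactly this short; nothing is decoded, so nothing can be mangled:

    import qualified Data.ByteString as B

    -- Copy input to output byte for byte, whatever the encoding.
    main :: IO ()
    main = B.getContents >>= B.putStr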
The term "mismatch" implies that there have to be at least two things. If they don't match, which one is at fault? If I make a tar file available for you to download, and it contains non-UTF-8 filenames, is that my fault or yours?
Such tarballs are not portable across systems using different encodings. If I tar a subdirectory stored on an ext2 partition, and you untar it on a vfat partition, whose fault is it that files which differ only in case are conflated?
In any case, if a program refuses to deal with a file because it cannot convert the filename to characters, even when it doesn't have to, it's the program which is at fault.
Only if it's a low-level utility, to be used in an unfriendly environment. A Haskell program in my world can do that too. Just set the encoding to Latin1.
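With GHC's post-Haskell98 extensions that trick is a one-liner: hSetEncoding and latin1 come from System.IO, and since Latin-1 maps every byte to exactly one Char, arbitrary bytes round-trip undamaged:

    import System.IO

    -- Read a file of unknown encoding without corrupting it:
    -- each byte becomes one Char and converts back losslessly.
    readUninterpreted :: FilePath -> IO String
    readUninterpreted path = do
        h <- openFile path ReadMode
        hSetEncoding h latin1
        hGetContents h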
My specific point is that the Haskell98 API has a very big problem due to the assumption that the encoding is always known. Existing implementations work around the problem by assuming that the encoding is always ISO-8859-1.
The API is incomplete and needs to be enhanced. Programs written using the current API will be limited to using the locale encoding. Just as readFile is limited to text files because of line endings. Which do you prefer: to provide a non-Haskell98 API for binary files, or to "fix" the current API by forcing programs to handle "\r\n" on Windows and "\n" on Unix manually?
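For comparison, GHC's answer for line endings was the first option, a non-Haskell98 binary API (openBinaryFile, hSetBinaryMode); a sketch of the two modes side by side:

    import System.IO
    import qualified Data.ByteString as B

    -- Text mode: on Windows "\r\n" is translated to "\n",
    -- so this is only suitable for text files.
    readText :: FilePath -> IO String
    readText = readFile

    -- Binary mode (non-Haskell98): bytes pass through untranslated.
    readBinary :: FilePath -> IO B.ByteString
    readBinary path = openBinaryFile path ReadMode >>= B.hGetContents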
If filenames were expressed as bytes in the Haskell program, how would you map them to WinAPI? If you use the current Windows code page, the set of valid characters is limited without a good reason.
Windows filenames are arguably characters rather than bytes. However, if you want to present a common API, you can just use a fixed encoding on Windows (either UTF-8 or UTF-16).
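A sketch of that fixed-encoding route, assuming the text package (the helper name is mine; a real call to CreateFileW would also append the terminating NUL):

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    -- Encode a Haskell filename as UTF-16LE, the representation
    -- the wide-character WinAPI (CreateFileW etc.) expects.
    toWideBytes :: String -> B.ByteString
    toWideBytes = TE.encodeUtf16LE . T.pack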
This encoding would be incompatible with most other texts seen by the program. In particular reading a filename from a file would not work without manual recoding.
Which is a pity. ISO-2022 is brain-damaged because of enormous complexity,
Or, depending upon one's perspective, Unicode is brain-damaged because, for the sake of simplicity, it over-simplifies the situation. The over-simplification is one reason for its lack of adoption in the CJK world.
It's necessary to simplify things in order to make them usable by ordinary programs. People reject overly complicated designs even if they are in some respects more general. ISO-2022 didn't catch on; about the only program I've seen which tries to fully support it is Emacs.
Multi-lingual text consists of distinct sections written in distinct languages with distinct "alphabets". It isn't actually one big chunk in a single global language with a single massive alphabet.
Multi-lingual text is almost context-insensitive. You can copy a part of it into another text, even written in another language, and it will retain its alphabet - this is much harder with stateful ISO-2022. ISO-2022 is wrong not by distinguishing alphabets but by being stateful.
and ISO-8859-x have small repertoires.
Which is one of the reasons why they are likely to persist for longer than UTF-8 "true believers" might like.
My I/O design doesn't force UTF-8; it works with ISO-8859-x as well.

--
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/