Re: [Haskell-cafe] invalid character encoding

If you try to pretend that I18N comes down to shoe-horning everything into Unicode, you will turn the language into a joke.
How common will those problems you are describing be by the time this has been implemented? How common are they even now? I haven't yet encountered a unix box where the file names were not in the system locale encoding. On all reasonably up-to-date Linux boxes that I've seen recently, they were in UTF-8 (and the system locale agreed). On both Windows and Mac OS X, filenames are stored in Unicode, so it is always possible to convert them to Unicode. So we can't do Unicode-based I18N because there exist a few unix systems with messed-up file systems?
Haskell's Unicode support is a joke because the API designers tried to avoid the issues related to encoding with wishful thinking (i.e. you open a file and you magically get Unicode characters out of it).
OK, that part is purely wishful thinking, but assuming that filenames are text that can be represented in Unicode is wishful thinking that corresponds to 99% of reality. So why can't the remaining 1% of reality be fixed instead?

Cheers,
Wolfgang

Wolfgang Thaller wrote:
If you try to pretend that I18N comes down to shoe-horning everything into Unicode, you will turn the language into a joke.
How common will those problems you are describing be by the time this has been implemented? How common are they even now?
Right now, GHC assumes ISO-8859-1 whenever it has to automatically convert between String and CString. Conversions to and from ISO-8859-1 cannot fail, and encoding and decoding are exact inverses. OK, so the intermediate string will be nonsense if ISO-8859-1 isn't the correct encoding, but that doesn't actually matter a lot of the time; frequently, you're just grabbing a "blob" of data from one function and passing it to another. The problems will only appear once you start dealing with fallible or non-reversible encodings such as UTF-8 or ISO-2022. If and when that happens, I guess we'll find out how common the problems are. Of course, it's quite possible that the only test cases will be people using UTF-8-only (or even ASCII-only) systems, in which case you won't see any problems.
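For illustration, here is roughly what that assumption amounts to (a hedged sketch, not GHC's actual code): ISO-8859-1 maps each byte to the Unicode code point of the same value, so decoding is total and encoding is its exact inverse.

import Data.Char (chr, ord)
import Data.Word (Word8)

-- Decoding never fails: every byte 0..255 is a valid Latin-1 character.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- The exact inverse, for strings whose characters all lie in 0..255.
encodeLatin1 :: String -> [Word8]
encodeLatin1 = map (fromIntegral . ord)

Since encodeLatin1 (decodeLatin1 bs) == bs for any byte string bs, blobs pass through unharmed even when the "text" in between is nonsense; e.g. the UTF-8 bytes [0xC3, 0xA9] for 'é' decode this way to the two-character string "Ã©".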
I haven't yet encountered a unix box where the file names were not in the system locale encoding. On all reasonably up-to-date Linux boxes that I've seen recently, they were in UTF-8 (and the system locale agreed).
I've encountered boxes where multiple encodings were used; primarily web and FTP servers which were shared amongst multiple clients. Each client used whichever encoding(s) they felt like. IIRC, the most common non-ASCII encoding was MS-DOS codepage 850 (the clients were mostly using Windows 3.1 at that time). I haven't done sysadmin work for a while, so I don't know the current situation, but I don't think that the world has switched to UTF-8 in the meantime. [Most of the non-ASCII filenames which I've seen recently have been either ISO-8859-1 or Win-12XX; I haven't seen much UTF-8.]
On both Windows and Mac OS X, filenames are stored in Unicode, so it is always possible to convert them to Unicode. So we can't do Unicode-based I18N because there exist a few unix systems with messed-up file systems?
Declaring such systems to be "messed up" won't make the problems go away. If a design doesn't work in reality, it's the fault of the design, not of reality.
Haskell's Unicode support is a joke because the API designers tried to avoid the issues related to encoding with wishful thinking (i.e. you open a file and you magically get Unicode characters out of it).
OK, that part is purely wishful thinking, but assuming that filenames are text that can be represented in Unicode is wishful thinking that corresponds to 99% of reality. So why can't the remaining 1% of reality be fixed instead?
The issue isn't whether the data can be represented as Unicode text,
but whether you can convert it to and from Unicode without problems.
To do this, you need to know the encoding, you need to store the
encoding so that you can convert the wide string back to a byte
string, and the encoding needs to be reversible.
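One hedged way to meet all three requirements at once (the type below is hypothetical, not an existing library) is to never discard the bytes in the first place, and treat any decoded text as a derived, possibly lossy view:

import Data.Word (Word8)

-- Hypothetical: converting back to a byte string is just osBytes,
-- so round-tripping works even when decoding failed or was lossy.
data OsString = OsString
    { osBytes :: [Word8]       -- exactly what the OS supplied
    , osText  :: Maybe String  -- Nothing if the bytes don't decode
    }

With that representation, the encoding only matters for display, not for handing the name back to the OS.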
--
Glynn Clements

Glynn Clements wrote:
OK, so the intermediate string will be nonsense if ISO-8859-1 isn't the correct encoding, but that doesn't actually matter a lot of the time; frequently, you're just grabbing a "blob" of data from one function and passing it to another.
Yes. Of course, this also means that Strings representing non-ASCII filenames will *always* be nonsense on Mac OS X and other UTF-8-based platforms.
The problems will only appear once you start dealing with fallible or non-reversible encodings such as UTF-8 or ISO-2022.
In what way is ISO-2022 non-reversible? Is it possible that an ISO-2022 file name that is converted to Unicode cannot be converted back any more (assuming you know for sure that it was ISO-2022 in the first place)?
Of course, it's quite possible that the only test cases will be people using UTF-8-only (or even ASCII-only) systems, in which case you won't see any problems.
I'm kind of hoping that we can just ignore a problem that is so rare that a large and well-known project like GTK2 can get away with ignoring it. Also, IIRC, Java strings are supposed to be Unicode, too - how do they deal with the problem?
So we can't do Unicode-based I18N because there exist a few unix systems with messed-up file systems?
Declaring such systems to be "messed up" won't make the problems go away. If a design doesn't work in reality, it's the fault of the design, not of reality.
In general, yes. But we're not talking about all of reality here, we're talking about one small part of reality - the question is, can the part of reality where the design doesn't work be ignored?

For example, as soon as we use any kind of path names in our APIs, we are ignoring reality on good old "Classic" Mac OS (may it rest in peace). Path names don't always uniquely denote a file there (although they do most of the time). People writing cross-platform software have been ignoring this fact for a long time now.

I think that if we wait long enough, the filename encoding problems will become irrelevant and we will live in an ideal world where Unicode actually works. Maybe next year, maybe only in ten years. And while we are arguing about how far we are from that ideal world, we should think about alternatives.

The current hack is really just a hack, and I don't want to see this hack become the new accepted standard. Do we have other alternatives? Preferably something that provides other advantages over a Unicode String than just making things work on systems that many users never encounter, otherwise almost no one will bother to use it. So maybe we should start looking for _other_ reasons to represent file names and paths by an abstract datatype or something?

Cheers,
Wolfgang

Wolfgang Thaller wrote:
In what way is ISO-2022 non-reversible? Is it possible that an ISO-2022 file name that is converted to Unicode cannot be converted back any more (assuming you know for sure that it was ISO-2022 in the first place)?
I am no expert on ISO-2022 so the following may contain errors, please correct if it is wrong.

ISO-2022 -> Unicode is always possible. Unicode -> ISO-2022 should also always be possible, but it is a relation, not a function. This means there are infinitely many ways of encoding a particular Unicode string in ISO-2022. ISO-2022 works by providing escape sequences to switch between different character sets. One can freely use these escapes in almost any way you wish. Also, ISO-2022 distinguishes between the same character in Japanese/Chinese/Korean - a distinction which Unicode does not make. See here for more info on the topic: http://www.ecma-international.org/publications/files/ecma-st/ECMA-035.pdf

Also, trusting the system locale for everything is problematic and makes things quite unbearable for I18N. E.g. on my desktop 95% of things run with ISO-8859-1, 3% of things use UTF-8 and a few apps use EUC-JP...

Using filenames as opaque blobs causes the least problems. If the program wishes to display them in a graphical environment then they have to be converted to a string, but very many apps never display the filenames...

- Einar Karttunen

Einar Karttunen wrote:
In what way is ISO-2022 non-reversible? Is it possible that an ISO-2022 file name that is converted to Unicode cannot be converted back any more (assuming you know for sure that it was ISO-2022 in the first place)?
I am no expert on ISO-2022 so the following may contain errors, please correct if it is wrong.
ISO-2022 -> Unicode is always possible. Unicode -> ISO-2022 should also always be possible, but it is a relation, not a function. This means there are infinitely many ways of encoding a particular Unicode string in ISO-2022.
ISO-2022 works by providing escape sequences to switch between different character sets. One can freely use these escapes in almost any way you wish.
Exactly.
Moreover, while there are an infinite number of equivalent
representations in theory (you can add as many redundant switching
sequences as you wish), there are multiple "plausible" equivalent
representations in practice.
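A concrete illustration, assuming ISO-2022-JP (where ESC ( B designates ASCII into G0, and ASCII is already the initial state, so a leading designation is redundant but perfectly legal):

import Data.Word (Word8)

-- Both byte strings decode to the same three characters, "ABC".
plain, redundant :: [Word8]
plain     = [0x41, 0x42, 0x43]                    -- "ABC"
redundant = [0x1B, 0x28, 0x42, 0x41, 0x42, 0x43]  -- ESC ( B, then "ABC"

A decode-then-re-encode round trip will typically normalise redundant to plain, so the original byte-for-byte filename is lost.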
--
Glynn Clements

Wolfgang Thaller wrote:
Of course, it's quite possible that the only test cases will be people using UTF-8-only (or even ASCII-only) systems, in which case you won't see any problems.
I'm kind of hoping that we can just ignore a problem that is so rare that a large and well-known project like GTK2 can get away with ignoring it.
1. The filename issues in GTK-2 are likely to be a major problem in CJK locales, where filenames which don't match the locale (which is seldom UTF-8) are common.

2. GTK's filename handling only really applies to file selector dialogs. Most other uses of filenames in a GTK-based application don't involve GTK; they use the OS API functions, which just deal with byte strings.

3. GTK is a GUI library. Most of the text which it deals with is going to be rendered, so it *has* to be interpreted as characters. Treating it as blobs of data won't work.

IOW, on the question of whether or not to interpret byte strings as character strings, GTK is at the far end of the scale.
Also, IIRC, Java strings are supposed to be Unicode, too - how do they deal with the problem?
Files are represented by instances of the File class:

http://java.sun.com/j2se/1.5.0/docs/api/java/io/File.html

    An abstract representation of file and directory pathnames.

You can construct Files from Strings, and convert Files to Strings. The File class includes two sets of directory enumeration methods: list() returns an array of Strings, while listFiles() returns an array of Files.

The documentation for the File class doesn't mention encoding issues at all. However, with that interface, it would be possible to enumerate and open filenames which cannot be decoded.
So we can't do Unicode-based I18N because there exist a few unix systems with messed-up file systems?
Declaring such systems to be "messed up" won't make the problems go away. If a design doesn't work in reality, it's the fault of the design, not of reality.
In general, yes. But we're not talking about all of reality here, we're talking about one small part of reality - the question is, can the part of reality where the design doesn't work be ignored?
Sure, you *can* ignore it; K&R C ignored everything other than ASCII. If you limit yourself to locales which use the Roman alphabet (i.e. ISO-8859-N for N=1/2/3/4/9/15), you can get away with a lot. Most such users avoid encoding issues altogether by dropping the accents and sticking to ASCII, at least when dealing with files which might leave their system.

To get a better idea, you would need to consult users whose language doesn't use the Roman alphabet, e.g. CJK or Cyrillic. Unfortunately, you don't usually find too many of them on lists such as this. I'm only familiar with one OSS project which has a sizeable CJK user base, and that's XEmacs (whose I18N revolves around ISO-2022, and most of whose documentation is in Japanese). Even there, there are separate mailing lists for English and Japanese, and the two seldom communicate.
I think that if we wait long enough, the filename encoding problems will become irrelevant and we will live in an ideal world where Unicode actually works. Maybe next year, maybe only in ten years.
Maybe not even then. If Unicode really solved encoding problems, you'd
expect the CJK world to be the first adopters, but they're actually
the least eager; you are more likely to find UTF-8 in an
English-language HTML page or email message than a Japanese one.
--
Glynn Clements

Also, IIRC, Java strings are supposed to be Unicode, too - how do they deal with the problem?
Files are represented by instances of the File class: [...] The documentation for the File class doesn't mention encoding issues at all.
... which led me to conclude that they don't deal with the problem properly.
I think that if we wait long enough, the filename encoding problems will become irrelevant and we will live in an ideal world where Unicode actually works. Maybe next year, maybe only in ten years.
Maybe not even then. If Unicode really solved encoding problems, you'd expect the CJK world to be the first adopters, but they're actually the least eager; you are more likely to find UTF-8 in an English-language HTML page or email message than a Japanese one.
Hmm, that's possibly because English-language users can get away with just marking their ASCII files as UTF-8. But I'm not arguing about files or HTML pages here, I'm only concerned with filenames.

I prefer Unicode nowadays because I was born within a hundred kilometers of the "border" between ISO-8859-1 and ISO-8859-2. I need 8859-1 for German-language texts, but as soon as I write about where I went for vacation, I need a few 8859-2 characters. So 8-bit encodings didn't cut it, and nobody ever tried to sell ISO-2022 to me, so Unicode was the only alternative.

So you've now convinced me that there is a considerable number of computers using ISO-2022, where there's more than one way to encode the same text (how do people use this from the command line??). There are also multi-user systems where the users don't agree on a single encoding. I still reserve the right to call those systems messed-up, but that's just my personal opinion, and "reality" couldn't care less about what I think.

So, as I don't want to stick with the status quo forever (lists of bytes that pretend to be lists of Unicode chars, even on platforms where Unicode is used anyway), how about we get to work - what do we want? I don't think we want a type class here; a plain (abstract) data type will do:
data File
Obviously, we'll need conversion from and to C strings. On Mac OS X, they'd be guaranteed to be in UTF-8.
withFilePathCString :: File -> (CString -> IO a) -> IO a
fileFromCString :: CString -> IO File
We will need functions for converting to and from Unicode strings. I'm pretty sure that we want to keep those functions pure, otherwise they'll be very annoying to use.
fileFromPath :: String -> File
Any impure operations that might be needed to decide how to encode the file name will have to be delayed until the File is actually used.
fileToPath :: File -> String
Same here: any impure operation necessary to convert the File to a Unicode string needs to be done when the File is created.

What about failure? If you go from String to File, errors should be reported when you actually access the file. At an earlier time, you can't know whether the file name is valid (e.g. if you mount a "classic" HFS volume on Mac OS X, you can only create files there whose names can be represented in the volume's file name encoding - but you only find that out once you try to create a file).

For going from File to String, I'm not so sure, but I would be very annoyed if I had to deal with a Maybe String return type on platforms where it will always succeed. Maybe there should be separate functions for different purposes - i.e. for display, you'd use a File -> String function that will silently use '?'s when things can't be decoded, but in other situations you might use a File -> Maybe String function and check for Nothing. If people want to implement more sophisticated ways of decoding file names than the library can provide, they'd get the C string and do the decoding themselves.

Of course, there should also be lots of other useful functions that make it more or less unnecessary to deal with path names directly in most cases.

Thoughts?

Cheers,
Wolfgang
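To make the proposal concrete, here is a minimal skeleton of the interface sketched above (the bodies are stubs, and fileToPathMaybe is a hypothetical name for the File -> Maybe String variant discussed; none of this is an existing library):

module AbstractFile
    ( File, withFilePathCString, fileFromCString
    , fileFromPath, fileToPath, fileToPathMaybe
    ) where

import Data.Word (Word8)
import Foreign.C.String (CString)

-- Abstract; one plausible representation keeps the raw bytes as
-- received from the OS, so handing them back can never fail.
newtype File = File [Word8]

-- Marshal to/from the OS encoding (guaranteed UTF-8 on Mac OS X).
withFilePathCString :: File -> (CString -> IO a) -> IO a
withFilePathCString = undefined  -- stub

fileFromCString :: CString -> IO File
fileFromCString = undefined      -- stub

-- Pure; any encoding work is deferred until the File is used.
fileFromPath :: String -> File
fileFromPath = undefined         -- stub

-- Total, for display: undecodable bytes come out as '?'.
fileToPath :: File -> String
fileToPath = undefined           -- stub

-- Partial: Nothing when the name cannot be decoded.
fileToPathMaybe :: File -> Maybe String
fileToPathMaybe = undefined      -- stub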

Glynn Clements wrote:
To get a better idea, you would need to consult users whose language doesn't use the Roman alphabet, e.g. CJK or Cyrillic. Unfortunately, you don't usually find too many of them on lists such as this.
In Russia, we still have multiple one-byte encodings for Cyrillic: KOI-8 (Unix), CP1251 (Windows), and the increasingly obsolete CP866 (MS-DOS, OS/2).

Regarding filenames, I am sure Windows stores them in Unicode regardless of locale (I tried various chcp numbers in a console window, printing a directory containing filenames in Russian and in German together, and it showed the "non-characters" as question marks when a locale-based codepage was set, and showed everything with chcp 65001, which is Unicode). AFAIK Unix users do not create files named in Russian very often, while Windows users do this frequently.

Dimitry Golubovsky
Middletown, CT

Wolfgang Thaller wrote:
Also, IIRC, Java strings are supposed to be Unicode, too - how do they deal with the problem?
Java (Sun)
----------

Filenames are assumed to be in the locale encoding.

a) Interpreting. Bytes which cannot be converted are replaced by U+FFFD.

b) Creating. Characters which cannot be converted are replaced by "?".

Command line arguments and standard I/O are treated in the same way.

Java (GNU)
----------

Filenames are assumed to be in Java-modified UTF-8.

a) Interpreting. If a filename cannot be converted, a directory listing contains a null instead of a string object.

b) Creating. All Java characters are representable in Java-modified UTF-8. Obviously not all potential filenames can be represented.

Command line arguments are interpreted according to the locale. Bytes which cannot be converted are skipped.

Standard I/O works in ISO-8859-1 by default. Obviously all input is accepted. On output, characters above U+00FF are replaced by "?".

C# (mono)
---------

Filenames use the list of encodings from the MONO_EXTERNAL_ENCODINGS environment variable, with UTF-8 implicitly added at the end. These encodings are tried in order.

a) Interpreting. If a filename cannot be converted, it's skipped in a directory listing.

The documentation says that if a filename, a command line argument etc. looks like valid UTF-8, it is treated as such first, and MONO_EXTERNAL_ENCODINGS is consulted only in the remaining cases. The reality seems not to match this (mono-1.0.5).

b) Creating. If UTF-8 is used, U+0000 throws an exception (System.ArgumentException: Path contains invalid chars), paired surrogates are treated correctly, and an isolated surrogate causes an internal error:

** ERROR **: file strenc.c: line 161 (mono_unicode_to_external): assertion failed: (utf8!=NULL)
aborting...

Command line arguments are treated in the same way, except that if an argument cannot be converted, the program dies at the start:

[Invalid UTF-8]
Cannot determine the text encoding for argument 1 (xxx\xb1\xe6\xea).
Please add the correct encoding to MONO_EXTERNAL_ENCODINGS and try again.

Console.WriteLine emits UTF-8. Paired surrogates are treated correctly, unpaired surrogates are converted to pseudo-UTF-8.

Console.ReadLine interprets text as UTF-8. Bytes which cannot be converted are skipped.

--
   __("<     Marcin Kowalczyk
   \__/      qrczak@knm.org.pl
    ^^       http://qrnik.knm.org.pl/~qrczak/