
On 08.02 14:03, Wolfgang Thaller wrote:
> 1) Widely used languages and libraries like Java and GTK+ assume that all file names and command lines are encoded in the system locale, or at least that they can all be converted to unicode strings.
This causes much annoyance: users have to define various environment variables just to get these programs to open a file.
> 2) Command lines are usually entered as TEXT on a terminal and are therefore encoded in whatever encoding the terminal uses.
Actually, I like the ability to delete and copy files even when they happen to have names in weird Chinese encodings. Users just use wildcards or tab completion to get around file names that are hard to type.
> 3) None of the recent Linux distributions I have installed did anything but set up a UTF-8 based system.
A great many people who need to work in their own language still use other encodings, and they will continue to do so for the foreseeable future.
> So I think we should try hard to avoid introducing any additional complexity, like filename ADTs used for program arguments, to deal with the small minority of systems where file names cannot be converted to unicode. Maybe it's possible to use some user-defined unicode code points to achieve a lossless conversion of arbitrary byte strings to unicode? I mean, byte strings that are valid in the system encoding would get transcoded correctly, and invalid bytes would get mapped to some extra code points so that they can be converted back if necessary.
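
A minimal sketch of what such a lossless conversion might look like, assuming purely for illustration that the system encoding is plain ASCII and reserving the (arbitrarily chosen) private-use range U+EF00..U+EFFF for undecodable bytes; the function names are made up:

    import Data.Char (chr, ord)
    import Data.Word (Word8)

    -- Arbitrarily chosen private-use range reserved for bytes
    -- that the system decoder cannot handle.
    escapeBase :: Int
    escapeBase = 0xEF00

    -- Pretend, for illustration only, that the system encoding
    -- is plain ASCII: bytes below 0x80 decode directly, and
    -- every other byte is escaped into the private-use range.
    decodeFileName :: [Word8] -> String
    decodeFileName = map decode1
      where
        decode1 b
          | b < 0x80  = chr (fromIntegral b)
          | otherwise = chr (escapeBase + fromIntegral b)

    -- The inverse: escaped code points turn back into the raw
    -- bytes they stand for; everything else is encoded normally
    -- (here: as ASCII again).
    encodeFileName :: String -> [Word8]
    encodeFileName = map encode1
      where
        encode1 c
          | i >= escapeBase && i < escapeBase + 0x100 = fromIntegral (i - escapeBase)
          | otherwise                                 = fromIntegral i
          where i = ord c

Under this scheme, encodeFileName (decodeFileName bs) == bs holds for every byte string bs, so nothing is lost. A real implementation would run the actual system decoder and fall back to escaping only for the bytes it rejects.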
What would happen if you tried to output such a String: the raw bytes, or the escaped versions? Also, this would mean that Haskell unicode != unicode (isn't Java's broken handling enough?). - Einar Karttunen
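
To make the objection concrete, here is what the sketch above would do (same made-up names):

    -- A decoded name containing an undecodable byte is no
    -- longer ordinary unicode text: Show exposes the escape.
    demo :: IO ()
    demo = do
      let name = decodeFileName [0x66, 0x6F, 0x6F, 0xFF]   -- "foo" plus a bad byte
      print name                   -- "foo\61439" (U+EFFF), not the raw 0xFF
      print (encodeFileName name)  -- [102,111,111,255]: the bytes round-trip

Whether putStr then emits the raw byte or some encoding of U+EFFF depends entirely on which encoder the output Handle applies to the String, which is exactly the ambiguity Einar is pointing at.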