
Ben Rudiak-Gould wrote:
> The point is that different things are natively handled in different
> formats under different OSes, e.g.
>
>                   Posix    NT       Win9x
>   pathnames       bytes    UTF-16   locale
>   command line    bytes    UTF-16   locale
>   file contents   bytes    bytes    bytes
>   pipes/sockets   bytes    bytes    bytes
Add to that:

                  Mac OS X
  pathnames       UTF-8
  command line    UTF-8

It's POSIX (or mostly POSIX), but the encoding for path names is always guaranteed to be UTF-8. For the default file system type, HFS+, they are actually stored on disk as UTF-16. Arbitrary strings of bytes are not allowed.

For POSIX systems, I'd also like to observe the following:

1) Widely used languages and libraries like Java and GTK+ assume that all file names and command lines are encoded in the system locale, or at least that they can all be converted to Unicode strings.

2) Command lines are usually entered as TEXT on a terminal and are therefore encoded in whatever encoding the terminal uses.

3) None of the recent Linux distributions I have installed did anything but set up a UTF-8 based system.

So I think we should try hard to avoid introducing any additional complexity, like filename ADTs used for program arguments, to deal with the small minority of systems where file names cannot be converted to Unicode.

Maybe it's possible to use some user-defined Unicode code points to achieve a lossless conversion of arbitrary byte strings to Unicode? I mean, byte strings that are valid in the system encoding would be transcoded correctly, and invalid bytes would be mapped to some extra code points so that they can be converted back if necessary.

Cheers,
Wolfgang
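P.S. For what it's worth, Python later standardized essentially this escaping scheme (PEP 383): a sketch of how the round trip works there, using the real "surrogateescape" error handler. The byte values in the example are my own illustration; the point is that invalid bytes map to reserved code points U+DC80..U+DCFF and can be mapped back losslessly.

```python
# Lossless byte <-> string round trip via the "surrogateescape"
# error handler (PEP 383). Bytes valid in the chosen encoding decode
# normally; each invalid byte is escaped to a lone surrogate in
# U+DC80..U+DCFF, so re-encoding reproduces the original bytes.
raw = b"valid-\xff-invalid"            # 0xFF is never valid UTF-8

text = raw.decode("utf-8", "surrogateescape")
assert text == "valid-\udcff-invalid"  # invalid byte -> U+DCFF

back = text.encode("utf-8", "surrogateescape")
assert back == raw                     # exact bytes recovered
```

A plain strict decode of the same bytes would raise an error, and a "replace" decode would lose information, so this is the only one of the standard handlers that gives the reversibility the scheme needs.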