
Ben Rudiak-Gould wrote:
> The point is that different things are natively handled in different
> formats under different OSes, e.g.
>
>                   Posix    NT       Win9x
>   pathnames       bytes    UTF-16   locale
>   command line    bytes    UTF-16   locale
>   file contents   bytes    bytes    bytes
>   pipes/sockets   bytes    bytes    bytes
Add to that:

                  Mac OS X
  pathnames       UTF-8
  command line    UTF-8

It's POSIX (or mostly POSIX), but the encoding for path names is always guaranteed to be UTF-8. For the default file system type, HFS+, they are actually stored on disk as UTF-16. Arbitrary strings of bytes are not allowed.

For POSIX systems, I'd also like to observe the following:

1) Widely used languages and libraries like Java and GTK+ assume that all file names and command lines are encoded in the system locale, or at least that they can all be converted to Unicode strings.

2) Command lines are usually entered as TEXT on a terminal and are therefore encoded in whatever encoding the terminal uses.

3) None of the recent Linux distributions I have installed did anything but set up a UTF-8 based system.

So I think we should try hard to avoid introducing any additional complexity, like filename ADTs used for program arguments, to deal with the small minority of systems where file names cannot be converted to Unicode.

Maybe it's possible to use some user-defined Unicode code points to achieve a lossless conversion of arbitrary byte strings to Unicode? I mean, byte strings that are valid in the system encoding would be transcoded correctly, and invalid bytes would be mapped to some extra code points so that they can be converted back if necessary.

Cheers,
Wolfgang
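P.S. For what it's worth, Python later standardized essentially this escaping scheme (PEP 383): a sketch of how the round trip works there, using the real "surrogateescape" error handler. The byte values in the example are my own illustration; the point is that invalid bytes map to reserved code points U+DC80..U+DCFF and can be mapped back losslessly.

```python
# Lossless byte <-> string round trip via the "surrogateescape"
# error handler (PEP 383). Bytes valid in the chosen encoding decode
# normally; each invalid byte is escaped to a lone surrogate in
# U+DC80..U+DCFF, so re-encoding reproduces the original bytes.
raw = b"valid-\xff-invalid"            # 0xFF is never valid UTF-8

text = raw.decode("utf-8", "surrogateescape")
assert text == "valid-\udcff-invalid"  # invalid byte -> U+DCFF

back = text.encode("utf-8", "surrogateescape")
assert back == raw                     # exact bytes recovered
```

A plain strict decode of the same bytes would raise an error, and a "replace" decode would lose information, so this is the only one of the standard handlers that gives the reversibility the scheme needs.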