
Also, IIRC, Java strings are supposed to be unicode, too - how do they deal with the problem?
Files are represented by instances of the File class: [...] The documentation for the File class doesn't mention encoding issues at all.
... which led me to conclude that they don't deal with the problem properly.
I think that if we wait long enough, the filename encoding problems will become irrelevant and we will live in an ideal world where unicode actually works. Maybe next year, maybe only in ten years.
Maybe not even then. If Unicode really solved encoding problems, you'd expect the CJK world to be the first adopters, but they're actually the least eager; you are more likely to find UTF-8 in an English-language HTML page or email message than a Japanese one.
Hmm, that's possibly because English-language users can get away with just marking their ASCII files as UTF-8. But I'm not arguing about files or HTML pages here, I'm only concerned with filenames.
I prefer unicode nowadays because I was born within a hundred kilometers of the "border" between ISO-8859-1 and ISO-8859-2. I need 8859-1 for German-language texts, but as soon as I write about where I went for vacation, I need a few 8859-2 characters. So 8-bit encodings didn't cut it, and nobody ever tried to sell ISO-2022 to me, so unicode was the only alternative.
So you've now convinced me that there is a considerable number of computers using ISO-2022, where there's more than one way to encode the same text (how do people use this from the command line??). There are also multi-user systems where the users don't agree on a single encoding. I still reserve the right to call those systems messed-up, but that's just my personal opinion and "reality" couldn't care less about what I think.
So, as I don't want to stick with the status quo forever (lists of bytes that pretend to be lists of unicode chars, even on platforms where unicode is used anyway), how about we get to work - what do we want? I don't think we want a type class here, a plain (abstract) data type will do:
data File
Obviously, we'll need conversion from and to C strings. On Mac OS X, they'd be guaranteed to be in UTF-8.
withFilePathCString :: String -> (CString -> IO a) -> IO a
fileFromCString :: CString -> IO File
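A rough sketch of how the C string conversions could look, not a finished design. Everything about the representation is an assumption: File is given two hypothetical constructors (a Unicode name that has not been encoded yet, and raw bytes as received from the OS), the first function is called withFileCString here and takes a File where the signature above takes a String, and withCString from Foreign.C.String (which encodes using the current locale) stands in for "whatever the platform wants". It needs the bytestring package in addition to base.

```haskell
import qualified Data.ByteString as B
import Foreign.C.String (CString, withCString)

-- Hypothetical representation: either a Unicode name that has not been
-- encoded yet, or the raw bytes the OS handed to us.
data File
  = FromUnicode String
  | FromBytes B.ByteString
  deriving (Eq, Show)

-- Pure: no encoding decision is made at this point.
fileFromPath :: String -> File
fileFromPath = FromUnicode

-- The encoding step happens only here, when the name is about to be
-- passed to C. For a Unicode name we assume the locale encoding is the
-- right one; raw bytes are handed back exactly as they came in.
withFileCString :: File -> (CString -> IO a) -> IO a
withFileCString (FromUnicode s) act = withCString s act
withFileCString (FromBytes b)   act = B.useAsCString b act

-- Copy the bytes out of the C string unchanged; nothing is decoded.
fileFromCString :: CString -> IO File
fileFromCString cstr = FromBytes <$> B.packCString cstr
```

With this representation, a name that came from the OS goes back to the OS byte for byte, even if it cannot be decoded at all.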
We will need functions for converting to and from unicode strings. I'm pretty sure that we want to keep those functions pure, otherwise they'll be very annoying to use.
fileFromPath :: String -> File
Any impure operations that might be needed to decide how to encode the file name will have to be delayed until the File is actually used.
fileToPath :: File -> String
Same here: any impure operation necessary to convert the File to a unicode string needs to be done when the File is created.
What about failure? If you go from String to File, errors should be reported when you actually access the file. At an earlier time, you can't know whether the file name is valid (e.g. if you mount a "classic" HFS volume on Mac OS X, you can only create files there whose names can be represented in the volume's file name encoding - but you only find that out once you try to create a file).
For going from File to String, I'm not so sure, but I would be very annoyed if I had to deal with a Maybe String return type on platforms where it will always succeed. Maybe there should be separate functions for different purposes - i.e. for display, you'd use a File -> String function that will silently use '?'s when things can't be decoded, but in other situations you might use a File -> Maybe String function and check for Nothing.
If people want to implement more sophisticated ways of decoding file names than can be provided by the library, they'd get the C string and do the same things.
Of course, there should also be lots of other useful functions that make it more or less unnecessary to deal with path names directly in most cases.
Thoughts?
Cheers,
Wolfgang
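To make the display/strict split a bit more concrete, here is a minimal sketch of the two File -> String flavours. It rests on several assumptions: the two-constructor File representation is hypothetical, the name fileToPathMaybe is made up for the Maybe variant, and raw bytes are assumed to be UTF-8 (as they would be on Mac OS X), which a real library could not assume everywhere. It uses the bytestring and text packages.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', decodeUtf8With)

-- Hypothetical representation: a Unicode name that was never encoded,
-- or raw bytes received from the OS.
data File
  = FromUnicode String
  | FromBytes B.ByteString

-- For display: never fails. Bytes that cannot be decoded as UTF-8
-- (the assumed encoding) are shown as '?'.
fileToPath :: File -> String
fileToPath (FromUnicode s) = s
fileToPath (FromBytes b)   = T.unpack (decodeUtf8With (\_ _ -> Just '?') b)

-- For callers that need to know: Nothing if the bytes are not valid
-- UTF-8, so a lossy name is never mistaken for the real one.
fileToPathMaybe :: File -> Maybe String
fileToPathMaybe (FromUnicode s) = Just s
fileToPathMaybe (FromBytes b)   =
  either (const Nothing) (Just . T.unpack) (decodeUtf8' b)
```

Both functions stay pure, and the lossy one is only good for showing a name to the user: a string containing substituted '?'s should never be turned back into a File.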