Raw filenames vs locales

Hi all,

I'd like to propose some changes to the IO library to fix some problems (IMO) with hugs' recent closer adherence to the Haskell 98 report. I believe this is orthogonal to the proposed new IO library stuff that's been discussed before.

The rest of this mail is in 3 parts. First I describe the problem, then a proposed solution, and finally some comments on backwards compatibility.

===========
The problem
===========

With its closer adherence to the Haskell 98 report, it is no longer possible with hugs to manipulate files using the standard IO functions if the filenames are not representable in your locale.

To demonstrate the problem consider this:

--------------------------------------------------
touch `printf "1\xA4"`
echo '
import System.Directory (getDirectoryContents)
import Data.Char (ord)
main = do xs <- getDirectoryContents "."
          print (map (map ord) xs)
' > foo.hs
for locale in en_GB en_GB.ISO-8859-15 en_GB.UTF-8
do
    echo "==============================="
    echo "Doing $locale"
    LC_ALL=$locale ../runhugs foo.hs
done
--------------------------------------------------

Here we create a file whose filename is 1\xA4. The byte \xA4 is a "currency sign" in ISO-8859-1, a "euro sign" in ISO-8859-15, and is not valid on its own in UTF-8. We then print the results of getDirectoryContents, converting the Chars to Ints so we can see what's going on.

The result is this:

===============================
Doing en_GB
[[46],[46,46],[49,164],[102,111,111,46,104,115]]
===============================
Doing en_GB.ISO-8859-15
[[46],[46,46],[49,8364],[102,111,111,46,104,115]]
===============================
Doing en_GB.UTF-8
[[46],[46,46],[49,65533],[102,111,111,46,104,115]]

The third file is the interesting one. We have:

    ISO-8859-1:    164   = U+A4   = "currency sign"
    ISO-8859-15:   8364  = U+20AC = "euro sign"
    UTF-8:         65533 = U+FFFD = "replacement character"

"replacement character" is "used to replace an incoming character whose value is unknown or unrepresentable in Unicode".
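To make the three results above concrete, here is a minimal Haskell sketch that decodes the two raw bytes of the file name under each encoding. The names rawName, decodeLatin1, decodeLatin9, decodeUtf8Ascii and codes are hypothetical and not part of any library, and the decoders are deliberately simplified models, not what hugs does internally.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- The two bytes of the file name created by: touch `printf "1\xA4"`
rawName :: [Word8]
rawName = [0x31, 0xA4]

-- ISO-8859-1: every byte maps to the code point with the same number.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- ISO-8859-15 differs from ISO-8859-1 in a handful of positions.
-- Assumption: only the one that matters here (0xA4 -> U+20AC) is modelled.
decodeLatin9 :: [Word8] -> String
decodeLatin9 = map step
  where
    step 0xA4 = '\x20AC'
    step b    = chr (fromIntegral b)

-- Simplified UTF-8 model: ASCII bytes decode to themselves and every
-- other byte becomes U+FFFD. A real decoder would also accept valid
-- multi-byte sequences; a lone 0xA4 is not one of them.
decodeUtf8Ascii :: [Word8] -> String
decodeUtf8Ascii = map step
  where
    step b
      | b < 0x80  = chr (fromIntegral b)
      | otherwise = '\xFFFD'

-- Code points seen by the program for the example file name.
codes :: ([Word8] -> String) -> [Int]
codes decode = map ord (decode rawName)
```

Loaded into an interpreter, codes decodeLatin1, codes decodeLatin9 and codes decodeUtf8Ascii give [49,164], [49,8364] and [49,65533] respectively, matching the third entry in each listing. Only the first two mappings can be inverted to recover the original bytes.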
=================
Proposed solution
=================

My suggestion is essentially that we change all functions using the FilePath type to instead use FilePath a => a.

[ By jumping through hoops I think this could be done H98-compatibly, but for simplicity I'll ignore that for now. I'm not sure if it's a problem for any impl anyway? ]

I imagine the class would look something like

    class FilePath a where
        to_filename :: a -> IO FileName
        from_filename :: FileName -> IO a
        -- Convert, then release the underlying FileName.
        from_free_filename :: FileName -> IO a
        from_free_filename f = do x <- from_filename f
                                  free_filename f
                                  return x

    -- Run an action on the converted name, releasing it afterwards.
    with_filename :: FilePath a => a -> (FileName -> IO b) -> IO b
    with_filename x f = do x' <- to_filename x
                           res <- f x'
                           free_filename x'
                           return res

We would then have

    System.IO.Impl.getDirContents :: FileName -> IO [FileName]

    System.IO.getDirContents :: FilePath a => a -> IO [a] -- Could be more general
    System.IO.getDirContents x = do ys <- with_filename x Impl.getDirContents
                                    mapM from_free_filename ys

On Unix systems FileName would be a Ptr Word8. My knowledge of Windows isn't great, but I think there it would be an array of 16-bit values?

We would have instances of FilePath for String and [Word8] to solve the immediate problem. String would be the current behaviour, but [Word8] would be converted to a FileName unchanged. On Windows it would probably be necessary to throw an exception if a [Word8] is passed which is not valid UTF-8.

It would also be nice to have a FileName instance, to avoid unnecessary conversions. A Ptr Word8 instance would also be handy for things like darcs' FastPackedString module to be able to use efficiently (without taking a round trip via a lazy list).
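A rough, self-contained sketch of how the String and [Word8] instances might fit together follows. It makes several assumptions that are not in the proposal: the class is called FilePathLike to avoid clashing with the Prelude's FilePath synonym, FileName is modelled as a list of bytes instead of a Ptr Word8 (so nothing needs freeing), the String instance uses a Latin-1 style mapping as a stand-in for "the current behaviour", and implListDir is a mock that returns a fixed entry. The helper class PathElem is one possible version of the "hoops" needed to stay within Haskell 98 instance rules.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Stand-in for the OS-level name (assumption: a byte list, not Ptr Word8).
newtype FileName = FileName [Word8]

-- Hypothetical counterpart of the proposed class.
class FilePathLike a where
  toFileName   :: a -> IO FileName
  fromFileName :: FileName -> IO a

-- Haskell 98 has no instances for String or [Word8] directly, so the
-- work is delegated to a class over the element type.
class PathElem c where
  packElems   :: [c] -> IO FileName
  unpackElems :: FileName -> IO [c]

instance PathElem c => FilePathLike [c] where
  toFileName   = packElems
  fromFileName = unpackElems

-- Raw bytes pass through unchanged.
instance PathElem Word8 where
  packElems                 = return . FileName
  unpackElems (FileName bs) = return bs

-- Assumed Latin-1 style conversion; fails for unrepresentable names.
instance PathElem Char where
  packElems s
    | all ((<= 0xFF) . ord) s = return (FileName (map (fromIntegral . ord) s))
    | otherwise               = ioError (userError "file name not representable")
  unpackElems (FileName bs) = return (map (chr . fromIntegral) bs)

-- Mock of the implementation layer: always reports the file "1\xA4".
implListDir :: FileName -> IO [FileName]
implListDir _ = return [FileName [0x31, 0xA4]]

-- Overloaded front end: the caller's type picks the conversion.
listDir :: FilePathLike a => a -> IO [a]
listDir dir = do
  name    <- toFileName dir
  entries <- implListDir name
  mapM fromFileName entries
```

With this in scope, listDir "." yields a list of Strings because the literal fixes the type, while listDir [46 :: Word8] yields the untouched bytes.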
=======================
Backwards compatibility
=======================

I haven't done any research into it, but I hope that a lot of the time this will not be an issue as the impl will be able to infer the type String is being used, either by a string literal, the fact it is putStrLn'd, there is a type signature saying it is a String, etc.

The Haskell 98 modules like IO could re-export the functions with their types restricted to what they are now. This would give us complete backwards compatibility to Haskell 98.

It is certainly possible for there to be ambiguities in programs that use the hierarchical libraries, however. Possible solutions are:

* Tell people to add type sigs to fix it.

* Define the new stuff in System.IO.Impl in a package iobase. The oldio package would then contain System.IO which re-exports all the functions with the old types, and the io package would do the same with the new types. Unfortunately I don't think this would work if you have some libraries compiled against the io package you don't want to use. I think this might be an argument that the package system is not being flexible enough.

That's all I've got. Comments welcomed!

Thanks
Ian
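To illustrate the kind of ambiguity described above and the two fixes, here is a small sketch. Everything in it is hypothetical: FilePathLike and PathElem stand in for the proposed class, and currentDir and nameLength are mock functions invented for the example, not functions from any existing library.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

class PathElem c where
  fromByte :: Word8 -> c
  toByte   :: c -> Word8

instance PathElem Word8 where
  fromByte = id
  toByte   = id

instance PathElem Char where
  fromByte = chr . fromIntegral
  toByte   = fromIntegral . ord   -- wraps above U+00FF; acceptable in a sketch

class FilePathLike a where
  fromRaw :: [Word8] -> a
  toRaw   :: a -> [Word8]

instance PathElem c => FilePathLike [c] where
  fromRaw = map fromByte
  toRaw   = map toByte

-- Mock producer: pretends the current directory is "/tmp".
currentDir :: FilePathLike a => IO a
currentDir = return (fromRaw [0x2F, 0x74, 0x6D, 0x70])

-- Mock consumer: length of the name in bytes.
nameLength :: FilePathLike a => a -> IO Int
nameLength = return . length . toRaw

-- Rejected by the type checker, because nothing determines the type
-- that flows between the two overloaded functions:
--   ambiguous = currentDir >>= nameLength

-- Fix 1: a type signature at the use site.
annotated :: IO Int
annotated = (currentDir :: IO String) >>= nameLength

-- Fix 2: a re-export with the type restricted to the old one, as the
-- Haskell 98 modules could provide.
currentDir98 :: IO String
currentDir98 = currentDir

restricted :: IO Int
restricted = currentDir98 >>= nameLength
```

Code that only ever passes string literals or already-annotated Strings never hits the ambiguous case, which is why existing programs would mostly be unaffected.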

Ian Lynagh wrote:
=========== The problem ===========
With its closer adherence to the Haskell 98 report, it is no longer possible with hugs to manipulate files using the standard IO functions if the filenames are not representable in your locale.
Note that this basically means your filesystem is broken. This situation can only occur if a filesystem is written in one and then read in another locale. This "problem" cannot really be fixed, only worked around.
UTF-8: 65533 = U+FFFD = "replacement character"
================= Proposed solution =================
I have a simpler proposal: allocate 128 "replacement characters" in the "Vendor Zone" of Unicode. Their purpose is as placeholders for incorrect UTF-8. Then use these replacement characters when decoding UTF-8 and reproduce the original, broken, code when re-encoding. Under ordinary circumstances these codes should never occur in strings.
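A minimal sketch of such an escaping scheme is given below. The choice of range is an assumption made purely for illustration: escapeBase places the 128 placeholders at U+F780 to U+F7FF in the Private Use Area, and the names escape, decode, encode and valid are hypothetical. Validly encoded placeholders in the input are treated as invalid too, so that decoding followed by encoding always gives back the original bytes.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Assumption: byte b (0x80..0xFF) is escaped as code point escapeBase + b.
escapeBase :: Int
escapeBase = 0xF700

escape :: Word8 -> Char
escape b = chr (escapeBase + fromIntegral b)

isCont :: Word8 -> Bool
isCont b = b .&. 0xC0 == 0x80

-- Reject overlong forms, surrogates, out-of-range values and the
-- placeholder range itself.
valid :: Int -> Int -> Bool
valid minCp cp =
  cp >= minCp && cp <= 0x10FFFF
    && not (cp >= 0xD800 && cp <= 0xDFFF)
    && not (cp >= escapeBase + 0x80 && cp <= escapeBase + 0xFF)

-- UTF-8 decoder that escapes each byte it cannot decode.
decode :: [Word8] -> String
decode [] = []
decode (b:bs)
  | b < 0x80           = chr (fromIntegral b) : decode bs
  | b .&. 0xE0 == 0xC0 = multi 1 (fromIntegral (b .&. 0x1F)) 0x80
  | b .&. 0xF0 == 0xE0 = multi 2 (fromIntegral (b .&. 0x0F)) 0x800
  | b .&. 0xF8 == 0xF0 = multi 3 (fromIntegral (b .&. 0x07)) 0x10000
  | otherwise          = escape b : decode bs
  where
    multi n lead minCp =
      let (conts, rest) = splitAt n bs
          cp = foldl (\acc c -> (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F))
                     lead conts
      in if length conts == n && all isCont conts && valid minCp cp
           then chr cp : decode rest
           else escape b : decode bs

-- UTF-8 encoder that turns placeholders back into the original bytes.
encode :: String -> [Word8]
encode = concatMap enc
  where
    enc c
      | cp >= escapeBase + 0x80 && cp <= escapeBase + 0xFF
                     = [fromIntegral (cp - escapeBase)]
      | cp < 0x80    = [fromIntegral cp]
      | cp < 0x800   = [0xC0 .|. hi 6, cont 0]
      | cp < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
      | otherwise    = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
      where
        cp     = ord c
        hi n   = fromIntegral (cp `shiftR` n)
        cont n = 0x80 .|. (fromIntegral (cp `shiftR` n) .&. 0x3F)
```

For the file name from the original example, decode [0x31, 0xA4] gives the two code points 49 and 63396 (U+F7A4), and encode turns them back into the same two bytes, so the file stays accessible even though the string is not meaningful text.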
======================= Backwards compatibility =======================
comes at no additional cost ;-)

Udo.
--
It's not that perl programmers are idiots, it's that the language rewards idiotic behavior in a way that no other language or tool has ever done. -- Erik Naggum

On Sat, Jul 30, 2005 at 06:13:21PM +0200, Udo Stenzel wrote:
Ian Lynagh wrote:
With its closer adherence to the Haskell 98 report, it is no longer possible with hugs to manipulate files using the standard IO functions if the filenames are not representable in your locale.
Note that this basically means your filesystem is broken. This situation can only occur if a filesystem is written in one and then read in another locale. [...]
That is true, but on any multiuser system it's quite a reasonable scenario to have different users using different locales. It's an embarrassing scenario that I can't write a tool in Haskell that recursively deletes a directory in which there are files that aren't representable in my current locale... or display the contents of such files, or anything else.
This "problem" cannot really be fixed, only worked around.
On the contrary, the problem *can* be fixed, by only requiring that filenames be converted to Unicode if necessary. For many purposes (possibly even *most* purposes), knowledge of the character encoding is completely unnecessary.

More to the point, the "problem" is inherent in the language, not the filesystem--or perhaps you'd prefer to say that it's a problem with writing portable code. The point is that it would seem best to present an API which makes it possible to write portable code. On POSIX filesystems filenames are not sequences of Unicode characters, and treating them as such causes trouble.
UTF-8: 65533 = U+FFFD = "replacement character"
================= Proposed solution =================
I have a simpler proposal: allocate 128 "replacement characters" in the "Vendor Zone" of Unicode. Their purpose is as placeholders for incorrect UTF-8. Then use these replacement characters when decoding UTF-8 and reproduce the original, broken, code when re-encoding. Under ordinary circumstances these codes should never occur in strings.
I guess you'd then want a couple of functions in the IO monad to convert between FilePath and CString (or something we could actually use)?

While your suggestion would solve the problem of being unable to access some files, it would also result in FilePaths themselves (without conversion routines) being useless for purposes other than actually accessing the same files.

--
David Roundy
http://www.darcs.net

Hi all,

Just to clarify: filenames can be written (by different users) in different locales. Therefore, one should treat filenames as abstract entities (sequences of bytes), since one can't sensibly convert a filename to a string if the locale in which it was created is unknown.

If the above is true, we should just treat file names as an abstract data type (FilePath) with a set of operations to break them down into smaller pieces (directory, extension etc), to append them again, and to compare them. FilePaths can be created from strings, and even be shown. But showing and creating a filepath again would not be an identity (ie: makeFilePath . show /= id).

(Ian: I haven't studied your proposal in detail, but I can't see directly why you propose a separate FilePath class?)

All the best,
-- Daan.

David Roundy wrote:
On Sat, Jul 30, 2005 at 06:13:21PM +0200, Udo Stenzel wrote:
Ian Lynagh wrote:
With its closer adherence to the Haskell 98 report, it is no longer possible with hugs to manipulate files using the standard IO functions if the filenames are not representable in your locale.
Note that this basically means your filesystem is broken. This situation can only occur if a filesystem is written in one and then read in another locale. [...]
That is true, but on any multiuser system it's quite a reasonable scenario to have different users using different locales. It's an embarrassing scenario that I can't write a tool in Haskell that recursively deletes a directory in which there are files that aren't representable in my current locale... or display the contents of such files, or anything else.
This "problem" cannot really be fixed, only worked around.
On the contrary, the problem *can* be fixed, by only requiring that filenames be converted to Unicode if necessary. For many purposes (possibly even *most* purposes), knowledge of the character encoding is completely unnecessary.
More to the point, the "problem" is inherent in the language, not the filesystem--or perhaps you'd prefer to say that it's a problem with writing portable code. The point is that it would seem best to present an API which makes it possible to write portable code. On POSIX filesystems filenames are not sequences of Unicode characters, and treating them as such causes trouble.
UTF-8: 65533 = U+FFFD = "replacement character"
================= Proposed solution =================
I have a simpler proposal: allocate 128 "replacement characters" in the "Vendor Zone" of Unicode. Their purpose is as placeholders for incorrect UTF-8. Then use these replacement characters when decoding UTF-8 and reproduce the original, broken, code when re-encoding. Under ordinary circumstances these codes should never occur in strings.
I guess you'd then want a couple of functions in the IO monad to convert between FilePath and CString (or something we could actually use)?
While your suggestion would solve the problem of being unable to access some files, it would also result in FilePaths themselves (without conversion routines) being useless for purposes other than actually accessing the same files.
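The abstract file path type Daan describes in this message could look roughly like the sketch below. All names are hypothetical: the type is called Path to avoid clashing with the Prelude's FilePath synonym, makePath assumes a Latin-1 style conversion, showPath replaces non-ASCII bytes with '?', and splitExt ignores directory separators for brevity. In a real module the Path constructor would not be exported.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Abstract path: just the bytes the filesystem gave us.
newtype Path = Path [Word8] deriving (Eq, Ord)

-- Create a path from a string (assumed Latin-1; other chars become '?').
makePath :: String -> Path
makePath = Path . map toByte
  where
    toByte c
      | ord c <= 0xFF = fromIntegral (ord c)
      | otherwise     = 0x3F

-- Render a path for display. Lossy: non-ASCII bytes are shown as '?'.
showPath :: Path -> String
showPath (Path bs) = map toChar bs
  where
    toChar b
      | b < 0x80  = chr (fromIntegral b)
      | otherwise = '?'

-- Join two components with a '/' byte.
appendPath :: Path -> Path -> Path
appendPath (Path a) (Path b) = Path (a ++ [0x2F] ++ b)

-- Split at the last '.' byte into (base, extension).
splitExt :: Path -> (Path, Path)
splitExt (Path bs) =
  case break (== 0x2E) (reverse bs) of
    (_, [])               -> (Path bs, Path [])
    (revExt, _ : revBase) -> (Path (reverse revBase), Path (reverse revExt))
```

For the path Path [0x31, 0xA4] from the original example, makePath (showPath p) is a different path from p, which is the "makeFilePath . show /= id" property: comparison, appending and splitting work on the bytes, and only display involves a guess at the encoding.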

On Sat, Jul 30, 2005 at 11:57:48AM -0700, Daan Leijen wrote:
(Ian: I haven't studied your proposal in detail, but I can't see directly why you propose a separate FilePath class?)
If I understand correctly, we are saying the same thing up to my System.IO.Impl, but for you this would be System.IO itself.

By making System.IO a layer on top of this we get backwards compatibility with existing code (except where type sigs need to be added) and we don't need to have explicit conversion functions (between the FileName type and filenames the user enters as strings through getLine, filenames we get from a GUI library etc) throughout the code we write.

Thanks
Ian
participants (4): Daan Leijen, David Roundy, Ian Lynagh, Udo Stenzel