Raw filenames vs locales

30 Jul 2005

Hi all,

I'd like to propose some changes to the IO library to fix some problems
(IMO) with hugs' recent closer adherence to the Haskell 98 report. I
believe this is orthogonal to the proposed new IO library stuff that's
been discussed before.

The rest of this mail is in 3 parts. First I describe the problem, then
a proposed solution, and finally some comments on backwards
compatibility.

===========
The problem
===========

With its closer adherence to the Haskell 98 report, it is no longer
possible with hugs to manipulate files using the standard IO functions
if the filenames are not representable in your locale. To demonstrate
the problem consider this:

--------------------------------------------------
touch `printf "1\xA4"`
echo '
    import System.Directory (getDirectoryContents)
    import Data.Char (ord)
    main = do xs <- getDirectoryContents "."
              print (map (map ord) xs)
 ' > foo.hs
for locale in en_GB en_GB.ISO-8859-15 en_GB.UTF-8
do  
    echo "==============================="
    echo "Doing $locale"
    LC_ALL=$locale ../runhugs foo.hs
done
--------------------------------------------------

Here we create a file whose filename is 1\xA4. The byte \xA4 is a
"currency sign" in ISO-8859-1, a "euro sign" in ISO-8859-15, and not a
valid byte sequence on its own in UTF-8. We then print the results of
getDirectoryContents, converting the Chars to Ints so we can see what's
going on.

The result is this:

===============================
Doing en_GB
[[46],[46,46],[49,164],[102,111,111,46,104,115]]
===============================
Doing en_GB.ISO-8859-15
[[46],[46,46],[49,8364],[102,111,111,46,104,115]]
===============================
Doing en_GB.UTF-8
[[46],[46,46],[49,65533],[102,111,111,46,104,115]]

The third file is the interesting one. We have:
ISO-8859-1:  164   = U+A4   = "currency sign"
ISO-8859-15: 8364  = U+20AC = "euro sign"
UTF-8:       65533 = U+FFFD = "replacement character"

"replacement character" is "used to replace an incoming character whose
value is unknown or unrepresentable in Unicode".
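To make the three results concrete, here is a rough sketch of how the
bytes of that filename decode under each charset. The function names
are mine, the ISO-8859-15 decoder only handles the one position that
matters here, and the UTF-8 decoder does not check for overlong forms.
Substituting U+FFFD for undecodable bytes is an assumption chosen to
match the output above, not a description of hugs' internals.

```haskell
import Data.Bits ((.&.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- ISO-8859-1 maps every byte to the code point with the same number.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- ISO-8859-15 differs from ISO-8859-1 in eight positions. Only 0xA4
-- (euro sign) is handled in this sketch; the other seven are omitted.
decodeLatin9 :: [Word8] -> String
decodeLatin9 = map step
  where
    step 0xA4 = '\x20AC'
    step b    = chr (fromIntegral b)

-- Lenient UTF-8 decoder: any byte that does not start a well-formed
-- sequence becomes U+FFFD. Overlong forms are not rejected.
decodeUtf8Lenient :: [Word8] -> String
decodeUtf8Lenient [] = []
decodeUtf8Lenient (b : bs)
  | b < 0x80               = chr (fromIntegral b) : decodeUtf8Lenient bs
  | b >= 0xC2 && b <= 0xDF = multi 1 (fromIntegral b .&. 0x1F)
  | b >= 0xE0 && b <= 0xEF = multi 2 (fromIntegral b .&. 0x0F)
  | b >= 0xF0 && b <= 0xF4 = multi 3 (fromIntegral b .&. 0x07)
  | otherwise              = '\xFFFD' : decodeUtf8Lenient bs
  where
    isCont c = c .&. 0xC0 == 0x80

    multi :: Int -> Int -> String
    multi n lead
      | length cont == n && all isCont cont && cp <= 0x10FFFF
                  = chr cp : decodeUtf8Lenient rest
      | otherwise = '\xFFFD' : decodeUtf8Lenient bs
      where
        (cont, rest) = splitAt n bs
        cp = foldl (\acc c -> acc * 64 + (fromIntegral c .&. 0x3F)) lead cont

-- Same view of a String as the test program prints.
codePoints :: String -> [Int]
codePoints = map ord
```

Applying codePoints to each decoder's result for the bytes [49, 164]
gives [49,164], [49,8364] and [49,65533] respectively. In the UTF-8
case the original byte cannot be recovered from the String.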

=================
Proposed solution
=================

My suggestion is essentially that we change all functions using the
FilePath type to instead use FilePath a => a.

[ By jumping through hoops I think this could be done H98-compatibly,
  but for simplicity I'll ignore that for now. I'm not sure if it's a
  problem for any impl anyway? ]

I imagine the class would look something like

class FilePath a where
    to_filename :: a -> IO FileName
    from_filename :: FileName -> IO a

from_free_filename :: FilePath a => FileName -> IO a
from_free_filename f = do x <- from_filename f
                          free_filename f
                          return x

with_filename :: FilePath a => a -> (FileName -> IO b) -> IO b
with_filename x f = do x' <- to_filename x
                       res <- f x'
                       free_filename x'
                       return res

We would then have

  System.IO.Impl.getDirContents :: FileName -> IO [FileName]
  System.IO.getDirContents :: FilePath a => a -> IO [a] -- Could be more general
  System.IO.getDirContents x = do ys <- with_filename x Impl.getDirContents
                                  mapM from_free_filename ys

On Unix systems FileName would be a Ptr Word8.
My knowledge of Windows isn't great, but I think there it would be an
array of 16-bit values?

We would have instances of FilePath for String and [Word8] to solve the
immediate problem. String would be the current behaviour, but [Word8]
would be converted to a FileName unchanged. On Windows it would probably
be necessary to throw an exception if a [Word8] is passed which is not
valid UTF-8.

It would also be nice to have a FileName instance, to avoid unnecessary
conversions. A Ptr Word8 instance would also be handy for things like
darcs' FastPackedString module to be able to use efficiently (without
taking a round trip via a lazy list).
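For what it's worth, here is a compilable sketch of the class with the
String and [Word8] instances, taking the Unix representation (a
NUL-terminated Ptr Word8). Several things are assumptions of mine: the
String instance uses fixed ISO-8859-1 as a stand-in for the
locale-dependent conversion, implGetDirContents is a stub returning a
fixed listing rather than the real System.IO.Impl function, and
with_filename uses bracket so the buffer is freed even if the action
throws. The instances need FlexibleInstances, which is the kind of
hoop an H98-compatible version would have to work around.

```haskell
{-# LANGUAGE FlexibleInstances #-}

import Prelude hiding (FilePath)  -- the class reuses the name FilePath
import Control.Exception (bracket)
import Data.Char (chr, ord)
import Data.Word (Word8)
import Foreign.Marshal.Alloc (free)
import Foreign.Marshal.Array (newArray0, peekArray0)
import Foreign.Ptr (Ptr)

-- Unix-style raw filename: a NUL-terminated, malloc'd byte buffer.
type FileName = Ptr Word8

class FilePath a where
  to_filename   :: a -> IO FileName
  from_filename :: FileName -> IO a

-- Raw bytes pass through unchanged.
instance FilePath [Word8] where
  to_filename   = newArray0 0
  from_filename = peekArray0 0

-- Assumption: fixed ISO-8859-1 instead of the current locale.
instance FilePath [Char] where
  to_filename s
    | all ((< 256) . ord) s = newArray0 0 (map (fromIntegral . ord) s)
    | otherwise             = ioError (userError "not representable")
  from_filename p = map (chr . fromIntegral) <$> peekArray0 0 p

free_filename :: FileName -> IO ()
free_filename = free

from_free_filename :: FilePath a => FileName -> IO a
from_free_filename f = do
  x <- from_filename f
  free_filename f
  return x

with_filename :: FilePath a => a -> (FileName -> IO b) -> IO b
with_filename x = bracket (to_filename x) free_filename

-- Stub standing in for System.IO.Impl.getDirContents: it ignores its
-- argument and returns freshly allocated names for ".", ".." and "1\xA4".
implGetDirContents :: FileName -> IO [FileName]
implGetDirContents _ = mapM (newArray0 0) [[46], [46, 46], [49, 164]]

getDirContents :: FilePath a => a -> IO [a]
getDirContents x = with_filename x implGetDirContents >>= mapM from_free_filename
```

Calling getDirContents at type [Word8] hands back the bytes untouched,
while calling it with a String literal goes through the String
instance, so the caller picks the representation purely by type.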

=======================
Backwards compatibility
=======================

I haven't done any research into it, but I hope that a lot of the time
this will not be an issue as the impl will be able to infer the type
String is being used, either by a string literal, the fact it is
putStrLn'd, there is a type signature saying it is a String, etc.

The Haskell 98 modules like IO could re-export the functions with their
types restricted to what they are now. This would give us complete
backwards compatibility to Haskell 98.
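As a small illustration of the re-export idea, the sketch below pins a
generalised function to String. The class here is a pure, cut-down
stand-in for the proposed one (toBytes and fromBytes are hypothetical
names), getDirContents is stubbed to return a fixed listing, and the
98 suffix only exists because both versions live in one file; in
practice the two would be in different modules under the same name.

```haskell
{-# LANGUAGE FlexibleInstances #-}

import Prelude hiding (FilePath)  -- the class reuses the name FilePath
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Pure stand-in for the proposed class (assumption: the real methods
-- would live in IO and go through FileName).
class FilePath a where
  toBytes   :: a -> [Word8]
  fromBytes :: [Word8] -> a

instance FilePath [Word8] where
  toBytes   = id
  fromBytes = id

-- ISO-8859-1 only, for brevity; larger code points are not handled.
instance FilePath [Char] where
  toBytes   = map (fromIntegral . ord)
  fromBytes = map (chr . fromIntegral)

-- Generalised function as the new System.IO might export it (stubbed).
getDirContents :: FilePath a => a -> IO [a]
getDirContents _ = return (map fromBytes [[46], [46, 46], [49, 164]])

-- What a Haskell 98 module could re-export: the same function with its
-- type restricted to what it is now.
getDirectoryContents98 :: String -> IO [String]
getDirectoryContents98 = getDirContents
```

The restricted version needs no new code, only a narrower type
signature, which is why existing Haskell 98 programs should not notice
the change.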

It is certainly possible for there to be ambiguities in programs that
use the hierarchical libraries, however. Possible solutions are:

* Tell people to add type sigs to fix it.

* Define the new stuff in System.IO.Impl in a package iobase.
  The oldio package would then contain System.IO which re-exports all
  the functions with the old types, and the io package would do the same
  with the new types.
  Unfortunately I don't think this would work if you have some libraries
  compiled against the io package you don't want to use. I think this
  might be an argument that the package system is not being flexible
  enough.
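The sort of ambiguity I have in mind, and the type-signature fix, can
be sketched as follows. getCwd and setCwd are hypothetical stand-ins
for getCurrentDirectory and setCurrentDirectory that keep the
"directory" in an IORef, and the class is again a pure, cut-down
version with made-up method names.

```haskell
{-# LANGUAGE FlexibleInstances #-}

import Prelude hiding (FilePath)  -- the class reuses the name FilePath
import Data.Char (chr, ord)
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import Data.Word (Word8)

class FilePath a where
  toBytes   :: a -> [Word8]
  fromBytes :: [Word8] -> a

instance FilePath [Word8] where
  toBytes   = id
  fromBytes = id

-- ISO-8859-1 only, for brevity.
instance FilePath [Char] where
  toBytes   = map (fromIntegral . ord)
  fromBytes = map (chr . fromIntegral)

getCwd :: FilePath a => IORef [Word8] -> IO a
getCwd ref = fromBytes <$> readIORef ref

setCwd :: FilePath a => IORef [Word8] -> a -> IO ()
setCwd ref = writeIORef ref . toBytes

-- Ambiguous, so rejected by the type checker: nothing says which
-- instance the intermediate value should use.
--   roundTrip ref = getCwd ref >>= setCwd ref

-- Fixed by a type signature on the intermediate value.
roundTrip :: IORef [Word8] -> IO ()
roundTrip ref = (getCwd ref :: IO [Word8]) >>= setCwd ref
```

With the old String-only types the commented-out line was fine, so
this is exactly the kind of program that would need a signature added.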

That's all I've got.
Comments welcomed!

Thanks
Ian

Ian Lynagh
