Re: [Haskell-cafe] ANNOUNCE: system-filepath 0.4.5 and system-fileio 0.3.4

6 Feb 2012

On Mon, Feb 6, 2012 at 10:05, Joey Hess wrote:
...
John Millikin wrote:
...
That was my understanding also, then QuickCheck found a
counter-example. It turns out that there are cases where a valid path
cannot be roundtripped in the GHC 7.2 encoding.
...
The issue is that [238,189,178] decodes to 0xEF72, which is within
the 0xEF00-0xEFFF range that GHC uses to represent un-decodable bytes.
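To make the collision concrete, here is a minimal sketch that decodes those three bytes by hand and checks the result against the 0xEF00-0xEFFF range. The names decode3 and inGhc72EscapeRange are hypothetical helpers made up for this illustration, not part of GHC or system-filepath, and decode3 does no validation of its input.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)
import Numeric (showHex)

-- Decode one three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx).
-- Hypothetical helper for illustration only; it performs no validation.
decode3 :: Word8 -> Word8 -> Word8 -> Char
decode3 a b c = chr $
      ((fromIntegral a .&. 0x0F) `shiftL` 12)
  .|. ((fromIntegral b .&. 0x3F) `shiftL` 6)
  .|.  (fromIntegral c .&. 0x3F)

-- The range the thread says GHC 7.2 uses for un-decodable bytes.
inGhc72EscapeRange :: Char -> Bool
inGhc72EscapeRange ch = ch >= '\xEF00' && ch <= '\xEFFF'

main :: IO ()
main = do
  let ch = decode3 238 189 178
  putStrLn ("0x" ++ showHex (ord ch) "")  -- prints 0xef72
  print (inGhc72EscapeRange ch)           -- prints True
```

So a perfectly valid UTF-8 filename can decode to a character that is indistinguishable from an escaped byte.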
How did you deal with this in system-filepath?
I used 0xEF00 as an escape character, to mean the following char
should be interpreted as a literal byte.

A user pointed out that there is a problem with this solution also --
a path containing actual U+EF00 will be considered "invalid encoding".
I'm going to change things over to use the Python 3 solution -- they
use part of the UTF-16 surrogate pair range, so it's impossible for a
valid path to contain their stand-in characters.

Another user says that GHC 7.4 also changed its escape range to match
Python 3, so it seems to be a pseudo-standard now. That's really good.
I'm going to add a 'posix_ghc704' rule to system-filepath, which
should mean that only users running GHC 7.2 will have to worry about
escape chars.
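For reference, a rough sketch of the surrogate-range scheme, assuming the PEP 383 mapping in which an un-decodable byte b (0x80-0xFF) is represented as the lone surrogate U+DC00 + b. The function names escapeByte and unescapeChar are invented for this example; they are not the API of GHC, Python, or system-filepath.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Assumed PEP 383 style mapping: byte b becomes the code point 0xDC00 + b.
-- Only bytes 0x80..0xFF are expected here, since ASCII always decodes.
escapeByte :: Word8 -> Char
escapeByte b = chr (0xDC00 + fromIntegral b)

-- Recover the original byte if the character is one of the stand-ins.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | n >= 0xDC80 && n <= 0xDCFF = Just (fromIntegral (n - 0xDC00))
  | otherwise                  = Nothing
  where n = ord c

main :: IO ()
main = do
  print (ord (escapeByte 0xFF))           -- prints 56575, i.e. 0xDCFF
  print (unescapeChar (escapeByte 0xFF))  -- prints Just 255
  print (unescapeChar '\xEF72')           -- prints Nothing
```

Because lone surrogates never appear in well-formed UTF-8, a stand-in produced this way cannot be confused with a character decoded from a valid path.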

Unfortunately, the "text" package refuses to store codepoints in that
range (it replaces them with a placeholder), so I have to switch
things over to use [Char].
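A small check of that behaviour, assuming a version of "text" whose pack replaces surrogate code points with U+FFFD (which is what its documentation describes); a plain String keeps the surrogate untouched. The helper name roundTrip is made up for this example.

```haskell
import qualified Data.Text as T

-- Round-trip a String through Text. Surrogate code points are expected
-- to come back as the replacement character U+FFFD.
roundTrip :: String -> String
roundTrip = T.unpack . T.pack

main :: IO ()
main = do
  let s = "a\xDC80"
  print (roundTrip s == "a\xFFFD")  -- True if pack replaced the surrogate
  print (roundTrip s == s)          -- False in that case
```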

(Yak sighted! Prepare lather!)
...
While no code points in the Supplementary Special-purpose Plane are currently
assigned (http://www.unicode.org/roadmaps/ssp/), it is worrying that it's used,
especially if filenames in a non-unicode encoding could be interpreted as
containing characters really within this plane. I wonder why maxBound :: Char
was not increased, and the additional space after `\1114111' used for the
un-decodable bytes?
There's probably a lot of code out there that assumes (maxBound ::
Char) is also the maximum Unicode code point. It would be difficult to
update, particularly when dealing with bindings to foreign libraries
(like the "text-icu" package).

Both Python 3 and GHC 7.4 are using codepoints in the UTF-16 surrogate
pair range for this, and that seems like a pretty clean solution.
...
For FFI, anything that deals with a FilePath should use this
withFilePath, which GHC contains but doesn't export(?), rather than the
old withCString or withCAString:
import Foreign.C.String (CString)
import GHC.IO.Encoding (getFileSystemEncoding)
import GHC.Foreign as GHC
-- Marshal a FilePath to a C string using the filesystem encoding.
withFilePath :: FilePath -> (CString -> IO a) -> IO a
withFilePath fp f = getFileSystemEncoding >>= \enc -> GHC.withCString enc fp f
If code uses either withFilePath or withCString, then the filenames
written will depend on the user's locale. This is wrong. Filenames are
either non-encoded text strings (Windows), UTF-8 (OSX), or arbitrary
bytes (non-OSX POSIX). They must not change depending on the locale.
This is exactly how GHC 7.4 handles them. For example:
openDirStream :: FilePath -> IO DirStream
openDirStream name =
 withFilePath name $ \s -> do
   dirp <- throwErrnoPathIfNullRetry "openDirStream" name $ c_opendir s
   return (DirStream dirp)
removeLink :: FilePath -> IO ()
removeLink name =
 withFilePath name $ \s ->
 throwErrnoPathIfMinus1_ "removeLink" name (c_unlink s)
I do not see any locale-dependent behavior in the filename bytes read/written.
Perhaps I'm misunderstanding, but the definition of 'withFilePath' you
provided is definitely locale-dependent. Unless getFileSystemEncoding
is constant?
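One way to look into that question is to print both encodings and run the program under different locale settings (for example LANG=C versus a UTF-8 locale). This is only a sketch for experimenting; it assumes a base version that exports getFileSystemEncoding from GHC.IO.Encoding (GHC 7.4 or later), and no particular output is claimed here.

```haskell
import GHC.IO.Encoding (getFileSystemEncoding, getLocaleEncoding)

main :: IO ()
main = do
  fsEnc  <- getFileSystemEncoding
  locEnc <- getLocaleEncoding
  -- TextEncoding has a Show instance that prints the encoding's name.
  putStrLn ("filesystem encoding: " ++ show fsEnc)
  putStrLn ("locale encoding:     " ++ show locEnc)
```

If the reported filesystem encoding changes with the locale, then the charset is locale-derived, and whether arbitrary bytes still round-trip depends on the escaping scheme rather than on the charset alone.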
...
Code that reads or writes a FilePath to a Handle (including even to
stdout!) must take care to set the right encoding too:
fileEncoding :: Handle -> IO ()
fileEncoding h = hSetEncoding h =<< getFileSystemEncoding
This is also wrong. A "file path" cannot be written to a handle with
any hope of correct behavior. If it's to be displayed to the user, a
path should be converted to text first, then displayed.
Sure it can. See find(1). Its output can be read as FilePaths once the
Handle is set up as above.
If you prefer your program not crash with an encoding error when an
arbitrary FilePath is putStr, but instead perhaps output bytes that are
not valid in the current encoding, that's also a valid choice. You might
be writing a program, like find, that again needs to output any possible
FilePath including badly encoded ones.
A program like find(1) has two use cases:

1. Display paths to the user, as text.

2. Provide paths to another program, in the operating system's file path format.

These two goals are in conflict. It is not possible to implement a
find(1) that performs both correctly in all locales.

The best solution is to choose #2, and always write in the OS format,
and hope the user's shell+terminal are capable of rendering it to a
reasonable-looking path.
...
Filesystem.Path.CurrentOS.toText is a nice option if you want validly
encoded output though. Thanks for that!
Ah, that's not what toText is for. toText provides a human-readable
representation of the path. It's used for things like file managers,
where you need to show the user a label which approximates the
underlying path. There's no guarantee that the output of toText can be
converted back to the original path, especially if it returns a Left.