
On Mon, Feb 6, 2012 at 10:05, Joey Hess
John Millikin wrote:
That was my understanding also, then QuickCheck found a counter-example. It turns out that there are cases where a valid path cannot be roundtripped in the GHC 7.2 encoding.
The issue is that [238,189,178] decodes to 0xEF72, which is within the 0xEF00-0xEFFF range that GHC uses to represent un-decodable bytes.
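The arithmetic behind that counter-example can be checked by hand. The following is a minimal sketch, not code from GHC or system-filepath; the names `decode3` and `inGhc72EscapeRange` are made up for illustration, and the decoder handles only the three-byte UTF-8 form.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Word (Word8)

-- Decode one three-byte UTF-8 sequence (lead byte 1110xxxx followed by two
-- continuation bytes 10xxxxxx). Returns Nothing for any other shape.
-- Sketch only: it does not reject overlong forms or surrogates.
decode3 :: [Word8] -> Maybe Int
decode3 [b0, b1, b2]
  | b0 .&. 0xF0 == 0xE0, isCont b1, isCont b2 =
      Just (  (fromIntegral (b0 .&. 0x0F) `shiftL` 12)
          .|. (fromIntegral (b1 .&. 0x3F) `shiftL` 6)
          .|.  fromIntegral (b2 .&. 0x3F))
  where
    isCont b = b .&. 0xC0 == 0x80
decode3 _ = Nothing

-- True when a code point falls in the 0xEF00-0xEFFF range that, according
-- to this thread, GHC 7.2 uses to represent un-decodable bytes.
inGhc72EscapeRange :: Int -> Bool
inGhc72EscapeRange cp = cp >= 0xEF00 && cp <= 0xEFFF
```

With these definitions, `decode3 [238,189,178]` gives `Just 0xEF72` (0xE000 + 0x0F40 + 0x32), and `inGhc72EscapeRange 0xEF72` is `True`, so a validly encoded path collides with the escape range.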
How did you deal with this in system-filepath?
I used 0xEF00 as an escape character, to mean the following char should be interpreted as a literal byte. A user pointed out that there is a problem with this solution also -- a path containing actual U+EF00 will be considered "invalid encoding".

I'm going to change things over to use the Python 3 solution -- they use part of the UTF-16 surrogate pair range, so it's impossible for a valid path to contain their stand-in characters. Another user says that GHC 7.4 also changed its escape range to match Python 3, so it seems to be a pseudo-standard now. That's really good. I'm going to add a 'posix_ghc704' rule to system-filepath, which should mean that only users running GHC 7.2 will have to worry about escape chars.

Unfortunately, the "text" package refuses to store codepoints in that range (it replaces them with a placeholder), so I have to switch things over to use [Char]. (Yak sighted! Prepare lather!)
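For reference, a rough sketch of the surrogate-based escaping as Python 3 defines it in PEP 383: each un-decodable byte 0x80-0xFF is mapped to the lone low surrogate U+DC00 + byte. The function names `escapeByte` and `unescapeChar` are hypothetical and belong to neither library; that GHC 7.4 uses the same offsets is taken from the report above, not verified here.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Map an un-decodable byte (0x80..0xFF) to a lone low surrogate in
-- U+DC80..U+DCFF, following PEP 383. ASCII bytes always decode, so they
-- are never escaped and yield Nothing here.
escapeByte :: Word8 -> Maybe Char
escapeByte b
  | b >= 0x80 = Just (chr (0xDC00 + fromIntegral b))
  | otherwise = Nothing

-- Recover the original byte from a stand-in character, if it is one.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | n >= 0xDC80 && n <= 0xDCFF = Just (fromIntegral (n - 0xDC00))
  | otherwise                  = Nothing
  where
    n = ord c
```

Because well-formed UTF-8 never encodes a surrogate code point, a validly encoded path cannot produce one of these stand-ins, which is what avoids the U+EF00 ambiguity described above.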
While no code points in the Supplementary Special-purpose Plane are currently assigned (http://www.unicode.org/roadmaps/ssp/), it is worrying that it's used, especially if filenames in a non-Unicode encoding could be interpreted as containing characters really within this plane. I wonder why maxBound :: Char was not increased, and the additional space after '\1114111' used for the un-decodable bytes?
There's probably a lot of code out there that assumes (maxBound :: Char) is also the maximum Unicode code point. It would be difficult to update, particularly when dealing with bindings to foreign libraries (like the "text-icu" package). Both Python 3 and GHC 7.4 are using codepoints in the UTF16 surrogate pair range for this, and that seems like a pretty clean solution.
For FFI, anything that deals with a FilePath should use this withFilePath, which GHC contains but doesn't export(?), rather than the old withCString or withCAString:
    import Foreign.C.String (CString)
    import GHC.IO.Encoding (getFileSystemEncoding)
    import GHC.Foreign as GHC

    withFilePath :: FilePath -> (CString -> IO a) -> IO a
    withFilePath fp f = getFileSystemEncoding >>= \enc -> GHC.withCString enc fp f
If code uses either withFilePath or withCString, then the filenames written will depend on the user's locale. This is wrong. Filenames are either non-encoded text strings (Windows), UTF8 (OSX), or arbitrary bytes (non-OSX POSIX). They must not change depending on the locale.
This is exactly how GHC 7.4 handles them. For example:
    openDirStream :: FilePath -> IO DirStream
    openDirStream name =
      withFilePath name $ \s -> do
        dirp <- throwErrnoPathIfNullRetry "openDirStream" name $ c_opendir s
        return (DirStream dirp)

    removeLink :: FilePath -> IO ()
    removeLink name =
      withFilePath name $ \s ->
      throwErrnoPathIfMinus1_ "removeLink" name (c_unlink s)
I do not see any locale-dependent behavior in the filename bytes read/written.
Perhaps I'm misunderstanding, but the definition of 'withFilePath' you provided is definitely locale-dependent. Unless getFileSystemEncoding is constant?
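One way to settle that question empirically is to print what GHC reports under different locale settings. This is a small sketch assuming GHC 7.4 or later, where `GHC.IO.Encoding` exports both `getFileSystemEncoding` and `getLocaleEncoding`; the helper name `describeEncodings` is invented for the example.

```haskell
import GHC.IO.Encoding (getFileSystemEncoding, getLocaleEncoding)

-- Report the encodings GHC selected for file paths and for the locale,
-- each rendered through TextEncoding's Show instance.
describeEncodings :: IO (String, String)
describeEncodings = do
  fsEnc  <- getFileSystemEncoding
  locEnc <- getLocaleEncoding
  return (show fsEnc, show locEnc)
```

Calling `describeEncodings >>= print` from a program run with different LANG or LC_ALL values shows whether the file system encoding follows the locale on a given platform.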
Code that reads or writes a FilePath to a Handle (including even to stdout!) must take care to set the right encoding too:
    fileEncoding :: Handle -> IO ()
    fileEncoding h = hSetEncoding h =<< getFileSystemEncoding
This is also wrong. A "file path" cannot be written to a handle with any hope of correct behavior. If it's to be displayed to the user, a path should be converted to text first, then displayed.
Sure it can. See find(1). Its output can be read as FilePaths once the Handle is set up as above.
If you prefer your program not crash with an encoding error when an arbitrary FilePath is putStr, but instead perhaps output bytes that are not valid in the current encoding, that's also a valid choice. You might be writing a program, like find, that again needs to output any possible FilePath including badly encoded ones.
A program like find(1) has two use cases:

1. Display paths to the user, as text.
2. Provide paths to another program, in the operating system's file path format.

These two goals are in conflict. It is not possible to implement a find(1) that performs both correctly in all locales. The best solution is to choose #2, and always write in the OS format, and hope the user's shell+terminal are capable of rendering it to a reasonable-looking path.
Filesystem.Path.CurrentOS.toText is a nice option if you want validly encoded output though. Thanks for that!
Ah, that's not what toText is for. toText provides a human-readable representation of the path. It's used for things like file managers, where you need to show the user a label which approximates the underlying path. There's no guarantee that the output of toText can be converted back to the original path, especially if it returns a Left.