ANNOUNCE: system-filepath 0.4.5 and system-fileio 0.3.4

Both packages now have much-improved support for non-UTF8 paths on POSIX systems. There are no significant changes to Windows support in this release.
system-filepath 0.4.5:
    Hackage: http://hackage.haskell.org/package/system-filepath-0.4.5
    API reference: https://john-millikin.com/software/haskell-filesystem/reference/system-filep...
system-fileio 0.3.4:
    Hackage: http://hackage.haskell.org/package/system-fileio-0.3.4
    API reference: https://john-millikin.com/software/haskell-filesystem/reference/system-filei...
-----
In GHC 7.2 and later, file path handling in the platform libraries was changed to treat all paths as text (encoded according to locale). This does not work well on POSIX systems, because POSIX paths are byte sequences. There is no guarantee that any particular path will be valid in the user's locale encoding.
system-filepath and system-fileio were modified to partially support this new behavior, but because the underlying libraries were unable to represent certain paths, they were still "broken" when built with GHC 7.2+. The changes in this release mean that they are now fully compatible (to the best of my knowledge) with GHC 7.2 and 7.4.
Important changes:
* system-filepath has been converted from GHC's escaping rules to its own, more compatible rules. This lets it support file paths that cannot be represented in GHC 7.2's escape format.
* The POSIX layer of system-fileio has been completely rewritten to use the FFI, rather than System.Directory. This allows it to work with arbitrary POSIX paths, including those that GHC itself cannot handle. The Windows layer still uses System.Directory, since it seems to work properly.
* The POSIX implementation of removeTree will no longer recurse into directory symlinks that it does not have permission to remove. This is a change in behavior from the directory package's implementation. See http://www.haskell.org/pipermail/haskell-cafe/2012-January/098911.html for details and the reasoning behind the change. Since Windows does not support symlinks, I have not modified the Windows implementation (which uses removeDirectoryRecursive).

John Millikin wrote:
In GHC 7.2 and later, file path handling in the platform libraries was changed to treat all paths as text (encoded according to locale). This does not work well on POSIX systems, because POSIX paths are byte sequences. There is no guarantee that any particular path will be valid in the user's locale encoding.
I've been dealing with this change too, but my current understanding is that GHC's handling of encoding for FilePath is documented to allow "arbitrary undecodable bytes to be round-tripped through it". As long as FilePaths are read using this file system encoding, any FilePath should be usable even if it does not match the user's encoding.
For FFI, anything that deals with a FilePath should use this withFilePath, which GHC contains but doesn't export(?), rather than the old withCString or withCAString:
    import GHC.IO.Encoding (getFileSystemEncoding)
    import GHC.Foreign as GHC
    withFilePath :: FilePath -> (CString -> IO a) -> IO a
    withFilePath fp f = getFileSystemEncoding >>= \enc -> GHC.withCString enc fp f
Code that reads or writes a FilePath to a Handle (including even to stdout!) must take care to set the right encoding too:
    fileEncoding :: Handle -> IO ()
    fileEncoding h = hSetEncoding h =<< getFileSystemEncoding
* system-filepath has been converted from GHC's escaping rules to its own, more compatible rules. This lets it support file paths that cannot be represented in GHC 7.2's escape format.
I'm doubtful about adding yet another encoding to the mix. Things are complicated enough already! And in my tests, GHC 7.4's FilePath encoding does allow arbitrary bytes in FilePaths.
BTW, GHC now also has RawFilePath. Parts of System.Directory could be usefully written to support that data type too. For example, the parent directory can be determined. Other things are more difficult to do with RawFilePath.
-- see shy jo

On Sun, Feb 5, 2012 at 18:49, Joey Hess
John Millikin wrote:
In GHC 7.2 and later, file path handling in the platform libraries was changed to treat all paths as text (encoded according to locale). This does not work well on POSIX systems, because POSIX paths are byte sequences. There is no guarantee that any particular path will be valid in the user's locale encoding.
I've been dealing with this change too, but my current understanding is that GHC's handling of encoding for FilePath is documented to allow "arbitrary undecodable bytes to be round-tripped through it".
As long as FilePaths are read using this file system encoding, any FilePath should be usable even if it does not match the user's encoding.
That was my understanding also, then QuickCheck found a counter-example. It turns out that there are cases where a valid path cannot be roundtripped in the GHC 7.2 encoding.
--------------------------------------------------------------------------
$ ~/ghc-7.0.4/bin/ghci
Prelude> writeFile ".txt" "test"
Prelude> readFile ".txt"
"test"
Prelude>
$ ~/ghc-7.2.1/bin/ghci
Prelude> import System.Directory
Prelude System.Directory> getDirectoryContents "."
["\61347.txt","\61347.txt","..","."]
Prelude System.Directory> readFile "\61347.txt"
*** Exception: .txt: openFile: does not exist (No such file or directory)
Prelude System.Directory>
--------------------------------------------------------------------------
The issue is that [238,189,178] decodes to 0xEF72, which is within the 0xEF00-0xEFFF range that GHC uses to represent un-decodable bytes.
For FFI, anything that deals with a FilePath should use this withFilePath, which GHC contains but doesn't export(?), rather than the old withCString or withCAString:
    import GHC.IO.Encoding (getFileSystemEncoding)
    import GHC.Foreign as GHC
    withFilePath :: FilePath -> (CString -> IO a) -> IO a
    withFilePath fp f = getFileSystemEncoding >>= \enc -> GHC.withCString enc fp f
If code uses either withFilePort or withCString, then the filenames written will depend on the user's locale. This is wrong. Filenames are either non-encoded text strings (Windows), UTF8 (OSX), or arbitrary bytes (non-OSX POSIX). They must not change depending on the locale.
Code that reads or writes a FilePath to a Handle (including even to stdout!) must take care to set the right encoding too:
    fileEncoding :: Handle -> IO ()
    fileEncoding h = hSetEncoding h =<< getFileSystemEncoding
This is also wrong. A "file path" cannot be written to a handle with any hope of correct behavior. If it's to be displayed to the user, a path should be converted to text first, then displayed.
* system-filepath has been converted from GHC's escaping rules to its own, more compatible rules. This lets it support file paths that cannot be represented in GHC 7.2's escape format.
I'm doubtful about adding yet another encoding to the mix. Things are complicated enough already! And in my tests, GHC 7.4's FilePath encoding does allow arbitrary bytes in FilePaths.
Unlike the GHC encoding, this encoding is entirely internal, and should not change the API's behavior.
BTW, GHC now also has RawFilePath. Parts of System.Directory could be usefully written to support that data type too. For example, the parent directory can be determined. Other things are more difficult to do with RawFilePath.
This is new in 7.4, and won't be backported, right? I tried compiling the new "unix" package in 7.2 to get proper file path support, but it failed with an error about some new language extension.

On Sun, Feb 5, 2012 at 19:17, John Millikin
--------------------------------------------------------------------------
$ ~/ghc-7.0.4/bin/ghci
Prelude> writeFile ".txt" "test"
Prelude> readFile ".txt"
"test"
Prelude>
Sorry, that got a bit mangled in the email. Corrected version:
--------------------------------------------------------------------------
$ ~/ghc-7.0.4/bin/ghci
Prelude> writeFile "\xA3.txt" "test"
Prelude> readFile "\xA3.txt"
"test"
Prelude> writeFile "\xEE\xBE\xA3.txt" "test 2"
Prelude> readFile "\xEE\xBE\xA3.txt"
"test 2"
--------------------------------------------------------------------------
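The arithmetic behind the collision can be reproduced with a minimal sketch. This is not GHC's decoder: it assumes, as described in this thread, that GHC 7.2 maps an undecodable byte b to the code point 0xEF00 + b, and the helper names decodeUtf8Three and escapeByte72 are hypothetical.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- | Decode one three-byte UTF-8 sequence (lead byte 0xE0..0xEF followed by
-- two continuation bytes). Hypothetical helper for illustration only: it
-- does not reject overlong forms or surrogates.
decodeUtf8Three :: Word8 -> Word8 -> Word8 -> Maybe Char
decodeUtf8Three b0 b1 b2
  | b0 .&. 0xF0 == 0xE0 && isCont b1 && isCont b2 =
      Just (chr (((fromIntegral b0 .&. 0x0F) `shiftL` 12)
             .|. ((fromIntegral b1 .&. 0x3F) `shiftL` 6)
             .|. (fromIntegral b2 .&. 0x3F)))
  | otherwise = Nothing
  where
    -- Continuation bytes have the bit pattern 10xxxxxx.
    isCont b = b .&. 0xC0 == 0x80

-- | The GHC 7.2 escape as described in the thread: an undecodable byte b
-- becomes the private-use code point 0xEF00 + b. (Assumption taken from
-- the discussion, not from GHC's source.)
escapeByte72 :: Word8 -> Char
escapeByte72 b = chr (0xEF00 + fromIntegral b)
```

Under these assumptions, decodeUtf8Three 0xEE 0xBE 0xA3 and escapeByte72 0xA3 both produce '\61347', which is why the two files in the session above show up under the same name and cannot be told apart when the name is encoded back to bytes.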

John Millikin wrote:
That was my understanding also, then QuickCheck found a counter-example. It turns out that there are cases where a valid path cannot be roundtripped in the GHC 7.2 encoding.
The issue is that [238,189,178] decodes to 0xEF72, which is within the 0xEF00-0xEFFF range that GHC uses to represent un-decodable bytes.
How did you deal with this in system-filepath? While no code points in the Supplementary Special-purpose Plane are currently assigned (http://www.unicode.org/roadmaps/ssp/), it is worrying that it's used, especially if filenames in a non-Unicode encoding could be interpreted as containing characters really within this plane. I wonder why maxBound :: Char was not increased, and the additional space after `\1114111' used for the un-decodable bytes?
For FFI, anything that deals with a FilePath should use this withFilePath, which GHC contains but doesn't export(?), rather than the old withCString or withCAString:
    import GHC.IO.Encoding (getFileSystemEncoding)
    import GHC.Foreign as GHC
    withFilePath :: FilePath -> (CString -> IO a) -> IO a
    withFilePath fp f = getFileSystemEncoding >>= \enc -> GHC.withCString enc fp f
If code uses either withFilePort [withFilePath?] or withCString, then the filenames written will depend on the user's locale. This is wrong. Filenames are either non-encoded text strings (Windows), UTF8 (OSX), or arbitrary bytes (non-OSX POSIX). They must not change depending on the locale.
This is exactly how GHC 7.4 handles them. For example:
    openDirStream :: FilePath -> IO DirStream
    openDirStream name =
      withFilePath name $ \s -> do
        dirp <- throwErrnoPathIfNullRetry "openDirStream" name $ c_opendir s
        return (DirStream dirp)
    removeLink :: FilePath -> IO ()
    removeLink name =
      withFilePath name $ \s ->
      throwErrnoPathIfMinus1_ "removeLink" name (c_unlink s)
I do not see any locale-dependent behavior in the filename bytes read/written.
Code that reads or writes a FilePath to a Handle (including even to stdout!) must take care to set the right encoding too:
    fileEncoding :: Handle -> IO ()
    fileEncoding h = hSetEncoding h =<< getFileSystemEncoding
This is also wrong. A "file path" cannot be written to a handle with any hope of correct behavior. If it's to be displayed to the user, a path should be converted to text first, then displayed.
Sure it can. See find(1). Its output can be read as FilePaths once the Handle is set up as above. If you prefer your program not crash with an encoding error when an arbitrary FilePath is putStr, but instead perhaps output bytes that are not valid in the current encoding, that's also a valid choice. You might be writing a program, like find, that again needs to output any possible FilePath including badly encoded ones. Filesystem.Path.CurrentOS.toText is a nice option if you want validly encoded output though. Thanks for that!
This is new in 7.4, and won't be backported, right? I tried compiling the new "unix" package in 7.2 to get proper file path support, but it failed with an error about some new language extension.
The RawFilePath is just a ByteString, so your existing converters for that in system-filepath might work. -- see shy jo

On Mon, Feb 6, 2012 at 10:05, Joey Hess
John Millikin wrote:
That was my understanding also, then QuickCheck found a counter-example. It turns out that there are cases where a valid path cannot be roundtripped in the GHC 7.2 encoding.
The issue is that [238,189,178] decodes to 0xEF72, which is within the 0xEF00-0xEFFF range that GHC uses to represent un-decodable bytes.
How did you deal with this in system-filepath?
I used 0xEF00 as an escape character, to mean the following char should be interpreted as a literal byte. A user pointed out that there is a problem with this solution also: a path containing actual U+EF00 will be considered "invalid encoding".
I'm going to change things over to use the Python 3 solution. They use part of the UTF16 surrogate pair range, so it's impossible for a valid path to contain their stand-in characters. Another user says that GHC 7.4 also changed its escape range to match Python 3, so it seems to be a pseudo-standard now. That's really good. I'm going to add a 'posix_ghc704' rule to system-filepath, which should mean that only users running GHC 7.2 will have to worry about escape chars.
Unfortunately, the "text" package refuses to store codepoints in that range (it replaces them with a placeholder), so I have to switch things over to use [Char]. (Yak sighted! Prepare lather!)
While no code points in the Supplementary Special-purpose Plane are currently assigned (http://www.unicode.org/roadmaps/ssp/), it is worrying that it's used, especially if filenames in a non-Unicode encoding could be interpreted as containing characters really within this plane. I wonder why maxBound :: Char was not increased, and the additional space after `\1114111' used for the un-decodable bytes?
There's probably a lot of code out there that assumes (maxBound :: Char) is also the maximum Unicode code point. It would be difficult to update, particularly when dealing with bindings to foreign libraries (like the "text-icu" package). Both Python 3 and GHC 7.4 are using codepoints in the UTF16 surrogate pair range for this, and that seems like a pretty clean solution.
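For reference, the surrogate-range scheme can be sketched in a few lines. This is only an illustration, not code from system-filepath or GHC: it assumes the mapping used by Python 3's "surrogateescape" error handler, where an undecodable byte in 0x80..0xFF becomes the lone low surrogate U+DC80..U+DCFF, and the names escapeByte and unescapeChar are hypothetical.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Map an undecodable byte (0x80..0xFF) to a lone low surrogate in
-- U+DC80..U+DCFF. Bytes below 0x80 are ASCII and never need escaping,
-- so they yield Nothing.
escapeByte :: Word8 -> Maybe Char
escapeByte b
  | b >= 0x80 = Just (chr (0xDC00 + fromIntegral b))
  | otherwise = Nothing

-- | Recover the original byte from a stand-in character, if it is one.
-- Any other character (including private-use ones such as U+EFA3)
-- yields Nothing.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | n >= 0xDC80 && n <= 0xDCFF = Just (fromIntegral (n - 0xDC00))
  | otherwise = Nothing
  where
    n = ord c
```

Because well-formed UTF-8 never decodes to a surrogate, a stand-in produced by escapeByte cannot be confused with a character that was really present in the path, which is the property the 0xEF00-based scheme lacked. A Haskell Char can hold these code points, which is why the sketch uses Char rather than Text.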
For FFI, anything that deals with a FilePath should use this withFilePath, which GHC contains but doesn't export(?), rather than the old withCString or withCAString:
    import GHC.IO.Encoding (getFileSystemEncoding)
    import GHC.Foreign as GHC
    withFilePath :: FilePath -> (CString -> IO a) -> IO a
    withFilePath fp f = getFileSystemEncoding >>= \enc -> GHC.withCString enc fp f
If code uses either withFilePort [withFilePath?] or withCString, then the filenames written will depend on the user's locale. This is wrong. Filenames are either non-encoded text strings (Windows), UTF8 (OSX), or arbitrary bytes (non-OSX POSIX). They must not change depending on the locale.
This is exactly how GHC 7.4 handles them. For example:
    openDirStream :: FilePath -> IO DirStream
    openDirStream name =
      withFilePath name $ \s -> do
        dirp <- throwErrnoPathIfNullRetry "openDirStream" name $ c_opendir s
        return (DirStream dirp)
    removeLink :: FilePath -> IO ()
    removeLink name =
      withFilePath name $ \s ->
      throwErrnoPathIfMinus1_ "removeLink" name (c_unlink s)
I do not see any locale-dependent behavior in the filename bytes read/written.
Perhaps I'm misunderstanding, but the definition of 'withFilePath' you provided is definitely locale-dependent. Unless getFileSystemEncoding is constant?
Code that reads or writes a FilePath to a Handle (including even to stdout!) must take care to set the right encoding too:
    fileEncoding :: Handle -> IO ()
    fileEncoding h = hSetEncoding h =<< getFileSystemEncoding
This is also wrong. A "file path" cannot be written to a handle with any hope of correct behavior. If it's to be displayed to the user, a path should be converted to text first, then displayed.
Sure it can. See find(1). Its output can be read as FilePaths once the Handle is set up as above.
If you prefer your program not crash with an encoding error when an arbitrary FilePath is putStr, but instead perhaps output bytes that are not valid in the current encoding, that's also a valid choice. You might be writing a program, like find, that again needs to output any possible FilePath including badly encoded ones.
A program like find(1) has two use cases:
1. Display paths to the user, as text.
2. Provide paths to another program, in the operating system's file path format.
These two goals are in conflict. It is not possible to implement a find(1) that performs both correctly in all locales. The best solution is to choose #2, and always write in the OS format, and hope the user's shell+terminal are capable of rendering it to a reasonable-looking path.
Filesystem.Path.CurrentOS.toText is a nice option if you want validly encoded output though. Thanks for that!
Ah, that's not what toText is for. toText provides a human-readable representation of the path. It's used for things like file managers, where you need to show the user a label which approximates the underlying path. There's no guarantee that the output of toText can be converted back to the original path, especially if it returns a Left.

John Millikin wrote:
Perhaps I'm misunderstanding, but the definition of 'withFilePath' you provided is definitely locale-dependent. Unless getFileSystemEncoding is constant?
I think/hope it's locale dependent, but undecodable bytes are remapped, so as long as the system's locale doesn't change, reading a FilePath with the encoding and then writing it back out should always reproduce the same bytes.
Filesystem.Path.CurrentOS.toText is a nice option if you want validly encoded output though. Thanks for that!
Ah, that's not what toText is for. toText provides a human-readable representation of the path. It's used for things like file managers, where you need to show the user a label which approximates the underlying path. There's no guarantee that the output of toText can be converted back to the original path, especially if it returns a Left.
Yes, that's what I meant. :) -- see shy jo

On Sun, Feb 05, 2012 at 07:17:32PM -0800, John Millikin wrote:
That was my understanding also, then QuickCheck found a counter-example. It turns out that there are cases where a valid path cannot be roundtripped in the GHC 7.2 encoding.
This is fixed in GHC 7.4.1.
Thanks
Ian

On 06/02/2012 20:32, Ian Lynagh wrote:
On Sun, Feb 05, 2012 at 07:17:32PM -0800, John Millikin wrote:
That was my understanding also, then QuickCheck found a counter-example. It turns out that there are cases where a valid path cannot be roundtripped in the GHC 7.2 encoding.
This is fixed in GHC 7.4.1.
I think we forgot to mention it in the release notes. Roundtripping of FilePath is now fully supported. The commit in question is this:
commit 7e59b6d50ec4a4400e8730bfd8cfc471c1873702
Author: Max Bolingbroke

On Tue, Feb 7, 2012 at 4:24 AM, Simon Marlow
Separately the unix package added support for undecoded FilePaths (RawFilePath), but unfortunately at the same time we started using a new extension in GHC 7.4.1 (CApiFFI), which we decided not to document because it was still experimental:
Hi, from my reading, it looks like 'capi' means from a logical perspective, "Don't assume the object is addressable, but rather that the standard C syntax for calling this routine will expand into correct code when compiled with the stated headers".
So it may be implemented by, say, creating a stub .c file that includes the headers and creates a wrapper around each one, or, when compiling via C, actually including the given headers and the function calls in the code.
I ask because jhc needs such a feature (a very hacky method is used now: the rts knows some problematic functions and includes hacky wrappers and #defines), and I'll make it behave just like the GHC one when possible.
John

On 08/02/2012 02:26, John Meacham wrote:
On Tue, Feb 7, 2012 at 4:24 AM, Simon Marlow wrote:
Separately the unix package added support for undecoded FilePaths (RawFilePath), but unfortunately at the same time we started using a new extension in GHC 7.4.1 (CApiFFI), which we decided not to document because it was still experimental:
Hi, from my reading, it looks like 'capi' means from a logical perspective,
"Don't assume the object is addressable, but rather that the standard C syntax for calling this routine will expand into correct code when compiled with the stated headers"
So, it may be implemented by say creating a stub .c file that includes the headers and creates a wrapper around each one or when compiling via C, actually including the given headers and the function calls in the code.
Yes, that's exactly it. In GHC we create a stub (even when compiling via C, for simplicity of implementation). Cheers, Simon
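As an illustration of what such an import looks like, here is a minimal sketch using the capi calling convention in the form later documented in the GHC User's Guide. At the time of this thread the extension was experimental and undocumented, so treat the exact syntax as an assumption for GHC 7.4.1; the wrapper name absViaC is hypothetical.

```haskell
{-# LANGUAGE CApiFFI #-}

import Foreign.C.Types (CInt(..))

-- With the capi convention, GHC generates a C stub that includes stdlib.h
-- and calls abs using ordinary C call syntax, instead of assuming that abs
-- is an addressable symbol with a known ABI.
foreign import capi "stdlib.h abs" c_abs :: CInt -> CInt

-- | Pure wrapper with plain Int on both sides (values are narrowed to the
-- C int range on the way in).
absViaC :: Int -> Int
absViaC = fromIntegral . c_abs . fromIntegral
```

The same declaration with ccall would also work for abs; the capi form matters for names that are macros or inline functions in the header, where there is no symbol to link against.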
I ask because jhc needs such a feature (very hacky method used now, the rts knows some problematic functions and includes hacky wrappers and #defines.) and I'll make it behave just like the ghc one when possible.
John

On Tue, Feb 07, 2012 at 06:26:48PM -0800, John Meacham wrote:
Hi, from my reading, it looks like 'capi' means from a logical perspective,
"Don't assume the object is addressable, but rather that the standard C syntax for calling this routine will expand into correct code when compiled with the stated headers"
So, it may be implemented by say creating a stub .c file that includes the headers and creates a wrapper around each one or when compiling via C, actually including the given headers and the function calls in the code.
That sounds right. It basically means you don't have to write the C stubs yourself, which is nice because (a) doing so is a pain, and (b) when the foreign import is inside 2 or 3 CPP conditionals it's even more of a pain to replicate them correctly in the C stub. Unfortunately, there are cases where C doesn't get all the type information it needs, e.g.: http://hackage.haskell.org/trac/ghc/ticket/2979#comment:14 but I'm not sure what the best fix is.
I ask because jhc needs such a feature (very hacky method used now, the rts knows some problematic functions and includes hacky wrappers and #defines.) and I'll make it behave just like the ghc one when possible.
Great! Thanks Ian

On Wed, Feb 8, 2012 at 10:56 AM, Ian Lynagh
That sounds right. It basically means you don't have to write the C stubs yourself, which is nice because (a) doing so is a pain, and (b) when the foreign import is inside 2 or 3 CPP conditionals it's even more of a pain to replicate them correctly in the C stub.
Unfortunately, there are cases where C doesn't get all the type information it needs, e.g.: http://hackage.haskell.org/trac/ghc/ticket/2979#comment:14 but I'm not sure what the best fix is.
I believe jhc's algorithm works in this case. Certain type constructors have C types associated with them; in particular, many newtypes have C types that are different from their contents. So my routine that finds out whether an argument is suitable for FFIing returns both a C type and the underlying raw type (Int# etc.) that the type maps to.
The algorithm checks whether the current type constructor has an associated C type. If it doesn't, it expands the newtype one layer and tries again. If it does have a C type, it still recurses to get at the underlying raw type, but then replaces the C type with whatever was attached to the newtype.
In the case of 'Ptr a' it recursively runs the algorithm on the argument to 'Ptr', then takes that C type and appends a '*' to it. If the argument to 'Ptr' is not an FFIable type, then it just returns HsPtr as the C type.
Since CSigSet has "sigset_t" associated with it, 'Ptr CSigSet' ends up turning into 'sigset_t *' in the generated code. (Ptr (Ptr CChar)) turns into char** and so forth.
An interesting quirk of this scheme is that it faithfully translates the perhaps unfortunate idiom of
    newtype Foo_t = Foo_t (Ptr Foo_t)
into foo_t************ (an infinite chain of pointers), which is actually what the user specified. :) I added a check for recursive newtypes that chops the recursion to catch this, as people seem to utilize it.
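The lookup described above can be sketched over a toy type representation. This is not jhc's code: the Ty data type, the association table, the raw type names ("Int8#", "Word#", "Addr#"), and the choice to model CChar and CSigSet as newtypes over raw types are all assumptions made for illustration, and the guard against recursive newtypes is left out.

```haskell
import Data.Maybe (fromMaybe)

-- | Toy model of types seen by the FFI checker (hypothetical).
data Ty
  = Raw String String   -- ^ raw type name and its default C type
  | Newtype String Ty   -- ^ newtype name and the type it wraps
  | Ptr Ty              -- ^ 'Ptr' applied to an argument
  | Abstract String     -- ^ a type that cannot cross the FFI by value

-- | Hypothetical list of type constructors with an associated C type.
cTypeTable :: [(String, String)]
cTypeTable = [("CSigSet", "sigset_t"), ("CChar", "char"), ("CFile", "FILE")]

-- | Returns (C type, raw type), or Nothing if the type is not FFI-able.
ffiType :: Ty -> Maybe (String, String)
ffiType (Raw raw cty) = Just (cty, raw)
ffiType (Abstract _)  = Nothing
ffiType (Newtype name inner) = do
  -- Always recurse to find the raw type; an associated C type, if any,
  -- overrides the C type computed from the contents.
  (innerC, raw) <- ffiType inner
  return (fromMaybe innerC (lookup name cTypeTable), raw)
ffiType (Ptr arg) =
  -- Append '*' to the argument's C type, or fall back to HsPtr when the
  -- argument is not FFI-able.
  Just (maybe "HsPtr" (\(c, _) -> c ++ "*") (ffiType arg), "Addr#")

-- Example types, modelled as newtypes over raw types (an assumption).
cChar, cSigSet :: Ty
cChar   = Newtype "CChar" (Raw "Int8#" "int8_t")
cSigSet = Newtype "CSigSet" (Raw "Word#" "unsigned long")
```

With these definitions, ffiType (Ptr cSigSet) yields the C type "sigset_t*" and ffiType (Ptr (Ptr cChar)) yields "char**", while a pointer to an Abstract type falls back to "HsPtr". A newtype with no table entry simply inherits the C type of its contents.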
I ask because jhc needs such a feature (very hacky method used now, the rts knows some problematic functions and includes hacky wrappers and #defines.) and I'll make it behave just like the ghc one when possible.
Great!
It has now been implemented, shall be in jhc 0.8.1. John

On Thu, Feb 09, 2012 at 04:52:16AM -0800, John Meacham wrote:
Since CSigSet has "sigset_t" associated with it, 'Ptr CSigSet' ends up turning into 'sigset_t *' in the generated code. (Ptr (Ptr CChar)) turns into char** and so forth.
What does the syntax for associating sigset_t with CSigSet look like? Thanks Ian

On Thu, Feb 9, 2012 at 11:23 AM, Ian Lynagh
On Thu, Feb 09, 2012 at 04:52:16AM -0800, John Meacham wrote:
Since CSigSet has "sigset_t" associated with it, 'Ptr CSigSet' ends up turning into 'sigset_t *' in the generated code. (Ptr (Ptr CChar)) turns into char** and so forth.
What does the syntax for associating sigset_t with CSigSet look like?
There currently isn't a user-accessible one, but CSigSet is included in the FFI spec, so having the compiler know about it isn't that bad. In fact, it is how I interpreted the standard. Otherwise, why would CFile be specified if it didn't expand 'Ptr CFile' properly?
I just have a single list of associations that is easy to update at the moment, but a user-definable way is something I want in the future. My current syntax idea is
    data CFile = foreign "stdio.h FILE"
but it doesn't extend easily to 'newtype's. Or maybe a {-# CTYPE "FILE" #-} pragma...
The 'Ptr' trick is useful for more than just pointers; I use the same thing to support native complex numbers. I have
    data Complex_ :: # -> #  -- type function of unboxed types to unboxed types.
and can then do things like 'Complex_ Float64_' to get hardware-supported complex doubles. The expansion happens just like 'Ptr', except that instead of appending '*' when it encounters _Complex, it prepends '_Complex ' (a C99 standard keyword). You can then import primitives like normal (for jhc)
    foreign import primitive "Add" complexPlus :: Complex_ Float64_ -> Complex_ Float64_ -> Complex_ Float64_
and lift it into a data type and add instances for the standard numeric classes if you wish. (I have macros that automate the somewhat repetitive instance creation in lib/jhc/Jhc/Num.m4)
John

On Thu, Feb 09, 2012 at 11:40:28AM -0800, John Meacham wrote:
On Thu, Feb 9, 2012 at 11:23 AM, Ian Lynagh wrote:
On Thu, Feb 09, 2012 at 04:52:16AM -0800, John Meacham wrote:
Since CSigSet has "sigset_t" associated with it, 'Ptr CSigSet' ends up turning into 'sigset_t *' in the generated code. (Ptr (Ptr CChar)) turns into char** and so forth.
What does the syntax for associating sigset_t with CSigSet look like?
There currently isn't a user-accessible one,
My current syntax idea is.
data CFile = foreign "stdio.h FILE"
but it doesn't extend easily to 'newtype's or maybe a {-# CTYPE "FILE" #-} pragma...
I've now implemented this in GHC. For now, the syntax is:
    type    {-# CTYPE "some C type" #-} Foo = ...
    newtype {-# CTYPE "some C type" #-} Foo = ...
    data    {-# CTYPE "some C type" #-} Foo = ...
The magic for (Ptr a) is built in to the compiler.
Thanks
Ian

On Thu, Feb 16, 2012 at 1:20 PM, Ian Lynagh
I've now implemented this in GHC. For now, the syntax is:
    type    {-# CTYPE "some C type" #-} Foo = ...
    newtype {-# CTYPE "some C type" #-} Foo = ...
    data    {-# CTYPE "some C type" #-} Foo = ...
The magic for (Ptr a) is built in to the compiler.
Heh. I just added it for jhc too, with the exact same syntax. :) The difference is that I do not allow them for 'type' declarations, as desugaring of types happens very early in compilation, and it feels sort of wrong to give type synonyms meaning, like I'm breaking referential transparency or something.
I also allow foreign header declarations just like with ccall:
    data {-# CTYPE "stdio.h FILE" #-} CFile
will mean that 'stdio.h' needs to be included for FILE to be declared.
John
participants (5)
- Ian Lynagh
- Joey Hess
- John Meacham
- John Millikin
- Simon Marlow