
Hi,

I'm just investigating what we can do about a problem with darcs' handling of non-ASCII filenames on GHC 7.2. The issue is apparently that as of GHC 7.2, getDirectoryContents now tries to decode filenames in the current locale, rather than converting a stream of bytes into characters: http://bugs.darcs.net/issue2095

I found an old thread on the subject (http://www.haskell.org/pipermail/haskell-cafe/2009-June/062795.html) and some GHC tickets (e.g. http://hackage.haskell.org/trac/ghc/ticket/3300).

Can anyone point me at the rationale and details of the change and/or suggest workarounds?

Cheers,

Ganesh

Hi Ganesh,
On 1 November 2011 07:16, Ganesh Sittampalam
Can anyone point me at the rationale and details of the change and/or suggest workarounds?
This is my implementation of Python's PEP 383 [1] for Haskell. IMHO this behaviour is much closer to what users expect. For example, getDirectoryContents "." >>= print shows Unicode filenames properly. As a result of this change we were able to close quite a few outstanding GHC bugs.

PEP-383 behaviour always does the right thing on setups with a consistent text encoding for filenames, command line arguments and the like (Windows, or *nix where the system locale is e.g. UTF-8 and all filenames are encoded in that locale). However, there are legitimate use cases where the program has more information about how something is encoded than just the system locale, and in those cases you should *encode* the String from getDirectoryContents using GHC.IO.Encoding.fileSystemEncoding and then *decode* it with your preferred TextEncoding. In your case I think you want GHC.IO.Encoding.latin1.

You can use a helper function like this to make this easier:

reencode :: TextEncoding -> TextEncoding -> String -> String
reencode from_enc to_enc s = unsafeLocalState $
  GHC.Foreign.withCStringLen from_enc s (GHC.Foreign.peekCStringLen to_enc)

Hope that helps,

Max

[1] http://www.python.org/dev/peps/pep-0383/
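To make the workaround concrete, here is a compilable sketch of the whole recipe, repeating the helper so the example stands alone. It assumes GHC 7.2's GHC.IO.Encoding.fileSystemEncoding (later base versions expose getFileSystemEncoding :: IO TextEncoding instead):

------------------
-- Sketch: list the current directory, reinterpreting each name as
-- latin1 rather than the locale encoding, per the suggestion above.
import Foreign.Marshal (unsafeLocalState)
import qualified GHC.Foreign
import GHC.IO.Encoding (fileSystemEncoding)
import System.Directory (getDirectoryContents)
import System.IO (TextEncoding, latin1)

-- Encode with one TextEncoding, decode with another.
reencode :: TextEncoding -> TextEncoding -> String -> String
reencode from_enc to_enc s = unsafeLocalState $
  GHC.Foreign.withCStringLen from_enc s (GHC.Foreign.peekCStringLen to_enc)

main :: IO ()
main = do
  names <- getDirectoryContents "."
  -- Recover the raw bytes, then reinterpret those bytes as latin1.
  mapM_ (putStrLn . reencode fileSystemEncoding latin1) names
------------------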

Hi Max, On 01/11/2011 10:23, Max Bolingbroke wrote:
This is my implementation of Python's PEP 383 [1] for Haskell.
IMHO this behaviour is much closer to what users expect.For example, getDirectoryContents "." >>= print shows Unicode filenames properly. As a result of this change we were able to close quite a few outstanding GHC bugs.
Many thanks for your reply and all the subsequent followups and bugfixing. The workaround you propose seems a little complex, and it might be a bit problematic that 100% roundtripping can't be guaranteed even once your fix is applied.

Do you think it would be reasonable/feasible for darcs to have its own version of getDirectoryContents that doesn't try to do any translation in the first place? It might make sense to make a separate package that others could use too.

BTW I was trying to find the patch where this changed but couldn't - was it a consequence of https://github.com/ghc/packages-base/commit/509f28cc93b980d30aca37008cbe66c6... ?

Cheers,

Ganesh

On 2 November 2011 21:46, Ganesh Sittampalam
The workaround you propose seems a little complex and it might be a bit problematic that 100% roundtripping can't be guaranteed even once your fix is applied.
I can understand this perspective, although the roundtripping as implemented will only fail in certain very obscure cases.
Do you think it would be reasonable/feasible for darcs to have its own version of getDirectoryContents that doesn't try to do any translation in the first place? It might make sense to make a separate package that others could use too.
Yes, absolutely! I think a very valuable contribution would be a package providing filesystem functions (with an abstract FilePath type) that is portable across Windows, OS X and *nix-like OSes. This would be a useful package for anyone who wants to avoid the performance (and very rare correctness) problems associated with roundtripping.
BTW I was trying to find the patch where this changed but couldn't - was it a consequence of https://github.com/ghc/packages-base/commit/509f28cc93b980d30aca37008cbe66c6... ?
That is the main patch. I had to patch the libraries as well to make use of the changed encodings. See (for example) https://github.com/ghc/packages-unix/commit/bb8a27d14a63fcd126a924d32c69b769... Max

On Thu, Nov 03, 2011 at 09:41:32AM +0000, Max Bolingbroke wrote:
On 2 November 2011 21:46, Ganesh Sittampalam
wrote: The workaround you propose seems a little complex and it might be a bit problematic that 100% roundtripping can't be guaranteed even once your fix is applied.
I can understand this perspective, although the roundtripping as implemented will only fail in certain very obscure cases.
Depending on the software one is writing, any failure, no matter how obscure, would not be acceptable. Think of a file browser, or backup software. So, yes, a non-destructive way of reading directories is important. At least one Linux distribution (I think Gentoo) actually has invalid pathnames in the filesystem in order to make sure that software that is part of the system will be able to handle them.

For 'harchive' (which I am still gradually working on), I had to write my own version of readDirStream out of Posix that returns both the path and the inode number (FileID). Most filesystems on Linux are vastly faster if you 'stat' the entries of a directory in inode order rather than the order they were returned by readdir. In this sense, I'm not all that concerned if the regular getDirectoryContents isn't round trippable.

David
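The readDirStream replacement David describes needs FFI access to readdir's d_ino field, which is more than a short sketch can show, but the stat-in-inode-order idea itself is easy to illustrate. In the following sketch, statInInodeOrder is an illustrative name, and the (path, inode) pairs are assumed to come from such a lower-level directory reader:

------------------
import Data.List (sortBy)
import Data.Ord (comparing)
import System.Posix.Files (FileStatus, getSymbolicLinkStatus)
import System.Posix.Types (FileID)

-- Stat directory entries in ascending inode order, which tends to be
-- much faster than readdir order on many Linux filesystems.
statInInodeOrder :: [(FilePath, FileID)] -> IO [(FilePath, FileStatus)]
statInInodeOrder entries =
  mapM (\(p, _ino) -> fmap ((,) p) (getSymbolicLinkStatus p))
       (sortBy (comparing snd) entries)
------------------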

On Tue, Nov 1, 2011 at 5:16 AM, Ganesh Sittampalam
I'm just investigating what we can do about a problem with darcs' handling of non-ASCII filenames on GHC 7.2.
The issue is apparently that as of GHC 7.2, getDirectoryContents now tries to decode filenames in the current locale, rather than converting a stream of bytes into characters: http://bugs.darcs.net/issue2095
I found an old thread on the subject: http://www.haskell.org/pipermail/haskell-cafe/2009-June/062795.html and some GHC tickets (e.g. http://hackage.haskell.org/trac/ghc/ticket/3300)
Can anyone point me at the rationale and details of the change and/or suggest workarounds?
You could try using system-fileio [1], but by reading its source code I guess that it may have the same bug (since it tries to decode what the directory package gives). I'm CCing John Millikin, its maintainer. Cheers, [1] http://hackage.haskell.org/packages/archive/system-fileio/0.3.2.1/doc/html/F... -- Felipe.

You're right -- many parts of system-fileio (the parts based on
"directory") are broken due to this. I'll need to update it to call
the posix/win32 functions directly.
IMO, the GHC behavior in <=7.0 is ugly, but the behavior in 7.2 is
fundamentally wrong.
Different OSes have different definitions of a "file path". A Windows
path is a sequence of Unicode characters. A Linux/BSD path is a
sequence of bytes. I'm not certain what OSX does, but I believe it
uses bytes also.
In GHC <= 7.0, the String type was used for both sorts of paths, with
interpretation of the contents being OS-dependent. This sort of works,
because it's possible to represent both byte- and text-based paths in
String.
GHC 7.2 assumes Linux/BSD paths are text, which 1) silently breaks all
existing code and 2) makes it impossible to fix within the given API.
On Tue, Nov 1, 2011 at 08:48, Felipe Almeida Lessa
On Tue, Nov 1, 2011 at 5:16 AM, Ganesh Sittampalam
wrote: I'm just investigating what we can do about a problem with darcs' handling of non-ASCII filenames on GHC 7.2.
The issue is apparently that as of GHC 7.2, getDirectoryContents now tries to decode filenames in the current locale, rather than converting a stream of bytes into characters: http://bugs.darcs.net/issue2095
I found an old thread on the subject: http://www.haskell.org/pipermail/haskell-cafe/2009-June/062795.html and some GHC tickets (e.g. http://hackage.haskell.org/trac/ghc/ticket/3300)
Can anyone point me at the rationale and details of the change and/or suggest workarounds?
You could try using system-fileio [1], but by reading its source code I guess that it may have the same bug (since it tries to decode what the directory package gives). I'm CCing John Millikin, its maintainer.
Cheers,
[1] http://hackage.haskell.org/packages/archive/system-fileio/0.3.2.1/doc/html/F...
-- Felipe.

Hi John,
On 1 November 2011 17:14, John Millikin
GHC 7.2 assumes Linux/BSD paths are text, which 1) silently breaks all existing code and 2) makes it impossible to fix within the given API.
Please can you give an example of code that is broken with the new behaviour? The PEP 383 mechanism will unavoidably break *some* code but I don't expect there to be much of it. One thing that most likely *will* be broken is code that attempts to reinterpret a String as a "byte string" - i.e. assuming that it was decoded using latin1, but I expect that such code can just be deleted when you upgrade to 7.2.

As I pointed out earlier in the thread you can recover the old behaviour if you really want it by manually reencoding the strings, so I would dispute the claim that it is "impossible to fix within the given API".

Cheers,

Max

On Tue, Nov 1, 2011 at 11:43, Max Bolingbroke
Hi John,
On 1 November 2011 17:14, John Millikin
wrote: GHC 7.2 assumes Linux/BSD paths are text, which 1) silently breaks all existing code and 2) makes it impossible to fix within the given API.
Please can you give an example of code that is broken with the new behaviour? The PEP 383 mechanism will unavoidably break *some* code but I don't expect there to be much of it. One thing that most likely *will* be broken is code that attempts to reinterpret a String as a "byte string" - i.e. assuming that it was decoded using latin1, but I expect that such code can just be deleted when you upgrade to 7.2.
Examples of broken code are Darcs, my system-fileio, and likely anything else which needs to open Unicode-named files in both 7.0 and 7.2. As a quick example, consider the case of files with encodings different from the user's locale. This is *very* common, especially when interoperating with foreign Windows systems.

$ ghci-7.0.4
GHC> import System.Directory
GHC> createDirectory "path-test"
GHC> writeFile "path-test/\xA1\xA5" "hello\n"
GHC> writeFile "path-test/\xC2\xA1\xC2\xA5" "world\n"
GHC> ^D

$ ghci-7.2.1
GHC> import System.Directory
GHC> getDirectoryContents "path-test"
["\161\165","\61345\61349","..","."]
GHC> readFile "path-test/\161\165"
"world\n"
GHC> readFile "path-test/\61345\61349"
*** Exception: path-test/: openFile: does not exist (No such file or directory)
As I pointed out earlier in the thread you can recover the old behaviour if you really want it by manually reencoding the strings, so I would dispute the claim that it is "impossible to fix within the given API".
Please describe how I can, in GHC 7.2, read the contents of the file "path-test/\xA1\xA5" without changing my locale. As far as I can tell, there is no way to do this using the standard libraries. I would have to fall back to the "unix" package, or even FFI imports, to open that file.

On 1 November 2011 20:13, John Millikin
$ ghci-7.2.1
GHC> import System.Directory
GHC> getDirectoryContents "path-test"
["\161\165","\61345\61349","..","."]
GHC> readFile "path-test/\161\165"
"world\n"
GHC> readFile "path-test/\61345\61349"
*** Exception: path-test/: openFile: does not exist (No such file or directory)
Thanks for the example! I can reproduce this on Linux (haven't tried OS X or Windows) and AFAICT this behaviour is just a straight-up bug and is *not* intended behaviour. I'm not sure why the tests aren't catching it. I'm looking into it now. Max

On 2 November 2011 09:37, Max Bolingbroke
On 1 November 2011 20:13, John Millikin
wrote:

$ ghci-7.2.1
GHC> import System.Directory
GHC> getDirectoryContents "path-test"
["\161\165","\61345\61349","..","."]
GHC> readFile "path-test/\161\165"
"world\n"
GHC> readFile "path-test/\61345\61349"
*** Exception: path-test/: openFile: does not exist (No such file or directory)
Thanks for the example! I can reproduce this on Linux (haven't tried OS X or Windows) and AFAICT this behaviour is just a straight-up bug and is *not* intended behaviour. I'm not sure why the tests aren't catching it.
I've tracked it down and this bug arises in the following situation:

1. You are not running on Windows
2. You are attempting to encode a string containing the private-use escape codepoints
3. You are using an iconv (such as the one in GNU libc) that, in contravention of the Unicode standard, does not signal EILSEQ if surrogate codepoints are encountered in a non-UTF16 input

I've got a patch that will work around the issue in most situations by avoiding the iconv code path. With the patch everything will work OK as long as the system locale is one that we have a native-Haskell decoder for (i.e. basically UTF-8). So you will still be able to get the broken behaviour if the above 3 conditions are met AND your system locale is not UTF-8.

I think the only way to fix this last case in general is to fix iconv itself, so I'm going to see if I can get a patch upstream. Fixing it for people with UTF-8 locales should be enough for 99% of users, though.

Max

On 2 November 2011 13:53, Max Bolingbroke
I think the only way to fix this last case in general is to fix iconv itself, so I'm going to see if I can get a patch upstream. Fixing it for people with UTF-8 locales should be enough for 99% of users, though.
One last update on this: I've found the cause of the problem in the GNU iconv source code and submitted a bug report. I've also found out that with my patch the problem should be fixed in almost every case (not just 99%!) because GNU iconv will correctly reject surrogates in the UTF32le<->locale encoding conversion process with EILSEQ for every non-UTF8 locale encoding that I looked at - even UTF16 and UTF32!

So in conclusion I think this issue is totally fixed. Please let me know if you encounter any other problems.

Max

On Wed, Nov 2, 2011 at 06:53, Max Bolingbroke
I've got a patch that will work around the issue in most situations by avoiding the iconv code path. With the patch everything will work OK as long as the system locale is one that we have a native-Haskell decoder for (i.e. basically UTF-8). So you will still be able to get the broken behaviour if the above 3 conditions are met AND your system locale is not UTF-8.
What package does this patch -- "unix", "directory", something else?
I think the only way to fix this last case in general is to fix iconv itself, so I'm going to see if I can get a patch upstream. Fixing it for people with UTF-8 locales should be enough for 99% of users, though.
Maybe I'm misunderstanding, but it sounds like you're still trying to treat posix file paths as text. There should not be any iconv or locales or anything involved in looking up a posix file path.

On 2 November 2011 17:15, John Millikin
What package does this patch -- "unix", "directory", something else?
The "base" package. The problem lay in the implementation of GHC.IO.Encoding.fileSystemEncoding on non-Windows OSes.
Maybe I'm misunderstanding, but it sounds like you're still trying to treat posix file paths as text. There should not be any iconv or locales or anything involved in looking up a posix file path.
The thing is that on every non-Unix OS paths *can* be interpreted as text, and people expect them to be. In fact, even on Unix most programs/frameworks interpret them as text - e.g. IIRC Qt's QString class is used for filenames in that framework, and if you view filenames in an end-user app like Nautilus it obviously decodes them in the current locale for presentation.

Paths as text is just what people expect, and is grandfathered into the Haskell libraries itself as "type FilePath = String". PEP-383 behaviour is (I think) a good way to satisfy this expectation while still not sacrificing the ability to deal with files that have names encoded in some way other than the locale encoding.

(Perhaps if Haskell had an abstract FilePath data type rather than FilePath = String we could do something different. But it's not clear if we could, without also having ugliness like getArgs :: IO [Byte])

Cheers,

Max

FYI: I just released new versions of system-filepath and
system-fileio, which attempt to work around the changes in GHC 7.2.
On Wed, Nov 2, 2011 at 11:55, Max Bolingbroke
Maybe I'm misunderstanding, but it sounds like you're still trying to treat posix file paths as text. There should not be any iconv or locales or anything involved in looking up a posix file path.
The thing is that on every non-Unix OS paths *can* be interpreted as text, and people expect them to be. In fact, even on Unix most programs/frameworks interpret them as text - e.g. IIRC QT's QString class is used for filenames in that framework, and if you view filenames in an end-user app like Nautilus it obviously decodes them in the current locale for presentation.
There is a difference between how paths are rendered to users, and how they are handled by applications. Applications *must* use whatever the operating system says a path is. If a path is bytes, they must use bytes. If a path is text, they must use text. How they present paths to the user is a matter of user interface design.

For what it's worth, on my Ubuntu system, Nautilus ignores the locale and just treats all paths as either UTF8 or invalid. To me, this seems like the most reasonable option; the concept of "locale encoding" is entirely vestigial, and should only be used in certain specialized cases.
Paths as text is just what people expect, and is grandfathered into the Haskell libraries itself as "type FilePath = String". PEP-383 behaviour is (I think) a good way to satisfy this expectation while still not sacrificing the ability to deal with files that have names encoded in some way other than the locale encoding.
Paths as text is what *Windows* programmers expect. Paths as bytes is what's expected by programmers on non-Windows OSes, including Linux and OS X. I'm not saying one is inherently better than the other, but considering that various UNIX and UNIX-like operating systems have been using byte-based paths for near on forty years now, trying to abolish them by redefining the type is not a useful action.
(Perhaps if Haskell had an abstract FilePath data type rather than FilePath = String we could do something different.
This is the general purpose of my system-filepath package, which provides a set of generic modifications, applicable to paths from various OS families.
But it's not clear if we could, without also having ugliness like getArgs :: IO [Byte])
We *ought* to have getArgs :: IO [ByteString], at least on POSIX systems.

It's totally OK if high-level packages like "directory" want to hide details behind some nice abstractions. But the low-level libraries, like "base" and "unix" and "Win32", must must must provide direct low-level access to the operating system's APIs. The only other option is to re-implement half of the standard library using FFI bindings, which is ugly (for file/directory manipulation) or nearly impossible (for opening handles).

If you're going to make all the System.IO stuff use text, at least give us an escape hatch. The "unix" package is ideally suited, as it's already inherently OS-specific. Something like this would be perfect:

------------------
System.Posix.File.openHandle :: CString -> IOMode -> IO Handle
System.Posix.File.rename :: CString -> CString -> IO ()
------------------

On 6 November 2011 04:14, John Millikin
For what it's worth, on my Ubuntu system, Nautilus ignores the locale and just treats all paths as either UTF8 or invalid. To me, this seems like the most reasonable option; the concept of "locale encoding" is entirely vestigial, and should only be used in certain specialized cases.
Unfortunately non-UTF8 locale encodings are seen in practice quite often. I'm not sure about Linux, but certainly lots of Windows systems are configured with a locale encoding like GBK or Big5.
Paths as text is what *Windows* programmers expect. Paths as bytes is what's expected by programmers on non-Windows OSes, including Linux and OS X.
IIRC paths on OS X are guaranteed to be valid UTF-8. The only platform that uses bytes for paths (that we care about) is Linux.
I'm not saying one is inherently better than the other, but considering that various UNIX and UNIX-like operating systems have been using byte-based paths for near on forty years now, trying to abolish them by redefining the type is not a useful action.
We have to:

1. Provide an API that makes sense on all our supported OSes
2. Have getArgs :: IO [String]
3. Have it such that if you go to your console and write (./MyHaskellProgram 你好) then getArgs tells you ["你好"]

Given these constraints I don't see any alternative to PEP-383 behaviour.
If you're going to make all the System.IO stuff use text, at least give us an escape hatch. The "unix" package is ideally suited, as it's already inherently OS-specific. Something like this would be perfect:
You can already do this with the implemented design. We have:

openFile :: FilePath -> IO Handle

The FilePath will be encoded in the fileSystemEncoding. On Unix this will have PEP383 roundtripping behaviour. So if you want openFile' :: [Byte] -> IO Handle you can write something like this:

escape = map (\b -> if b < 128 then chr b else chr (0xEF00 + b))
openFile' = openFile . escape

The bytes that reach the API call will be exactly the ones you supply. (You can also implement "escape" by just encoding the [Byte] with the fileSystemEncoding).

Likewise, if you have a String and want to get the [Byte] we decoded it from, you just need to encode the String again with the fileSystemEncoding.

If this is not enough for you please let me know, but it seems to me that it covers all your use cases, without any need to reimplement the FFI bindings.

Max
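Spelled out with imports and concrete types, the escape hatch looks like the following sketch. openFileBytes is an illustrative name rather than a library function, and the trick relies on the PEP-383 behaviour of fileSystemEncoding described above:

------------------
import Data.Char (chr)
import Data.Word (Word8)
import System.IO (Handle, IOMode, openFile)

-- Map each byte to the Char that fileSystemEncoding encodes back to
-- exactly that byte: ASCII passes through; anything else goes via the
-- 0xEF00 private-use escape range.
escape :: [Word8] -> FilePath
escape = map toChar
  where
    toChar b
      | b < 128   = chr (fromIntegral b)
      | otherwise = chr (0xEF00 + fromIntegral b)

-- Open a file whose name is an arbitrary byte sequence.
openFileBytes :: [Word8] -> IOMode -> IO Handle
openFileBytes bytes mode = openFile (escape bytes) mode
------------------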

2011/11/6 Max Bolingbroke
On 6 November 2011 04:14, John Millikin
wrote: For what it's worth, on my Ubuntu system, Nautilus ignores the locale and just treats all paths as either UTF8 or invalid. To me, this seems like the most reasonable option; the concept of "locale encoding" is entirely vestigial, and should only be used in certain specialized cases.
Unfortunately non-UTF8 locale encodings are seen in practice quite often. I'm not sure about Linux, but certainly lots of Windows systems are configured with a locale encoding like GBK or Big5.
This doesn't really matter for file paths, though. The Win32 file API uses wide-character functions, which ought to work with Unicode text regardless of what the user set their locale to.
Paths as text is what *Windows* programmers expect. Paths as bytes is what's expected by programmers on non-Windows OSes, including Linux and OS X.
IIRC paths on OS X are guaranteed to be valid UTF-8. The only platform that uses bytes for paths (that we care about) is Linux.
UTF-8 is bytes. It can be treated as text in some cases, but it's better to think about it as bytes.
I'm not saying one is inherently better than the other, but considering that various UNIX and UNIX-like operating systems have been using byte-based paths for near on forty years now, trying to abolish them by redefining the type is not a useful action.
We have to:

1. Provide an API that makes sense on all our supported OSes
2. Have getArgs :: IO [String]
3. Have it such that if you go to your console and write (./MyHaskellProgram 你好) then getArgs tells you ["你好"]
Given these constraints I don't see any alternative to PEP-383 behaviour.
Requirement #1 directly contradicts #2 and #3.
If you're going to make all the System.IO stuff use text, at least give us an escape hatch. The "unix" package is ideally suited, as it's already inherently OS-specific. Something like this would be perfect:
You can already do this with the implemented design. We have:
openFile :: FilePath -> IO Handle
The FilePath will be encoded in the fileSystemEncoding. On Unix this will have PEP383 roundtripping behaviour. So if you want openFile' :: [Byte] -> IO Handle you can write something like this:
escape = map (\b -> if b < 128 then chr b else chr (0xEF00 + b))
openFile' = openFile . escape
The bytes that reach the API call will be exactly the ones you supply. (You can also implement "escape" by just encoding the [Byte] with the fileSystemEncoding).
Likewise, if you have a String and want to get the [Byte] we decoded it from, you just need to encode the String again with the fileSystemEncoding.
If this is not enough for you please let me know, but it seems to me that it covers all your use cases, without any need to reimplement the FFI bindings.
This is not enough, since these strings are still being passed through the potentially (and in 7.2.1, actually) broken path encoder. If the "unix" package had defined functions which operate on the correct type (CString / ByteString), then it would not be necessary to patch "base". I could just call the POSIX functions from system-fileio and be done with it.

And this solution still assumes that there is such a thing as a filesystem encoding in POSIX. There isn't. A file path is an arbitrary sequence of bytes, with no significance except what the application user interface decides.

It seems to me that there are two ways to provide bindings to operating system functionality. One is to give low-level access, using abstractions as close to the real API as possible. In this model, "unix" would provide functions like [[ rename :: ByteString -> ByteString -> IO () ]], and I would know that it's not going to do anything weird to the parameters. Another is to pretend that operating systems are all the same, and can have their APIs abstracted away to some hypothetical virtual system. This model just makes it more difficult for programmers to access the OS, as they have to learn both the standard API, *and* whatever weird thing has been layered on top of it.
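To illustrate the first model, a byte-level binding can be as thin as the following sketch. This is not the unix package's actual API; it simply wraps the C library's rename and reports failure through errno:

------------------
{-# LANGUAGE ForeignFunctionInterface #-}
import Data.ByteString (ByteString, useAsCString)
import Foreign.C.Error (throwErrnoIfMinus1_)
import Foreign.C.String (CString)
import Foreign.C.Types (CInt(..))

-- int rename(const char *old, const char *new), from <stdio.h>.
foreign import ccall unsafe "stdio.h rename"
  c_rename :: CString -> CString -> IO CInt

-- Byte-level rename: the bytes are handed to the OS untouched.
rename :: ByteString -> ByteString -> IO ()
rename old new =
  useAsCString old $ \cOld ->
    useAsCString new $ \cNew ->
      throwErrnoIfMinus1_ "rename" (c_rename cOld cNew)
------------------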

Quoth John Millikin
One is to give low-level access, using abstractions as close to the real API as possible. In this model, "unix" would provide functions like [[ rename :: ByteString -> ByteString -> IO () ]], and I would know that it's not going to do anything weird to the parameters.
I like that a lot. In the "PEP" I see the phrase "in the same way that the C interfaces can ignore the encoding" - and the above low level access seems to belong to that same non-problematic category. Donn

On 06/11/2011 16:56, John Millikin wrote:
2011/11/6 Max Bolingbroke
: On 6 November 2011 04:14, John Millikin
wrote: For what it's worth, on my Ubuntu system, Nautilus ignores the locale and just treats all paths as either UTF8 or invalid. To me, this seems like the most reasonable option; the concept of "locale encoding" is entirely vestigial, and should only be used in certain specialized cases.
Unfortunately non-UTF8 locale encodings are seen in practice quite often. I'm not sure about Linux, but certainly lots of Windows systems are configured with a locale encoding like GBK or Big5.
This doesn't really matter for file paths, though. The Win32 file API uses wide-character functions, which ought to work with Unicode text regardless of what the user set their locale to.
Paths as text is what *Windows* programmers expect. Paths as bytes is what's expected by programmers on non-Windows OSes, including Linux and OS X.
IIRC paths on OS X are guaranteed to be valid UTF-8. The only platform that uses bytes for paths (that we care about) is Linux.
UTF-8 is bytes. It can be treated as text in some cases, but it's better to think about it as bytes.
I'm not saying one is inherently better than the other, but considering that various UNIX and UNIX-like operating systems have been using byte-based paths for near on forty years now, trying to abolish them by redefining the type is not a useful action.
We have to:

1. Provide an API that makes sense on all our supported OSes
2. Have getArgs :: IO [String]
3. Have it such that if you go to your console and write (./MyHaskellProgram 你好) then getArgs tells you ["你好"]
Given these constraints I don't see any alternative to PEP-383 behaviour.
Requirement #1 directly contradicts #2 and #3.
If you're going to make all the System.IO stuff use text, at least give us an escape hatch. The "unix" package is ideally suited, as it's already inherently OS-specific. Something like this would be perfect:
You can already do this with the implemented design. We have:
openFile :: FilePath -> IO Handle
The FilePath will be encoded in the fileSystemEncoding. On Unix this will have PEP383 roundtripping behaviour. So if you want openFile' :: [Byte] -> IO Handle you can write something like this:
escape = map (\b -> if b < 128 then chr b else chr (0xEF00 + b))
openFile' = openFile . escape
The bytes that reach the API call will be exactly the ones you supply. (You can also implement "escape" by just encoding the [Byte] with the fileSystemEncoding).
Likewise, if you have a String and want to get the [Byte] we decoded it from, you just need to encode the String again with the fileSystemEncoding.
If this is not enough for you please let me know, but it seems to me that it covers all your use cases, without any need to reimplement the FFI bindings.
This is not enough, since these strings are still being passed through the potentially (and in 7.2.1, actually) broken path encoder.
I think you might be misunderstanding how the new API works. Basically, imagine a reversible transformation:

encode :: String -> [Word8]
decode :: [Word8] -> String

this transformation is applied in the appropriate direction by the IO library to translate filesystem paths into FilePath and vice versa. No information is lost; furthermore you can apply the transformation yourself in order to recover the original [Word8] from a String, or to inject your own [Word8] file path. Ok?

All this does is mean that the common case where you want to interpret file system paths as text works with no fuss, without breaking anything in the case when the file system paths are not actually text.

It would probably be better to have an abstract FilePath type and to keep the original bytes, decoding on demand. But that is a big change to the API and would break much more code. One day we'll do this properly; for now we have this, which I think is a pretty reasonable compromise.

Cheers,

Simon
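Concretely, this encode/decode pair can be written with GHC.Foreign and the fileSystemEncoding mentioned earlier in the thread. A sketch follows; the IO in the types is just marshalling overhead, and fileSystemEncoding is the GHC 7.2 name (later bases use getFileSystemEncoding):

------------------
import Data.Word (Word8)
import Foreign.Marshal.Array (peekArray, withArrayLen)
import Foreign.Ptr (castPtr)
import qualified GHC.Foreign
import GHC.IO.Encoding (fileSystemEncoding)

-- Recover the bytes a FilePath was decoded from.
encode :: String -> IO [Word8]
encode s =
  GHC.Foreign.withCStringLen fileSystemEncoding s $ \(ptr, len) ->
    peekArray len (castPtr ptr)

-- Produce the String the IO library would decode these bytes to.
decode :: [Word8] -> IO String
decode bytes =
  withArrayLen bytes $ \len ptr ->
    GHC.Foreign.peekCStringLen fileSystemEncoding (castPtr ptr, len)
------------------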

On Mon, Nov 7, 2011 at 09:02, Simon Marlow
I think you might be misunderstanding how the new API works. Basically, imagine a reversible transformation:
encode :: String -> [Word8]
decode :: [Word8] -> String
this transformation is applied in the appropriate direction by the IO library to translate filesystem paths into FilePath and vice versa. No information is lost; furthermore you can apply the transformation yourself in order to recover the original [Word8] from a String, or to inject your own [Word8] file path.
Ok?
I understand how the API is intended / designed to work; however, the implementation does not actually do this. My argument is that this transformation should be in a high-level library like "directory", and the low-level libraries like "base" or "unix" ought to provide functions which do not transform their inputs. That way, when an error is found in the encoding logic, it can be fixed by just pushing a new version of the affected library to Hackage, instead of requiring a new version of the compiler. I am also not convinced that it is possible to correctly implement either of these functions if their behavior is dependent on the user's locale.
All this does is mean that the common case where you want to interpret file system paths as text works with no fuss, without breaking anything in the case when the file system paths are not actually text.
As mentioned earlier in the thread, this behavior is breaking things. Due to an implementation error, programs compiled with GHC 7.2 on POSIX systems cannot open files unless their paths also happen to be valid text according to their locale. It is very difficult to work around this error, because the paths-are-text logic was placed at a very low level in the library stack.
It would probably be better to have an abstract FilePath type and to keep the original bytes, decoding on demand. But that is a big change to the API and would break much more code. One day we'll do this properly; for now we have this, which I think is a pretty reasonble compromise.
Please understand, I am not arguing against the existence of this encoding layer in general. It's a fine idea for a simplistic high-level filesystem interaction library. But it should be *optional*, not part of the compiler or "base". As implemented in GHC 7.2, this encoding is a complex and untested behavior with no escape hatch.

Simon Marlow wrote:
It would probably be better to have an abstract FilePath type and to keep the original bytes, decoding on demand. But that is a big change to the API and would break much more code. One day we'll do this properly; for now we have this, which I think is a pretty reasonble compromise.
John Millikin wrote:
Please understand, I am not arguing against the existence of this encoding layer in general. It's a fine idea for a simplistic high-level filesystem interaction library. But it should be *optional*, not part of the compiler or "base".
The problem is that Haskell 98 specifies type FilePath = String. In retrospect, we now know that this is too simplistic. But that's what we have right now.
As implemented in GHC 7.2, this encoding is a complex and untested behavior with no escape hatch.
Isn't System.Posix.IO the escape hatch? Even though FilePath is still used there instead of ByteString as it should be, this is the low-level POSIX-specific library. So the old hack of interpreting the lowest 8 bits as bytes makes a lot more sense there. Thanks, Yitz

On Mon, Nov 7, 2011 at 15:39, Yitzchak Gale
The problem is that Haskell 98 specifies type FilePath = String. In retrospect, we now know that this is too simplistic. But that's what we have right now.
This is *a* problem, but not a particularly major one; the definition of paths in GHC 7.0 (text on some systems, bytes on others) is inelegant but workable. The main problem, IMO, is that the semantics of openFile et al changed in a way that is impossible to check for statically, and there was no mention of this in the documentation. It's one thing to make a change which will cause new compilation failures. It's quite another to introduce an undocumented change in important semantics.
As implemented in GHC 7.2, this encoding is a complex and untested behavior with no escape hatch.
Isn't System.Posix.IO the escape hatch?
Even though FilePath is still used there instead of ByteString as it should be, this is the low-level POSIX-specific library. So the old hack of interpreting the lowest 8 bits as bytes makes a lot more sense there.
System.Posix.IO, and the "unix" package in general, also perform the new path encoding/decoding.

On 07/11/2011 17:32, John Millikin wrote:
On Mon, Nov 7, 2011 at 09:02, Simon Marlow
wrote: I think you might be misunderstanding how the new API works. Basically, imagine a reversible transformation:
encode :: String -> [Word8]
decode :: [Word8] -> String
this transformation is applied in the appropriate direction by the IO library to translate filesystem paths into FilePath and vice versa. No information is lost; furthermore you can apply the transformation yourself in order to recover the original [Word8] from a String, or to inject your own [Word8] file path.
Ok?
I understand how the API is intended / designed to work; however, the implementation does not actually do this. My argument is that this transformation should be in a high-level library like "directory", and the low-level libraries like "base" or "unix" ought to provide functions which do not transform their inputs. That way, when an error is found in the encoding logic, it can be fixed by just pushing a new version of the affected library to Hackage, instead of requiring a new version of the compiler.
I am also not convinced that it is possible to correctly implement either of these functions if their behavior is dependent on the user's locale.
All this does is mean that the common case where you want to interpret file system paths as text works with no fuss, without breaking anything in the case when the file system paths are not actually text.
As mentioned earlier in the thread, this behavior is breaking things. Due to an implementation error, programs compiled with GHC 7.2 on POSIX systems cannot open files unless their paths also happen to be valid text according to their locale. It is very difficult to work around this error, because the paths-are-text logic was placed at a very low level in the library stack.
So your objection is that there is a bug? What if we fixed the bug?
It would probably be better to have an abstract FilePath type and to keep the original bytes, decoding on demand. But that is a big change to the API and would break much more code. One day we'll do this properly; for now we have this, which I think is a pretty reasonble compromise.
Please understand, I am not arguing against the existence of this encoding layer in general. It's a fine idea for a simplistic high-level filesystem interaction library. But it should be *optional*, not part of the compiler or "base".
Ok, so I was about to reply and say that the low-level API is available via the unix and Win32 packages, and then I thought I should check first, and I discovered that even using System.Posix you get the magic encoding behaviour.

I really think we should provide the native APIs. The problem is that the System.Posix.Directory API is all in terms of FilePath (=String), and if we gave that a different meaning from the System.Directory FilePaths then confusion would ensue. So perhaps we need to add another API to System.Posix with filesystem operations in terms of ByteString, and similarly for Win32.

Cheers,

Simon

On Tue, Nov 8, 2011 at 03:04, Simon Marlow
As mentioned earlier in the thread, this behavior is breaking things. Due to an implementation error, programs compiled with GHC 7.2 on POSIX systems cannot open files unless their paths also happen to be valid text according to their locale. It is very difficult to work around this error, because the paths-are-text logic was placed at a very low level in the library stack.
So your objection is that there is a bug? What if we fixed the bug?
My objection is that the current implementation provides no way to work around potential bugs. GHC is software. Like all software, it contains errors, and new features are likely to contain more errors. When adding behavior like automatic path encoding, there should always be a way to avoid or work around it, in case a severe bug is discovered.
It would probably be better to have an abstract FilePath type and to keep the original bytes, decoding on demand. But that is a big change to the API and would break much more code. One day we'll do this properly; for now we have this, which I think is a pretty reasonble compromise.
Please understand, I am not arguing against the existence of this encoding layer in general. It's a fine idea for a simplistic high-level filesystem interaction library. But it should be *optional*, not part of the compiler or "base".
Ok, so I was about to reply and say that the low-level API is available via the unix and Win32 packages, and then I thought I should check first, and I discovered that even using System.Posix you get the magic encoding behaviour.
I really think we should provide the native APIs. The problem is that the System.Posix.Directory API is all in terms of FilePath (=String), and if we gave that a different meaning from the System.Directory FilePaths then confusion would ensue. So perhaps we need to add another API to System.Posix with filesystem operations in terms of ByteString, and similarly for Win32.
+1

I think most users would be OK with having System.Posix treat FilePath differently, as long as this is clearly documented, but if you feel a separate API is better then I have no objection. As long as there's some way to say "I know what I'm doing, here's the bytes" to the library.

The Win32 package uses wide-character functions, so I'm not sure whether bytes would be appropriate there. My instinct says to stick with chars, via withCWString or equivalent. The package maintainer will have a better idea of what fits with the OS's idioms.

On 08/11/2011 15:42, John Millikin wrote:
On Tue, Nov 8, 2011 at 03:04, Simon Marlow
wrote: I really think we should provide the native APIs. The problem is that the System.Posix.Directory API is all in terms of FilePath (=String), and if we gave that a different meaning from the System.Directory FilePaths then confusion would ensue. So perhaps we need to add another API to System.Posix with filesystem operations in terms of ByteString, and similarly for Win32.
+1
I think most users would be OK with having System.Posix treat FilePath differently, as long as this is clearly documented, but if you feel a separate API is better then I have no objection. As long as there's some way to say "I know what I'm doing, here's the bytes" to the library.
The Win32 package uses wide-character functions, so I'm not sure whether bytes would be appropriate there. My instinct says to stick with chars, via withCWString or equivalent. The package maintainer will have a better idea of what fits with the OS's idioms.
Ok, I spent most of today adding ByteString alternatives for all of the functions in System.Posix that use FilePath or environment strings. The Haddocks for my augmented unix package are here:

http://community.haskell.org/~simonmar/unix-with-bytestring-extras/index.htm...

In particular, the module System.Posix.ByteString is the whole System.Posix API but with ByteString FilePaths and environment strings:

http://community.haskell.org/~simonmar/unix-with-bytestring-extras/System-Po...

It has one addition relative to System.Posix:

getArgs :: IO [ByteString]

Let me know what you think. I suspect the main controversial aspect is that I included

type FilePath = ByteString

which is a bit cute but might be confusing.

Cheers,

Simon

On Wed, Nov 9, 2011 at 08:04, Simon Marlow
Ok, I spent most of today adding ByteString alternatives for all of the functions in System.Posix that use FilePath or environment strings. The Haddocks for my augmented unix package are here:
http://community.haskell.org/~simonmar/unix-with-bytestring-extras/index.htm...
In particular, the module System.Posix.ByteString is the whole System.Posix API but with ByteString FilePaths and environment strings:
http://community.haskell.org/~simonmar/unix-with-bytestring-extras/System-Po...
This looks lovely -- thank you. Once it's released, I'll port all my libraries over to using it.
It has one addition relative to System.Posix:
getArgs :: IO [ByteString]
Thank you very much! Several tools I use daily accept binary data as command-line options, and this will make it much easier to port them to Haskell in the future.
Let me know what you think. I suspect the main controversial aspect is that I included
type FilePath = ByteString
which is a bit cute but might be confusing.
Indeed, I was very confused when I saw that in the docs. If it's not too much trouble, could those functions accept/return ByteString directly?

On 09/11/2011 16:42, John Millikin wrote:
On Wed, Nov 9, 2011 at 08:04, Simon Marlow
wrote: Ok, I spent most of today adding ByteString alternatives for all of the functions in System.Posix that use FilePath or environment strings. The Haddocks for my augmented unix package are here:
http://community.haskell.org/~simonmar/unix-with-bytestring-extras/index.htm...
In particular, the module System.Posix.ByteString is the whole System.Posix API but with ByteString FilePaths and environment strings:
http://community.haskell.org/~simonmar/unix-with-bytestring-extras/System-Po...
This looks lovely -- thank you.
Once it's released, I'll port all my libraries over to using it.
It has one addition relative to System.Posix:
getArgs :: IO [ByteString]
Thank you very much! Several tools I use daily accept binary data as command-line options, and this will make it much easier to port them to Haskell in the future.
Let me know what you think. I suspect the main controversial aspect is that I included
type FilePath = ByteString
which is a bit cute but might be confusing.
Indeed, I was very confused when I saw that in the docs. If it's not too much trouble, could those functions accept/return ByteString directly?
I've done a search/replace and called it RawFilePath. Ok? Cheers, Simon
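For readers who want to see what code against the proposed API would look like, here is a minimal sketch. The module name and getArgs are as Simon describes; until the package is released, treat the exact exports as assumptions:

------------------
-- Echo each command-line argument's raw bytes; nothing goes through
-- the locale decoder.
import qualified Data.ByteString.Char8 as B
import System.Posix.ByteString (getArgs)

main :: IO ()
main = getArgs >>= mapM_ B.putStrLn
------------------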

On 11/8/11 6:04 AM, Simon Marlow wrote:
I really think we should provide the native APIs. The problem is that the System.Posix.Directory API is all in terms of FilePath (=String), and if we gave that a different meaning from the System.Directory FilePaths then confusion would ensue. So perhaps we need to add another API to System.Posix with filesystem operations in terms of ByteString, and similarly for Win32.
+1. It'd be nice to have an abstract FilePath. But until that happens, it's important to distinguish the automagic type from the raw type. H98's FilePath=String vs ByteString seems a good way to do that. -- Live well, ~wren

On 7 November 2011 17:32, John Millikin
I am also not convinced that it is possible to correctly implement either of these functions if their behavior is dependent on the user's locale.
FWIW it's only dependent on the user's locale because whether glibc iconv detects errors in the *from* sequence depends on what the *to* locale is. Clearly an invalid *from* sequence should be reported as invalid regardless of *to*.

I know this isn't much comfort to you, though, since you do have to worry about broken behaviour in 7.2, and possible future breakage with changes in iconv. I understand your point that it would be better from a complexity point of view to just roundtrip the bytes as *bytes* without relying on all this escaping/unescaping code.
Please understand, I am not arguing against the existence of this encoding layer in general. It's a fine idea for a simplistic high-level filesystem interaction library. But it should be *optional*, not part of the compiler or "base.
The problem is that I *really really want* getArgs to decode the command line arguments. That's almost the whole point of this change, and it is what most users seem to expect. Given this constraint, the code has to be part of "base", and if getArgs has this behaviour then any file system function we ship that takes a FilePath (i.e. all the functions in base, directory, win32 and unix) must be prepared to handle these escape characters for consistency.

I *would* be happy to expose an alternative file system API from the posix package that operates with ByteString paths. This package could provide a function :: FilePath -> ByteString that encodes the string with the fileSystemEncoding (removing escapes in the process) for interoperability with file names arriving via getArgs, and at that point the decision about whether to use the escaping/unescaping code would be (mostly) in the hands of the user. We could even have posix expose APIs to get command line arguments/environment variables as ByteStrings, and then you could avoid escape/unescape entirely.

Which of these solutions (if any) would satisfy you?

1. The current situation, plus an alternative API exposed from "posix" along the lines described above
2. The current situation but with the escape/unescape modified so it allows true roundtripping (at the cost of weird "surrogate" Char values popping up now and again). If you have this you can reliably implement the alternative API on top of the String based one, assuming we got our escape/unescape code right

I hope we can work together to find a solution here.

Cheers,

Max
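The FilePath -> ByteString function Max mentions would essentially be the encode direction of the earlier sketch, packed into a ByteString. In the sketch below, encodeFilePath is an illustrative name, and fileSystemEncoding is again the GHC 7.2 name:

------------------
import qualified Data.ByteString as B
import qualified GHC.Foreign
import GHC.IO.Encoding (fileSystemEncoding)

-- Encode a FilePath with the fileSystemEncoding, collapsing any
-- escape characters back to the original bytes.
encodeFilePath :: FilePath -> IO B.ByteString
encodeFilePath path =
  GHC.Foreign.withCStringLen fileSystemEncoding path B.packCStringLen
------------------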

On Mon, Nov 07, 2011 at 05:02:32PM +0000, Simon Marlow wrote:
Basically, imagine a reversible transformation:
encode :: String -> [Word8]
decode :: [Word8] -> String
this transformation is applied in the appropriate direction by the IO library to translate filesystem paths into FilePath and vice versa. No information is lost
I think that would be great if it were true, but it isn't:

$ touch `printf '\x80'`
$ touch `printf '\xEE\xBE\x80'`
$ ghc -e 'System.Directory.getDirectoryContents "." >>= print'
["\61312",".","\61312",".."]

Both of those filenames get encoded as \61312 (U+EF80).

Thanks

Ian

On 07/11/2011 17:57, Ian Lynagh wrote:
On Mon, Nov 07, 2011 at 05:02:32PM +0000, Simon Marlow wrote:
Basically, imagine a reversible transformation:
encode :: String -> [Word8]
decode :: [Word8] -> String
this transformation is applied in the appropriate direction by the IO library to translate filesystem paths into FilePath and vice versa. No information is lost
I think that would be great if it were true, but it isn't:
$ touch `printf '\x80'`
$ touch `printf '\xEE\xBE\x80'`
$ ghc -e 'System.Directory.getDirectoryContents "." >>= print'
["\61312",".","\61312",".."]
Both of those filenames get encoded as \61312 (U+EF80).
Ouch, I missed that. I was under the impression that we guaranteed roundtripping, but it seems not. Max - we need to fix this. Cheers, Simon

for what it is worth, I would like to see both System.IO and Directory export "internal functions" where the filepath is a raw byte representation.

I have utilities that regularly scan 100,000s of files and hash the paths; the details are irrelevant to this discussion, the point being that the locale encoding/decoding is not relevant in this situation and adds unnecessary overhead that would affect the speed of the file-system scans.

A denotation of a filepath as an uninterpreted sequence of bytes is the lowest common denominator for all systems that I know of and would be worthwhile to export from the system libraries, upon which other abstractions can be built.

I agree that for the general user the current behavior is sufficient; however, exporting the raw interface would be beneficial for some users, for instance those that have responded to this thread.

On 7/11/2011 2:42 AM, Max Bolingbroke wrote:
On 6 November 2011 04:14, John Millikin
wrote: For what it's worth, on my Ubuntu system, Nautilus ignores the locale and just treats all paths as either UTF8 or invalid. To me, this seems like the most reasonable option; the concept of "locale encoding" is entirely vestigial, and should only be used in certain specialized cases.
Unfortunately non-UTF8 locale encodings are seen in practice quite often. I'm not sure about Linux, but certainly lots of Windows systems are configured with a locale encoding like GBK or Big5.
Paths as text is what *Windows* programmers expect. Paths as bytes is what's expected by programmers on non-Windows OSes, including Linux and OS X.
IIRC paths on OS X are guaranteed to be valid UTF-8. The only platform that uses bytes for paths (that we care about) is Linux.
I'm not saying one is inherently better than the other, but considering that various UNIX and UNIX-like operating systems have been using byte-based paths for near on forty years now, trying to abolish them by redefining the type is not a useful action.
We have to:

1. Provide an API that makes sense on all our supported OSes
2. Have getArgs :: IO [String]
3. Have it such that if you go to your console and write (./MyHaskellProgram 你好) then getArgs tells you ["你好"]
Given these constraints I don't see any alternative to PEP-383 behaviour.
If you're going to make all the System.IO stuff use text, at least give us an escape hatch. The "unix" package is ideally suited, as it's already inherently OS-specific. Something like this would be perfect:
You can already do this with the implemented design. We have:
openFile :: FilePath -> IO Handle
The FilePath will be encoded in the fileSystemEncoding. On Unix this will have PEP383 roundtripping behaviour. So if you want openFile' :: [Byte] -> IO Handle you can write something like this:
escape = map (\b -> if b < 128 then chr b else chr (0xEF00 + b))
openFile' = openFile . escape
The bytes that reach the API call will be exactly the ones you supply. (You can also implement "escape" by just encoding the [Byte] with the fileSystemEncoding).
Likewise, if you have a String and want to get the [Byte] we decoded it from, you just need to encode the String again with the fileSystemEncoding.
If this is not enough for you please let me know, but it seems to me that it covers all your use cases, without any need to reimplement the FFI bindings.
Max

Can't we just have the usual .Internal module convention, where people who
want internals can get at them if they need to, and most people get a
simpler interface? It's amazingly frustrating when you have a library that
does 99% of what you need it to do, except for one tiny internal detail
that the author didn't foresee anyone needing, so didn't export.
2011/11/6 John Lask
for what it is worth, I would like to see both System.IO and Directory export "internal functions" where the filepath is a Raw Byte representation.
I have utilities that regularly scan 100,000 of files and hash the path the details of which are irrelevant to this discussion, the point being that the locale encoding/decoding is not relevant in this situation and adds unnecessary overhead that would affect the speed of the file-system scans.
A denotation of a filepath as an uninterpreted sequence of bytes is the lowest common denominator for all systems that I know of and would be worthwhile to export from the system libraries upon which other abstractions can be built.
I agree that for the general user the current behavior is sufficient, however exporting the raw interface would be beneficial for some users, for instance those that have responded to this thread.
On 7/11/2011 2:42 AM, Max Bolingbroke wrote:
On 6 November 2011 04:14, John Millikin
wrote: For what it's worth, on my Ubuntu system, Nautilus ignores the locale and just treats all paths as either UTF8 or invalid. To me, this seems like the most reasonable option; the concept of "locale encoding" is entirely vestigial, and should only be used in certain specialized cases.
Unfortunately non-UTF8 locale encodings are seen in practice quite often. I'm not sure about Linux, but certainly lots of Windows systems are configured with a locale encoding like GBK or Big5.
Paths as text is what *Windows* programmers expect. Paths as bytes is what's expected by programmers on non-Windows OSes, including Linux and OS X.
IIRC paths on OS X are guaranteed to be valid UTF-8. The only platform that uses bytes for paths (that we care about) is Linux.
I'm not saying one is inherently better than the other, but considering that various UNIX and UNIX-like operating systems have been using byte-based paths for near on forty years now, trying to abolish them by redefining the type is not a useful action.
We have to:

1. Provide an API that makes sense on all our supported OSes
2. Have getArgs :: IO [String]
3. Have it such that if you go to your console and write (./MyHaskellProgram 你好) then getArgs tells you ["你好"]
Given these constraints I don't see any alternative to PEP-383 behaviour.
If you're going to make all the System.IO stuff use text, at least give us an escape hatch. The "unix" package is ideally suited, as it's already inherently OS-specific. Something like this would be perfect:
You can already do this with the implemented design. We have:
openFile :: FilePath -> IO Handle
The FilePath will be encoded in the fileSystemEncoding. On Unix this will have PEP383 roundtripping behaviour. So if you want openFile' :: [Byte] -> IO Handle you can write something like this:
escape = map (\b -> if b < 128 then chr b else chr (0xEF00 + b))
openFile' = openFile . escape
The bytes that reach the API call will be exactly the ones you supply. (You can also implement "escape" by just encoding the [Byte] with the fileSystemEncoding).
Likewise, if you have a String and want to get the [Byte] we decoded it from, you just need to encode the String again with the fileSystemEncoding.
If this is not enough for you please let me know, but it seems to me that it covers all your use cases, without any need to reimplement the FFI bindings.
Max

Hi, On 01.11.2011, at 19:43, Max Bolingbroke wrote:
As I pointed out earlier in the thread you can recover the old behaviour if you really want it by manually reencoding the strings, so I would dispute the claim that it is "impossible to fix within the given API".
As far as I know, not all encodings are reversible, i.e. there are byte sequences which are invalid UTF-8. Therefore, decoding and re-encoding might not return the exact same byte sequence. Cheers, Jean

On 2 November 2011 10:03, Jean-Marie Gaillourdet
As far as I know, not all encodings are reversible, i.e. there are byte sequences which are invalid UTF-8. Therefore, decoding and re-encoding might not return the exact same byte sequence.
The PEP 383 mechanism explicitly recognises this fact and defines a reversible way of decoding bytes into strings. The new behaviour is guaranteed to be reversible except for certain private use codepoints (0xEF00 to 0xEFFF inclusive) which:
1. We do not expect to see in practice
2. Are unofficially standardised for use with this sort of "encoding hack"
Max
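To make the mechanism concrete, a sketch of the byte-level escaping described above (hypothetical helper names; the real logic lives inside GHC's codec machinery):

import Data.Char (chr, ord)
import Data.Word (Word8)

-- A byte the codec could not decode is smuggled into the String as a
-- private-use codepoint in the 0xEF00-0xEFFF range.
escapeByte :: Word8 -> Char
escapeByte b = chr (0xEF00 + fromIntegral b)

-- On re-encoding, characters in that range turn back into raw bytes.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | n >= 0xEF00 && n <= 0xEFFF = Just (fromIntegral (n - 0xEF00))
  | otherwise                  = Nothing
  where n = ord c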

On Wed, Nov 02, 2011 at 01:29:16PM +0000, Max Bolingbroke wrote:
On 2 November 2011 10:03, Jean-Marie Gaillourdet wrote:
As far as I know, not all encodings are reversible, i.e. there are byte sequences which are invalid UTF-8. Therefore, decoding and re-encoding might not return the exact same byte sequence.
The PEP 383 mechanism explicitly recognises this fact and defines a reversible way of decoding bytes into strings. The new behaviour is guaranteed to be reversible except for certain private use codepoints (0xEF00 to 0xEFFF inclusive) which:
1. We do not expect to see in practice
2. Are unofficially standardised for use with this sort of "encoding hack"
I don't understand this. If I understand correctly, you use U+EF00-U+EFFF to encode the bytes 0-255 when they are not a valid part of the UTF-8 stream. So why not encode U+EF00 (which in UTF-8 is 0xEE 0xBC 0x80) as U+EFEE U+EFBC U+EF80, and so on? Doesn't it then become completely reversible? Thanks Ian

On 2 November 2011 16:29, Ian Lynagh
If I understand correctly, you use U+EF00-U+EFFF to encode the bytes 0-255 when they are not a valid part of the UTF-8 stream.
Yes.
So why not encode U+EF00 (which in UTF8 is 0xEE 0xBC 0x80) as U+EFEE U+EFBC U+EF80, and so on? Doesn't it then become completely reversible?
This was also suggested by Mark Lentczner at the time I wrote the patch, but I raised a few objections (reproduced below):
"""
This would require us to:
1. Unconditionally decode these byte sequences using the escape mechanism, even if using a non-roundtripping encoding. This is because the chars that result might be fed back into a roundtripping encoding, where they would otherwise get confused with escapes representing some other bytes.
2. Unconditionally decode these particular characters from escapes, even if using a non-roundtripping decoding -- necessary because of 1.
Which are both a little annoying. Perhaps more seriously, it would play badly with e.g. reading in UTF-8 and writing out UTF-16, because your UTF-16 would have bits of UTF-8 representing these private-use chars embedded within it.
"""
So although this approach is somewhat attractive, I'm not sure the benefits of complete roundtripping outweigh the costs. This is why the unmodified PEP383 approach is kind of nice - it uses lone surrogate (rather than private use) codepoints to do the escaping, and these codepoints are simply not allowed to occur in valid UTF-encoded text.
Max

On Wed, Nov 02, 2011 at 07:02:09PM +0000, Max Bolingbroke wrote: [snip some stuff I didn't understand. I think I made the mistake of entering a Unicode discussion]
This is why the unmodified PEP383 approach is kind of nice - it uses lone surrogate (rather than private use) codepoints to do the escaping, and these codepoints are simply not allowed to occur in valid UTF-encoded text.
If they do not occur, then why does it matter whether or not occurrences would get escaped? They are allowed to occur in Linux/ext2 filenames, anyway, and I think we ought to be able to handle them correctly if they do. Thanks Ian

On 2 November 2011 19:13, Ian Lynagh
[snip some stuff I didn't understand. I think I made the mistake of entering a Unicode discussion]
Sorry, perhaps that was too opaque! The problem is that if we commit to supporting occurrences of the private-use codepoint 0xEF80, then what happens if we:
1. Decode the UTF-32LE data [0x80, 0xEF, 0x00, 0x00] to the string "\xEF80"
2. Pass the string "\xEF80" to a function that encodes it using an encoding which knows about the escaping mechanism
3. Consequently encode "\xEF80" as [0x80]
This seems a bit sad.
They are allowed to occur in Linux/ext2 filenames, anyway, and I think we ought to be able to handle them correctly if they do.
In Python, if a filename is decoded using UTF-8 and the "surrogate escape" error handler, occurrences of lone surrogates are a decoding error because they are not allowed to occur in UTF-8 text. As a result the lone surrogate is put into the string escaped so it can be roundtripped back to a lone surrogate on output. So Python works OK.

In GHC >= 7.2, if a filename is decoded using UTF-8 and the "Roundtrip" error handler, occurrences of 0xEFNN are not a decoding error because they are perfectly fine Unicode codepoints. As a result they get put into the string unescaped, and so when we try to roundtrip the string we get the byte 0xNN in the output rather than the UTF-8 encoding of 0xEFNN. So GHC does not work OK in this situation :-(

(The problem I outlined at the start of this email doesn't arise with the lone surrogate mechanism because surrogates aren't allowed in UTF-32 text either. So step 1 in the process would have failed with a decoding error.)

Hope that helps, Max
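For concreteness, a small program that should exhibit the GHC behaviour described above (hedged: it assumes a UTF-8 locale, so that fileSystemEncoding is UTF-8 with the roundtrip handler, and uses the 7.2-era names):

import Data.Char (ord)
import Data.Word (Word8)
import Foreign.Marshal.Array (peekArray, withArray)
import Foreign.Ptr (castPtr)
import GHC.IO.Encoding (fileSystemEncoding)
import qualified GHC.Foreign as GHC

main :: IO ()
main = do
  let input = [0xEE, 0xBC, 0x80] :: [Word8]  -- the UTF-8 encoding of U+EF00
  s <- withArray input $ \p ->
         GHC.peekCStringLen fileSystemEncoding (castPtr p, length input)
  print (map ord s)  -- [61184], i.e. U+EF00: decoded, passed through unescaped
  out <- GHC.withCStringLen fileSystemEncoding s $ \(p, n) ->
           peekArray n (castPtr p) :: IO [Word8]
  print out          -- [0]: re-encoding yields the byte 0x00, not the input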

On Wed, Nov 02, 2011 at 07:59:21PM +0000, Max Bolingbroke wrote:
On 2 November 2011 19:13, Ian Lynagh wrote:
They are allowed to occur in Linux/ext2 filenames, anyway, and I think we ought to be able to handle them correctly if they do.
In Python, if a filename is decoded using UTF8 and the "surrogate escape" error handler, occurrences of lone surrogates are a decoding error because they are not allowed to occur in UTF-8 text. As a result the lone surrogate is put into the string escaped so it can be roundtripped back to a lone surrogate on output. So Python works OK.
In GHC >= 7.2, if a filename is decoded using UTF8 and the "Roundtrip" error handler, occurrences of 0xEFNN are not a decoding error because they are perfectly fine Unicode codepoints. As a result they get put into the string unescaped, and so when we try to roundtrip the string we get the byte 0xNN in the output rather than the UTF-8 encoding of 0xEFNN. So GHC does not work OK in this situation :-(
Are you saying there's a bug that should be fixed? Thanks Ian

On 2 November 2011 20:16, Ian Lynagh
Are you saying there's a bug that should be fixed?
You can choose between two options:
1. Failing to roundtrip some strings (in our case, those containing the 0xEFNN byte sequences)
2. Having GHC's decoding functions return strings including codepoints that should not be allowed (i.e. lone surrogates)
At the time I implemented this there was significant support for 2, so that is what we have. At the time I was convinced that 2 was the right thing to do, but now I'm more agnostic. But anyway the current behaviour is not really a bug -- it is by design :-)
Max

On 02/11/2011 21:40, Max Bolingbroke wrote:
On 2 November 2011 20:16, Ian Lynagh wrote:
Are you saying there's a bug that should be fixed?
You can choose between two options:
1. Failing to roundtrip some strings (in our case, those containing the 0xEFNN byte sequences)
2. Having GHC's decoding functions return strings including codepoints that should not be allowed (i.e. lone surrogates)
At the time I implemented this there was significant support for 2, so that is what we have.
Don't you mean 1 is what we have?
At the time I was convinced that 2 was the right thing to do, but now I'm more agnostic. But anyway the current behaviour is not really a bug -- it is by design :-)
Failing to roundtrip in some cases, and doing so silently, seems highly suboptimal to me. I'm sorry I didn't pick up on this at the time (Unicode is a swamp :). Cheers, Simon

On 8 November 2011 11:43, Simon Marlow
Don't you mean 1 is what we have?
Yes, sorry!
Failing to roundtrip in some cases, and doing so silently, seems highly suboptimal to me. I'm sorry I didn't pick up on this at the time (Unicode is a swamp :).
I *can* change the implementation back to using lone surrogates. This gives us guaranteed roundtripping but it means that the user might see lone-surrogate Char values in Strings from the filesystem/command line. IIRC this does break some software -- e.g. Bryan's "text" library explicitly checks for such characters and fails if it detects them.

So whatever happens we are going to end up making some group of users unhappy!
* No PEP383: Haskellers using non-ASCII get upset when their command line argument [String]s aren't in fact sequences of characters, but sequences of bytes in some arbitrary encoding
* PEP383 (surrogates): Unicoders get upset by lone surrogates (which can actually occur at the moment, independent of PEP383 -- e.g. as character literals or from the FFI)
* PEP383 (private chars): Unixers get upset that we can't roundtrip byte sequences that look like the codepoint 0xEFXX encoded in the current locale. In practice, 0xEFXX is only decodable from a UTF encoding, so we fail to roundtrip byte sequences like the one Ian posted.

I'm happy to implement any behaviour, I would just like to know that whatever it is is accepted as the correct tradeoff :-)

RE exposing a ByteString-based interface to the IO library from base/unix/whatever: AFAIK Python doesn't do this, and just tells people to use the (x.encode(sys.getfilesystemencoding(), "surrogateescape")) escape hatch, which is what I've been recommending. I think this would be more satisfying to John if it were actually guaranteed to work on arbitrary byte sequences, not just *highly likely* to work :-)
Max

On 09/11/2011 10:39, Max Bolingbroke wrote:
On 8 November 2011 11:43, Simon Marlow
wrote: Don't you mean 1 is what we have?
Yes, sorry!
Failing to roundtrip in some cases, and doing so silently, seems highly suboptimal to me. I'm sorry I didn't pick up on this at the time (Unicode is a swamp :).
I *can* change the implementation back to using lone surrogates. This gives us guaranteed roundtripping but it means that the user might see lone-surrogate Char values in Strings from the filesystem/command line. IIRC this does break some software -- e.g. Bryan's "text" library explicitly checks for such characters and fails if it detects them.
So whatever happens we are going to end up making some group of users unhappy!
* No PEP383: Haskellers using non-ASCII get upset when their command line argument [String]s aren't in fact sequences of characters, but sequences of bytes in some arbitrary encoding
* PEP383 (surrogates): Unicoders get upset by lone surrogates (which can actually occur at the moment, independent of PEP383 -- e.g. as character literals or from the FFI)
* PEP383 (private chars): Unixers get upset that we can't roundtrip byte sequences that look like the codepoint 0xEFXX encoded in the current locale. In practice, 0xEFXX is only decodable from a UTF encoding, so we fail to roundtrip byte sequences like the one Ian posted.
I'm happy to implement any behaviour, I would just like to know that whatever it is is accepted as the correct tradeoff :-)
I would be happy with the surrogate approach I think. Arguably, if you try to treat a string with lone surrogates as Unicode and it fails, then that is a feature: the original string wasn't Unicode. All you can do with an invalid Unicode string is use it as a FilePath again, and the right thing will happen. Alternatively if we stick with the private char approach, it should be possible to have an escaping scheme for 0xEFxx characters in the input that would enable us to roundtrip correctly. That is, escape 0xEFxx into a sequence 0xYYEF 0xYYxx for some suitable YY. But perhaps that would be too expensive - an extra translation pass over the buffer after iconv (well, we do this for newline translation, so maybe it's not too bad).
RE exposing a ByteString based interface to the IO library from base/unix/whatever: AFAIK Python doesn't do this, and just tells people to use the (x.encode(sys.getfilesystemencoding(), "surrogateescape")) escape hatch, which is what I've been recommending. I think this would be more satisfying to John if it were actually guaranteed to work on arbitrary byte sequences, not just *highly likely* to work :-)
The performance overhead of all this worries me. withCString has taken a huge performance hit, and I think there are people who want to know that there aren't several complex encoding/decoding passes between their Haskell code and the POSIX API. We ought to be able to program to POSIX directly, and the same goes for Win32. Cheers, Simon

On Wed, Nov 09, 2011 at 11:02:54AM +0000, Simon Marlow wrote:
I would be happy with the surrogate approach I think. Arguably, if you try to treat a string with lone surrogates as Unicode and it fails, then that is a feature: the original string wasn't Unicode. All you can do with an invalid Unicode string is use it as a FilePath again, and the right thing will happen.
If we aren't going to guarantee that the encoded string is unicode, then is there any benefit to encoding it in the first place?
Alternatively if we stick with the private char approach, it should be possible to have an escaping scheme for 0xEFxx characters in the input that would enable us to roundtrip correctly. That is, escape 0xEFxx into a sequence 0xYYEF 0xYYxx for some suitable YY.
Why not encode into private chars, i.e. encode U+EF00 (which in UTF-8 is 0xEE 0xBC 0x80) as U+EFEE U+EFBC U+EF80, etc? (Max gave some reasons earlier in this thread, but I'd need examples of what goes wrong to understand them). Thanks Ian

On 9 November 2011 13:11, Ian Lynagh
If we aren't going to guarantee that the encoded string is unicode, then is there any benefit to encoding it in the first place?
(I think you mean decoded here - my understanding is that decode :: ByteString -> String, encode :: String -> ByteString)
Why not encode into private chars, i.e. encode U+EF00 (which in UTF8 is 0xEE 0xBC 0x80) as U+EFEE U+EFBC U+EF80, etc?
(Max gave some reasons earlier in this thread, but I'd need examples of what goes wrong to understand them).
We can do this but it doesn't solve all problems. Here are three such problems:

PROBLEM 1 (bleeding from non-escaping to escaping TextEncodings)
===
So let's say we are reading a filename from stdin. Currently stdin uses the utf8 TextEncoding -- this TextEncoding knows nothing about private-char roundtripping, and will throw an exception when decoding bad bytes or encoding our private chars.

Now the user types a UTF-8 U+EF80 character - i.e. we get the bytes 0xEE 0xBE 0x80 on stdin. The utf8 TextEncoding naively decodes this byte sequence to the character sequence U+EF80. We have lost at this point: if the user supplies the resulting String to a function that encodes the String with the fileSystemEncoding, the String will be encoded into the byte sequence 0x80. This is probably not what we want to happen! It means that a program like this:
"""
main = do
  fp <- getLine
  readFile fp >>= putStrLn
"""
will fail ("file not found: \x80") when given the name of an (existent) file 0xEE 0xBE 0x80.

PROBLEM 2 (bleeding between two different escaping TextEncodings)
===
So let's say the user supplies the UTF-8 encoded U+EF00 (byte sequence 0xEE 0xBC 0x80) as a command line argument, so it goes through the fileSystemEncoding. In your scheme the resulting Char sequence is U+EFEE U+EFBC U+EF80. What happens when we then *encode* that Char sequence using a UTF-16 TextEncoding (that knows about the 0xEFxx escape mechanism)? The resulting byte sequence is 0xEE 0xBC 0x80, NOT the UTF-16 encoded version of U+EF00! This is certainly contrary to what the user would expect.

PROBLEM 3 (bleeding from escaping to non-escaping TextEncodings)
===
Just as above, let's say the user supplies the UTF-8 encoded U+EF00 (byte sequence 0xEE 0xBC 0x80) as a command line argument, so it goes through the fileSystemEncoding. In your scheme the resulting Char sequence is U+EFEE U+EFBC U+EF80. If you try to write this String to stdout (which uses the UTF-8 encoding that knows nothing about 0xEFxx escapes) you just get an exception, NOT the UTF-8 encoded version of U+EF00. Game over man, game over!

CONCLUSION
===
As far as I can see, the proposed escaping scheme recovers the roundtrip property but fails to regain a lot of other reasonable-looking behaviours.

(Note that the above outlined problems are problems in the current implementation too -- but the current implementation doesn't even pretend to support U+EFxx characters. Its correctness is entirely dependent on them never showing up, which is why we chose a part of the private codepoint region that is reserved specifically for the purpose of encoding hacks.)
Max

On 09/11/2011 15:58, Max Bolingbroke wrote:
(Note that the above outlined problems are problems in the current implementation too -- but the current implementation doesn't even pretend to support U+EFxx characters. Its correctness is entirely dependent on them never showing up, which is why we chose a part of the private codepoint region that is reserved specifically for the purpose of encoding hacks).
But we can't make that assumption, because the user might have accidentally set the locale wrong and then all kinds of garbage will show up in decoded file paths. I think it's important that programs that just traverse the file system keep working under those conditions, rather than randomly failing due to (encode . decode) being almost but not quite the identity. Cheers, Simon

My primary concerns are (in order of priority - and I only speak for myself):
(a) consistency across platforms
(b) minimizing (unrequired) performance overhead

I would prefer an API which is consistent across win32, posix or any other OS, and which only does as much as what the user (us) wants. For example:

module System.Directory.ByteString
...
type FilePath = ByteString
getDirectoryContents :: FilePath -> IO [FilePath]

This is the same for both win32 and posix, and represents raw uninterpreted bytestrings in whatever encoding/(non-encoding) the OS provides. Implicitly it is for the user to know and understand what they're getting (UTF-16 in the case of Windows, bytes in the case of posix platforms). Then this API can be re-exported with the decoding/encoding by System.Directory/System.IO, which would export FilePath = String, i.e. a two-level API.
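For illustration, a sketch of the decoding shim such a two-level API implies (System.Directory.ByteString above is the poster's hypothetical proposal; this shim only uses GHC.Foreign and the 7.2-era fileSystemEncoding):

import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as B
import GHC.IO.Encoding (fileSystemEncoding)
import qualified GHC.Foreign as GHC

-- Turn a raw byte-level directory entry into the String that the existing
-- String-based getDirectoryContents would have returned.
decodeEntry :: B.ByteString -> IO String
decodeEntry bs =
  B.unsafeUseAsCStringLen bs (GHC.peekCStringLen fileSystemEncoding)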

On Wed, Nov 09, 2011 at 03:58:47PM +0000, Max Bolingbroke wrote:
(Note that the above outlined problems are problems in the current implementation too
Then the proposal seems to me to be strictly better than the current system. Under both systems the wrong thing happens when U+EFxx is entered as Unicode text, but the proposed system works for all filenames read from the filesystem. In the longer term, I think we need to fix the underlying problem that (for example) both getLine and getArgs produce a String from bytes, but do so in different ways. At some point we should change the type of getArgs and friends. Thanks Ian

On 10 November 2011 00:17, Ian Lynagh
On Wed, Nov 09, 2011 at 03:58:47PM +0000, Max Bolingbroke wrote:
(Note that the above outlined problems are problems in the current implementation too
Then the proposal seems to me to be strictly better than the current system. Under both systems the wrong thing happens when U+EFxx is entered as Unicode text, but the proposed system works for all filenames read from the filesystem.
Your proposal is not *strictly* better than what is implemented, in at least the following ways:
1. With your proposal, if you read a filename containing U+EF80 into the variable "fp" and then expect the character U+EF80 to be in fp, you will be surprised to only find its escaped form. In the current implementation you will in fact find U+EF80.
2. The performance of iconv-based decoders will suffer, because we will need to do a post-pass in the TextEncoding to do this extra escaping for U+EFxx characters.
I'm really not keen on implementing a fix that addresses such a limited subset of the problems, anyway.
In the longer term, I think we need to fix the underlying problem that (for example) both getLine and getArgs produce a String from bytes, but do so in different ways. At some point we should change the type of getArgs and friends.
I'm not sure about this. hGetLine produces a String from bytes in a different way depending on the encoding set on the Handle, but we don't try to differentiate in the type system between Strings decoded using different TextEncodings. Why should getLine and getArgs be different? If you are really unhappy about getLine and getArgs having different behaviour in this sense, one option would be to change the default stdout/stdin TextEncoding to use the fileSystemEncoding that knows about escapes. (Note that this would mean that your Haskell program wouldn't immediately die if you were using the UTF8 locale and then tried to read some non-UTF8 input from stdin, which might or might not be a good thing, depending on the application.) Max
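A minimal sketch of that option (hSetEncoding and fileSystemEncoding exist under these names in GHC 7.2; whether this is wise depends on the application, as noted above):

import GHC.IO.Encoding (fileSystemEncoding)
import System.IO (hSetEncoding, stdin, stdout)

-- Make stdin/stdout decode bytes the same way filenames and command line
-- arguments are decoded, escapes and all.
main :: IO ()
main = do
  hSetEncoding stdin fileSystemEncoding
  hSetEncoding stdout fileSystemEncoding
  getLine >>= putStrLn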

On 09/11/2011 13:11, Ian Lynagh wrote:
On Wed, Nov 09, 2011 at 11:02:54AM +0000, Simon Marlow wrote:
I would be happy with the surrogate approach I think. Arguably, if you try to treat a string with lone surrogates as Unicode and it fails, then that is a feature: the original string wasn't Unicode. All you can do with an invalid Unicode string is use it as a FilePath again, and the right thing will happen.
If we aren't going to guarantee that the encoded string is unicode, then is there any benefit to encoding it in the first place?
With a decoded FilePath you can:
- use it as a FilePath argument to some other function
- map all the illegal characters to '?' and then treat it as Unicode, e.g. for printing it out (but then you lose the ability to roundtrip, which is why we can't do this automatically)
Ok, so since we need something like makePrintable :: FilePath -> String, arguably we might as well make that do the locale decoding. That's certainly a good point... Cheers, Simon
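A minimal sketch of the makePrintable idea under the private-char scheme (the function name is Simon's suggestion, not an existing API; 0xEF00-0xEFFF is the escape range discussed above):

import Data.Char (ord)

-- Replace escape characters (smuggled undecodable bytes) with '?', giving
-- a String that is safe to display as ordinary Unicode. Not roundtrippable!
makePrintable :: FilePath -> String
makePrintable = map replace
  where
    replace c
      | ord c >= 0xEF00 && ord c <= 0xEFFF = '?'
      | otherwise                          = c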

On 9 November 2011 16:29, Simon Marlow
Ok, so since we need something like
makePrintable :: FilePath -> String
arguably we might as well make that do the locale decoding. That's certainly a good point...
You could, but getArgs :: IO [String], not :: IO [FilePath]. And locale-decoding command-line arguments is the Right Thing To Do. So this doesn't really avoid the need to roundtrip, does it?

Is there any consensus about what to do here? My take is that we should move back to lone surrogates. This:
1. Recovers the roundtrip property, which we appear to believe is essential
2. Removes all the weird problems I outlined earlier that can occur if your byte strings happen to contain some bytes that decode to U+EFxx
3. DOES break software that expects Strings not to contain surrogate codepoints, but (I agree with you) this is arguably a feature
This is also exactly what Python does, so it has the advantage of being battle tested. Agreed?

We can additionally:
* Provide your layer in the "unix" package where FilePath = ByteString, for people who for some reason care about the performance of their FilePath encoding/decoding, OR who don't want to rely on the roundtripping property being implemented correctly
* Perhaps provide a layer in the "win32" package where FilePath = ByteString but where that ByteString is guaranteed to be UTF-16 encoded (I'm less sure about this, because we can always unambiguously decode this without doing any escaping. It's still useful if you care about performance.)

I'm wondering if we should also have hSetLocaleEncoding, hSetFileSystemEncoding :: TextEncoding -> IO () and change localeEncoding, fileSystemEncoding :: IO TextEncoding. hSetFileSystemEncoding in particular would let people opt out of escapes entirely, as long as they issued it right at the start of their program before the fileSystemEncoding had been used. What do you think?
Max
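For reference, a sketch of the lone-surrogate escaping proposed here (helper names hypothetical; per PEP 383, only undecodable bytes -- necessarily >= 0x80 under UTF-8 -- are escaped, landing in U+DC80-U+DCFF):

import Data.Char (chr, ord)
import Data.Word (Word8)

-- An undecodable byte is smuggled in as a lone surrogate. Such codepoints
-- cannot occur in valid UTF-encoded text, so no collision is possible
-- (unlike the private-use range, which is valid Unicode).
surrogateEscape :: Word8 -> Char
surrogateEscape b = chr (0xDC00 + fromIntegral b)

surrogateUnescape :: Char -> Maybe Word8
surrogateUnescape c
  | n >= 0xDC80 && n <= 0xDCFF = Just (fromIntegral (n - 0xDC00))
  | otherwise                  = Nothing
  where n = ord c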

On 10/11/2011 09:28, Max Bolingbroke wrote:
Is there any consensus about what to do here? My take is that we should move back to lone surrogates. This:
1. Recovers the roundtrip property, which we appear to believe is essential
2. Removes all the weird problems I outlined earlier that can occur if your byte strings happen to contain some bytes that decode to U+EFxx
3. DOES break software that expects Strings not to contain surrogate codepoints, but (I agree with you) this is arguably a feature
This is also exactly what Python does so it has the advantage of being battle tested.
Agreed?
Agreed.
We can additionally: * Provide your layer in the "unix" package where FilePath = ByteString, for people who for some reason care about performance of their FilePath encoding/decoding, OR who don't want to rely on the roundtripping property being implemented correctly
I think I'll do this anyway.
* Perhaps provide a layer in the "win32" package where FilePath = ByteString but where that ByteString is guaranteed to be UTF-16 encoded (I'm less sure about this, because we can always unambiguously decode this without doing any escaping. It's still useful if you care about performance.)
I'm wondering if we should also have hSetLocaleEncoding, hSetFileSystemEncoding :: TextEncoding -> IO () and change localeEncoding, fileSystemEncoding :: IO TextEncoding. hSetFileSystemEncoding in particular would let people opt-out of escapes entirely as long as they issued it right at the start of their program before the fileSystemEncoding had been used.
Ok by me. Cheers, Simon

On 10 November 2011 14:35, Simon Marlow
Agreed.
Committed.
I'm wondering if we should also have hSetLocaleEncoding, hSetFileSystemEncoding :: TextEncoding -> IO () and change localeEncoding, fileSystemEncoding :: IO TextEncoding. hSetFileSystemEncoding in particular would let people opt-out of escapes entirely as long as they issued it right at the start of their program before the fileSystemEncoding had been used.
Ok by me.
I've done this as well. One wart is that System.IO.localeEncoding :: TextEncoding and I don't want to break that API. So System.IO.localeEncoding is always the *initial* locale encoding and does not reflect later changes via setLocaleEncoding. Max
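A sketch of the opt-out this enables (assuming the setter landed in GHC.IO.Encoding as setFileSystemEncoding, alongside the setLocaleEncoding mentioned above; latin1 gives a 1:1 byte/Char mapping, so no escaping is ever involved):

import GHC.IO.Encoding (latin1, setFileSystemEncoding)

-- Must run before any FilePath is decoded: afterwards every byte 0xNN
-- decodes to the Char 0xNN and encodes back to itself, exactly.
main :: IO ()
main = do
  setFileSystemEncoding latin1
  -- ... the rest of the program sees raw, byte-for-byte filenames
  return ()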

On 9 November 2011 11:02, Simon Marlow
The performance overhead of all this worries me. withCString has taken a huge performance hit, and I think there are people who want to know that there aren't several complex encoding/decoding passes between their Haskell code and the POSIX API. We ought to be able to program to POSIX directly, and the same goes for Win32.
We are only really talking about environment variables, filenames and command line arguments here. I'm sure there are performance implications to all this decoding/encoding, but these bits of text are almost always very short and are unlikely to be causing bottlenecks. Adding a whole new API *just* to eliminate a hypothetical performance problem seems like overkill. OTOH, I'm happy to add it if we stick with using private chars for the escapes, because then using it or not using it is a *correctness* issue (albeit in rare cases). Max