Encoding-aware System.Directory functions

Hi all,

I think this is a well-known issue: it seems that there is no character decoding performed on the values returned from the functions in System.Directory (getDirectoryContents specifically). I could manually do something like (utf8Decode . S8.pack), but that presumes that the character encoding on the system in question is UTF8. So two questions:

* Is there a package out there that handles all the gory details for me automatically, and simply returns a properly decoded String (or Text)?
* If not, is there a standard way to determine the character encoding used by the filesystem, short of hard-coding the character encodings used by the major ones?

For those curious: this is in regards to a bug in wai-app-static [1].

Thanks,
Michael

[1] http://hackage.haskell.org/package/wai-app-static
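A minimal sketch of the manual workaround mentioned above, under the (unreliable) assumption that the filesystem encoding really is UTF-8. The names decodeFileName and listDirectoryUtf8 are invented for illustration; they are not part of any library discussed here.

```haskell
import qualified Data.ByteString.Char8 as S8
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)
import System.Directory (getDirectoryContents)

-- Hypothetical helper: reinterpret a FilePath whose Chars are really raw
-- bytes, assuming (!) the filesystem encoding is UTF-8. Invalid byte
-- sequences are replaced with U+FFFD rather than thrown as errors.
decodeFileName :: FilePath -> T.Text
decodeFileName = decodeUtf8With lenientDecode . S8.pack

-- List a directory, decoding each entry under the UTF-8 assumption.
listDirectoryUtf8 :: FilePath -> IO [T.Text]
listDirectoryUtf8 dir = fmap (map decodeFileName) (getDirectoryContents dir)
```

This is exactly the hard-coded guess the thread is complaining about: it silently mangles names on any system whose locale is not UTF-8.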

On Tue, Mar 29, 2011 at 11:52 PM, Michael Snoyman wrote:

* Is there a package out there that handles all the gory details for me automatically, and simply returns a properly decoded String (or Text)?
* If not, is there a standard way to determine the character encoding used by the filesystem, short of hard-coding in character encodings used by the major ones?
I started to write a thoughtful reply, but I found that the answers here sum up everything I was going to say:

http://unix.stackexchange.com/questions/2089/what-charset-encoding-is-used-f...

This same issue comes up from time to time for darcs and, if I recall correctly, the solution has been to treat unix file paths as arbitrary bytes whenever possible and to escape non-ascii compatible bytes when they occur. Otherwise it can be hard to encode them in textual patch descriptions or xml (where an encoding is required and I believe utf8 is a standard default).

I wish you luck. It's not an easy problem, at least on unix. I've heard that windows has a much easier time here as MS has provided a standard for it.

Jason

On Wed, Mar 30, 2011 at 09:26, Jason Dagit wrote:

I wish you luck. It's not an easy problem, at least on unix. I've heard that windows has a much easier time here as MS has provided a standard for it.
All the more reason, it seems, to make this available in the standard package, so people don't have to figure out how to do the conversions each time (for all the different OSes with which they might not have any experience, etc.). All modern Linuxes use UTF8 by default anyway, so in the beginning one could assume UTF8 and later change the system to be able to make more intelligent decisions (like checking environment variables for per-user settings). A way to override the assumptions made would be necessary too, I guess.

-Tako
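The environment-variable check suggested here could look something like the sketch below. codesetOf and guessFsEncoding are made-up names, and real locale strings can be messier than this little parser assumes.

```haskell
import Data.Char (toUpper)
import Data.Maybe (fromMaybe, listToMaybe, mapMaybe)
import System.Environment (lookupEnv)

-- Pull the codeset out of a locale string such as "en_US.UTF-8" or
-- "de_DE.ISO-8859-1@euro". Plain "C" or "POSIX" have no codeset part.
codesetOf :: String -> Maybe String
codesetOf loc = case dropWhile (/= '.') loc of
  ('.' : cs) -> Just (map toUpper (takeWhile (/= '@') cs))
  _          -> Nothing

-- Check the variables in the precedence order the C library uses:
-- LC_ALL overrides LC_CTYPE, which overrides LANG. Fall back to UTF-8,
-- which is the assumption proposed above for modern Linuxes.
guessFsEncoding :: IO String
guessFsEncoding = do
  vals <- mapM lookupEnv ["LC_ALL", "LC_CTYPE", "LANG"]
  let locales = [v | Just v <- vals, not (null v)]
  return (fromMaybe "UTF-8" (listToMaybe (mapMaybe codesetOf locales)))
```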

On Wed, Mar 30, 2011 at 9:26 AM, Jason Dagit wrote:

This same issue comes up from time to time for darcs and, if I recall correctly, the solution has been to treat unix file paths as arbitrary bytes whenever possible and to escape non-ascii compatible bytes when they occur. Otherwise it can be hard to encode them in textual patch descriptions or xml (where an encoding is required and I believe utf8 is a standard default).
Thanks to you (and everyone else) for the informative responses. For now, I've simply hard-coded in UTF-8 encoding for all non-Windows systems. I'm not sure how this will play with OSes besides Windows and Linux (especially Mac), but it's a good stop-gap measure.

I *do* think it would be incredibly useful to provide alternatives to all the standard operations on FilePath which use opaque datatypes and properly handle filename encoding. I noticed John Millikin's system-filepath package [1]. Do people have experience with it? It seems that adding a few functions like getDirectoryContents, plus adding a version of toString which performs some character decoding, would get us pretty far.

Michael

[1] http://hackage.haskell.org/package/system-filepath

On 30 March 2011 18:07, Michael Snoyman wrote:

Thanks to you (and everyone else) for the informative responses. For now, I've simply hard-coded in UTF-8 encoding for all non-Windows systems. I'm not sure how this will play with OSes besides Windows and Linux (especially Mac), but it's a good stop-gap measure.

I *do* think it would be incredibly useful to provide alternatives to all the standard operations on FilePath which use opaque datatypes and properly handle filename encoding. I noticed John Millikin's system-filepath package [1]. Do people have experience with it? It seems that adding a few functions like getDirectoryContents, plus adding a version of toString which performs some character decoding, would get us pretty far.

Michael

[1] http://hackage.haskell.org/package/system-filepath
It would also be great to have a package which combines the proper encoding/decoding of filepaths of the system-filepath package with the type-safety of the pathtype package: http://hackage.haskell.org/package/pathtype

Bas

On Wednesday, March 30, 2011 12:18:48 PM UTC-7, Bas van Dijk wrote:
It would also be great to have a package which combines the proper encoding/decoding of filepaths of the system-filepath package with the type-safety of the pathtype package: http://hackage.haskell.org/package/pathtype
Does that package actually work well? I don't see how it can; it's not possible to determine whether a path like "/foo/bar" or "C:\foo\bar" refers to a file or a directory, so any user input has to stay at the ambiguous type "Path ar fd". And since the filesystem is out of our control, even a function like "checkType :: Path ar fd -> IO (Either (FilePath ar) (DirPath ar))" can't provide any meaningful result. And that's before getting into UNIX symlinks, which can be files and directories at the same time.
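The objection can be made concrete with a sketch: the most a library can offer is a runtime check against the live filesystem, and the answer is only a snapshot. The phantom tags File and Dir and this checkType are illustrative only, not the pathtype package's real API.

```haskell
{-# LANGUAGE EmptyDataDecls #-}
import System.Directory (doesDirectoryExist)

-- Illustrative phantom tags in the spirit of pathtype, not its real API.
data File
data Dir
newtype Path fd = Path FilePath

-- The filesystem, not the type system, decides what a path refers to, so
-- any classification has to live in IO -- and can go stale immediately
-- after it is computed.
checkType :: FilePath -> IO (Either (Path File) (Path Dir))
checkType p = do
  isDir <- doesDirectoryExist p
  return (if isDir then Right (Path p) else Left (Path p))
```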

On 30/03/2011 08:18 PM, Bas van Dijk wrote:
It would also be great to have a package which combines the proper encoding/decoding of filepaths of the system-filepath package with the type-safety of the pathtype package: http://hackage.haskell.org/package/pathtype
Oh sweet! I was just about to write a package exactly like this. Apparently I don't need to. :-)

On Wednesday, March 30, 2011 9:07:45 AM UTC-7, Michael Snoyman wrote:
Thanks to you (and everyone else) for the informative responses. For now, I've simply hard-coded in UTF-8 encoding for all non-Windows systems. I'm not sure how this will play with OSes besides Windows and Linux (especially Mac), but it's a good stop-gap measure.
Linux, OSX, and (probably?) FreeBSD use UTF8. It's *possible* for a Linux file path to contain arbitrary bytes, but every application I've ever seen just gives up and writes "invalid character" symbols when confronted with such.

OSX's chief weirdness is that its GUI programs swap ':' and '/' when displaying filenames. So the file "hello:world.txt" will show up as "hello/world.txt" in Finder. It also performs Unicode normalization on your filenames, which is mostly harmless but can have unexpected results on unicode-naïve applications like rsync. I don't know how its normalization interacts with invalid file paths, or whether it even allows such paths to be written.

Windows's weirdness is its multi-root filesystem, and also that it has paths which are neither absolute nor relative: the Windows path "/foo.txt" is *not* absolute and *not* relative. I've never been able to figure out how Windows does Unicode; it seems to have a half-dozen APIs for it, all subtly different, and not a single damn one displays anything but "???????.txt" when I download anything east-Asian.

I *do* think it would be incredibly useful to provide alternatives to all the standard operations on FilePath which use opaque datatypes and properly handle filename encoding. I noticed John Millikin's system-filepath package[1]. Do people have experience with it? It seems that adding a few functions like getDirectoryContents, plus adding a version of toString which performs some character decoding, would get us pretty far.

system-filepath came out of my frustration with the somewhat bizarre behavior of some functions in "filepath"; I designed it to match the Python os.path API pretty closely. I don't think it has any client code outside of my ~/bin, so changing its API radically shouldn't cause any drama.

I'd prefer filesystem manipulation functions be put in a separate library (perhaps "system-directory"?), to match the current filepath/directory split. If it's to contain encoding-aware functions, I think they should be Text-only. The existing String-based functions are just there to interact with legacy functions in System.IO, and should be either renamed to "toChar8/fromChar8" or removed entirely. My vote is for the second -- if someone needs Char8 strings, they can convert from the ByteString version explicitly.

--------------------------------------
-- | Try to decode a FilePath to Text, using the current locale encoding. If
-- the filepath is invalid in the current locale, it is decoded as ASCII and
-- any non-ASCII bytes are replaced with a placeholder.
--
-- The returned text is useful only for display to the user. It might not be
-- possible to convert back to the same or any 'FilePath'.
toText :: FilePath -> Text

-- | Try to encode Text to a FilePath, using the current locale encoding. If
-- the text cannot be represented in the current locale, returns 'Nothing'.
fromText :: Text -> Maybe FilePath
--------------------------------------
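Hard-coding UTF-8 as the "current locale" for brevity, the fallback behaviour described for toText might be sketched like this. toTextSketch is a made-up name, not system-filepath's actual implementation.

```haskell
import Data.Char (chr)
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')

-- Try a strict UTF-8 decode; if any byte sequence is invalid, fall back to
-- ASCII with U+FFFD standing in for every non-ASCII byte. The result is
-- display-only, exactly as the proposed documentation warns.
toTextSketch :: B.ByteString -> T.Text
toTextSketch bs = case decodeUtf8' bs of
  Right t -> t
  Left _  -> T.pack (map ascii (B.unpack bs))
  where
    ascii b | b < 0x80  = chr (fromIntegral b)
            | otherwise = '\xFFFD'
```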

On Wed, Mar 30, 2011 at 21:07, Ivan Lazar Miljenovic <ivan.miljenovic@gmail.com> wrote:
On 31 March 2011 14:51, John Millikin wrote:

Linux, OSX, and (probably?) FreeBSD use UTF8.
For Linux, doesn't it depend upon the locale rather than forcing UTF-8?
In theory, yes. There are environment variables to specify the locale encoding, and some applications attempt to obey them. In practice, no. Both Qt and GTK+ use UTF8 internally, and react poorly when run on a non-UTF8 system. Every major distribution sets the locale encoding to UTF8. Setting a non-UTF8 encoding requires digging through various undocumented configuration files, and even then many applications will simply ignore it and use UTF8 anyway.

John Millikin writes:

OSX's chief weirdness is that its GUI programs swap ':' and '/' when displaying filenames.
A remnant from the bad old days of MacOS <10, where : was the path separator, and / was a perfectly good character to use in filenames.
-- | Try to decode a FilePath to Text, using the current locale encoding. If -- the filepath is invalid in the current locale, it is decoded as ASCII and -- any non-ASCII bytes are replaced with a placeholder.
Why not map them to individual placeholders, i.e. in a private Unicode area? This way, the conversion could be invertible, and possibly even sensibly displayable given a guesstimated alternative locale (e.g. as Latin 1 if the locale is a West European one).

-k

--
If I haven't seen further, it is by standing in the footprints of giants

On 31 March 2011 09:13, Ketil Malde wrote:
-- | Try to decode a FilePath to Text, using the current locale encoding. If -- the filepath is invalid in the current locale, it is decoded as ASCII and -- any non-ASCII bytes are replaced with a placeholder.
Why not map them to individual placeholders, i.e. in a private Unicode area?
This way, the conversion could be invertible, and possibly even sensibly displayable given a guesstimated alternative locale (e.g. as Latin 1 if the locale is a West European one).
This is what Python's PEP 383 proposes, and it is implemented in Python 3. This means that you can treat file names as strings uniformly (which is really nice), but does have disadvantages: for example, printing a string to a UTF-8 console can throw an exception if the string contains one of these "surrogate characters". Cheers, Max
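The PEP 383 trick, in a sketch: smuggle each undecodable byte into the string as a lone surrogate in U+DC80..U+DCFF. The round trip is lossless, but the resulting strings are unsafe to re-encode blindly, which is exactly the console-printing hazard mentioned above. Function names here are illustrative.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Map an undecodable byte (0x80..0xFF) to a lone surrogate, as PEP 383's
-- surrogateescape handler does. GHC's Char admits surrogate code points,
-- so this is representable (though invalid as real Unicode text).
escapeByte :: Word8 -> Char
escapeByte b = chr (0xDC00 + fromIntegral b)

-- Recover the original byte from an escaped character, if it is one.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | n >= 0xDC80 && n <= 0xDCFF = Just (fromIntegral (n - 0xDC00))
  | otherwise                  = Nothing
  where n = ord c
```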

On 31/03/2011, at 9:13 PM, Ketil Malde wrote:
John Millikin writes:

OSX's chief weirdness is that its GUI programs swap ':' and '/' when displaying filenames.
A remnant from the bad old days of MacOS <10, where : was the path separator, and / was a perfectly good character to use in filenames.
And indeed very frequently used for dates. There had to be _some_ way to deal with old file names in the new OS.

From a UNIX point of view, one peculiarity of the Mac OS X native file system is that while case preserving for creation, it is case insensitive for lookup. On my Mac laptop, where everything is native, this actually works very pleasantly, although it does mean that "fn1 and fn2 are equivalent for lookup, whatever the state of the file system" and "fn1 and fn2 are equivalent for creation, whatever the state of the file system" are different propositions. On my Mac desktop, where _some_ directories are native and _some_ are NFS, it can get confusing.

If memory serves me, Mac OS Classic recorded the script of each file name, so in a multidirectory path foo:bar:ugh:zoo each component might be byte-encoded in a different script...

On 30 March 2011 07:52, Michael Snoyman wrote:
I could manually do something like (utf8Decode . S8.pack), but that presumes that the character encoding on the system in question is UTF8. So two questions:
Funnily enough I have been thinking about this quite hard recently, and the situation is kind of a mess, and short of implementing PEP 383 (http://www.python.org/dev/peps/pep-0383/) in GHC I can't see how to make it easier on the programmer. As Jason points out, the best you can really do is probably:

1. Treat Strings that represent filenames as raw byte sequences, even though they claim to be strings.

2. When presenting such Strings to the user, re-decode them by using the current locale encoding (which will typically be UTF-8). You probably want to have some means of avoiding decoding errors here too -- ignoring or replacing undecodable bytes -- but presently this is not so straightforward. If you happen to be on a system with GNU Iconv you can use its "C//TRANSLIT//IGNORE" encoding to achieve this, however.

Cheers, Max

On 30 March 2011 20:53, Max Bolingbroke wrote:
Funnily enough I have been thinking about this quite hard recently, and the situation is kind of a mess and short of implementing PEP383 (http://www.python.org/dev/peps/pep-0383/) in GHC I can't see how to make it easier on the programmer. As Jason points out the best you can really do is probably:
1. Treat Strings that represent filenames as raw byte sequences, even though they claim to be strings
2. When presenting such Strings to the user, re-decode them by using the current locale encoding (which will typically be UTF-8). You probably want to have some means of avoiding decoding errors here too -- ignoring or replacing undecodable bytes -- but presently this is not so straightforward. If you happen to be on a system with GNU Iconv you can use it's "C//TRANSLIT//IGNORE" encoding to achieve this, however.
http://www.haskell.org/pipermail/libraries/2009-August/012493.html

I took from this discussion that FilePath really should be a pair of the actual filename ByteString, and the printable String (decoded from the ByteString, with encoding specified by the user's locale). The conversion from ByteString to String (and vice versa) is not guaranteed to be lossless, so you need to remember both.

Alistair
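The pairing Alistair describes could be sketched as follows, with UTF-8 standing in for the user's locale encoding. FsPath and its fields are invented names for illustration, not an existing library's API.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- Keep the raw bytes as the source of truth, alongside a best-effort,
-- possibly lossy rendering that is only ever shown to humans.
data FsPath = FsPath
  { rawPath     :: B.ByteString  -- exactly what the OS sees
  , displayPath :: T.Text        -- lossy decoding, for display only
  }

mkFsPath :: B.ByteString -> FsPath
mkFsPath bs = FsPath bs (decodeUtf8With lenientDecode bs)
```

Syscalls would always be fed rawPath, so nothing is lost even when displayPath had to replace undecodable bytes.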

On Wed, Mar 30, 2011 at 11:01, Alistair Bayley wrote:
http://www.haskell.org/pipermail/libraries/2009-August/012493.html
I took from this discussion that FilePath really should be a pair of the actual filename ByteString, and the printable String (decoded from the ByteString, with encoding specified by the user's locale). The conversion from ByteString to String (and vice versa) is not guaranteed to be lossless, so you need to remember both.
I'm not sure that I agree with that. Why does it have to be lossless? The problem, more likely, is the fact that FilePath is just a simple string. Maybe we should go the way of Java, where cross-platform file access is based upon a File (or the new Path) type? That way the internal representation could use whatever is necessary to ensure a unique reference to a file or directory, while at the same time providing a way to get a human-readable representation. Going from strings to file/path types would need the correct encodings to work.

Cheers, -Tako

PS: Just lurking here most of the time because I'm still a total Haskell noob, you can ignore me without risk.

On 30 March 2011 10:20, Tako Schotanus wrote:
http://www.haskell.org/pipermail/libraries/2009-August/012493.html I took from this discussion that FilePath really should be a pair of the actual filename ByteString, and the printable String (decoded from the ByteString, with encoding specified by the user's locale). The conversion from ByteString to String (and vice versa) is not guaranteed to be lossless, so you need to remember both.
My understanding is that the ByteString is the one "source of truth" about what the file is called, and you can derive the String from that by assuming some encoding, which is what I proposed in my earlier message. I guess that as an optimisation you could cache the String decoded with a particular encoding as well, but to my mind it's not obviously worth it.
I'm not sure that I agree with that. Why does it have to be loss-less? The problem, more likely, is the fact that FilePath is just a simple string. Maybe we should go the way of Java where cross-platform file access is based upon a File (or the new Path) type?
An opaque Path type has been discussed before and would indeed help a lot, but it would break backwards compatibility in a fairly major way. It might be worth it, though. Max
participants (11)

- Alistair Bayley
- Andrew Coppin
- Bas van Dijk
- Ivan Lazar Miljenovic
- Jason Dagit
- John Millikin
- Ketil Malde
- Max Bolingbroke
- Michael Snoyman
- Richard O'Keefe
- Tako Schotanus