
On Wednesday, March 30, 2011 9:07:45 AM UTC-7, Michael Snoyman wrote:
> Thanks to you (and everyone else) for the informative responses. For now, I've simply hard-coded in UTF-8 encoding for all non-Windows systems. I'm not sure how this will play with OSes besides Windows and Linux (especially Mac), but it's a good stop-gap measure.

Linux, OSX, and (probably?) FreeBSD use UTF-8. It's *possible* for a Linux file path to contain arbitrary bytes, but every application I've ever seen just gives up and writes [[invalid character]] symbols when confronted with such paths.

OSX's chief weirdness is that its GUI programs swap ':' and '/' when displaying filenames, so the file "hello:world.txt" will show up as "hello/world.txt" in Finder. It also performs Unicode normalization on your filenames, which is mostly harmless but can have unexpected results on Unicode-naïve applications like rsync.** I don't know how its normalization interacts with invalid file paths, or whether it even allows such paths to be written.

Windows's weirdness is its multi-root filesystem, and also that it distinguishes between absolute and merely non-relative paths: the Windows path "/foo.txt" is *not* absolute, but it is *not* relative either. I've never been able to figure out how Windows does Unicode; it seems to have a half-dozen APIs for it, all subtly different, and not a single damn one displays anything but "???????.txt" when I download anything east-Asian.

> I *do* think it would be incredibly useful to provide alternatives to all the standard operations on FilePath which use opaque datatypes and properly handle filename encoding. I noticed John Millikin's system-filepath package[1]. Do people have experience with it? It seems that adding a few functions like getDirectoryContents, plus adding a version of toString which performs some character decoding, would get us pretty far.

system-filepath came out of my frustration with the somewhat bizarre behavior of some functions in "filepath"; I designed it to match the Python os.path API pretty closely. I don't think it has any client code outside of my ~/bin, so changing its API radically shouldn't cause any drama.

I'd prefer filesystem manipulation functions be put in a separate library (perhaps "system-directory"?), to match the current filepath/directory split.

If it's to contain encoding-aware functions, I think they should be Text-only. The existing String-based functions are just there to interact with legacy functions in System.IO, and should be either renamed to "toChar8/fromChar8" or removed entirely. My vote goes to the second option: if someone needs Char8 strings, they can convert from the ByteString version explicitly.

--------------------------------------
-- | Try to decode a FilePath to Text, using the current locale encoding. If
-- the filepath is invalid in the current locale, it is decoded as ASCII and
-- any non-ASCII bytes are replaced with a placeholder.
--
-- The returned text is useful only for display to the user. It might not be
-- possible to convert back to the same or any 'FilePath'.
toText :: FilePath -> Text

-- | Try to encode Text to a FilePath, using the current locale encoding. If
-- the text cannot be represented in the current locale, returns 'Nothing'.
fromText :: Text -> Maybe FilePath
--------------------------------------
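
For illustration only, here is a minimal sketch of how the two signatures above might behave. It is not taken from system-filepath. It assumes UTF-8 stands in for "the current locale encoding", uses a hypothetical `RawPath` newtype over raw bytes in place of the package's `FilePath`, and picks U+FFFD as the placeholder. Because every `Text` value can be encoded as UTF-8, the only failure this sketch models in `fromText` is an embedded NUL, which POSIX paths cannot contain.

```haskell
import qualified Data.ByteString as B
import           Data.Char (chr)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical opaque path type: just the raw bytes of the path.
newtype RawPath = RawPath B.ByteString
  deriving (Eq, Show)

-- Decode for display. Assumption: the "locale" is always UTF-8.
-- If the bytes are not valid UTF-8, fall back to ASCII and replace
-- every non-ASCII byte with U+FFFD.
toText :: RawPath -> T.Text
toText (RawPath bytes) =
  case TE.decodeUtf8' bytes of
    Right t -> t
    Left _  -> T.pack (map asciiOrPlaceholder (B.unpack bytes))
  where
    asciiOrPlaceholder w
      | w < 0x80  = chr (fromIntegral w)
      | otherwise = '\xFFFD'

-- Encode Text as a path. Assumption: UTF-8 again, so encoding itself
-- cannot fail; only an embedded NUL is rejected.
fromText :: T.Text -> Maybe RawPath
fromText t
  | T.any (== '\NUL') t = Nothing
  | otherwise           = Just (RawPath (TE.encodeUtf8 t))
```

Note that the fallback branch of `toText` is lossy by design, which matches the comment above that the result is only fit for display and may not convert back to the same path.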