Re: [Haskell-cafe] Encoding-aware System.Directory functions

31 Mar 2011

      On Wednesday, March 30, 2011 9:07:45 AM UTC-7, Michael Snoyman wrote:
...
Thanks to you (and everyone else) for the informative responses. For
now, I've simply hard-coded in UTF-8 encoding for all non-Windows
systems. I'm not sure how this will play with OSes besides Windows and
Linux (especially Mac), but it's a good stop-gap measure.

Linux, OSX, and (probably?) FreeBSD use UTF-8. It's *possible* for a Linux
file path to contain arbitrary bytes, but every application I've ever seen
just gives up and writes [[invalid character]] symbols when confronted with
such.
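
A minimal sketch of that "give up and show a placeholder" behaviour, assuming the "text" and "bytestring" packages are available. The name displayPath is hypothetical and the result is only fit for display, not for reopening the file:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- | Decode raw path bytes for display. Any byte sequence that is not
-- valid UTF-8 is replaced with U+FFFD, so the conversion is lossy.
-- (Hypothetical helper, not part of any existing library.)
displayPath :: B.ByteString -> T.Text
displayPath = decodeUtf8With lenientDecode

-- | A name that Linux accepts but that is not valid UTF-8:
-- 'a', a stray 0xFF byte, then 'b'.
strayByteName :: B.ByteString
strayByteName = B.pack [0x61, 0xFF, 0x62]
```

Here displayPath strayByteName keeps the 'a' and 'b' and puts U+FFFD where the 0xFF byte was, which is why the original bytes cannot be recovered from the displayed text.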

OSX's chief weirdness is that its GUI programs swap ':' and '/' when 
displaying filenames. So the file "hello:world.txt" will show up as 
"hello/world.txt" in Finder. It also performs Unicode normalization on your 
filenames, which is mostly harmless but can have unexpected results on 
Unicode-naïve applications like rsync.** I don't know how its normalization
interacts with invalid file paths, or whether it even allows such paths to 
be written.
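
To illustrate why normalization trips up tools that compare names code point by code point, here is a small sketch. It assumes the filesystem hands back the decomposed form of a name that was written in precomposed form. The identifiers are made up for the example:

```haskell
-- | "café.txt" with the accented letter as one precomposed code point
-- (U+00E9), which is what most input methods produce.
composed :: String
composed = "caf\xE9.txt"

-- | The same visible name in decomposed form: a plain 'e' followed by
-- U+0301 COMBINING ACUTE ACCENT.
decomposed :: String
decomposed = "cafe\x301.txt"

-- | A naive comparison, as an application with no Unicode awareness
-- would perform it: plain code-point equality.
naiveSameName :: String -> String -> Bool
naiveSameName = (==)
```

Both strings render identically, yet naiveSameName composed decomposed is False and the two differ in length (8 versus 9 code points), so a sync tool that sent one form and later lists the other can conclude the file is new.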

Windows' weirdness is its multi-root filesystem, and also that it
distinguishes between absolute paths and paths that are merely
non-relative. The Windows path "/foo.txt" is *not* absolute and *not*
relative: it names no drive, so it resolves against the root of whichever
drive is current. I've never been able to figure out how Windows does
Unicode; it seems to have a half-dozen APIs for it, all subtly different,
and not a single damn one displays anything but "???????.txt" when I
download anything east-Asian.

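The absolute/non-relative distinction can be sketched as a small classifier. This is a hand-rolled illustration, not the behaviour of "filepath" or system-filepath; the type and function names are hypothetical, and UNC paths are handled only crudely:

```haskell
import Data.Char (isAsciiLower, isAsciiUpper)

-- | Hypothetical classification of Windows path strings.
data WinPathKind
  = FullyQualified  -- "C:\foo.txt": names a drive and starts at its root
  | RootedNoDrive   -- "/foo.txt": root of whichever drive is current
  | DriveRelative   -- "C:foo.txt": current directory of drive C
  | Relative        -- "foo.txt": current directory of the current drive
  deriving (Eq, Show)

-- | Windows accepts both slash directions as separators.
isSep :: Char -> Bool
isSep c = c == '/' || c == '\\'

isDriveLetter :: Char -> Bool
isDriveLetter c = isAsciiUpper c || isAsciiLower c

classify :: String -> WinPathKind
classify (d : ':' : rest)
  | isDriveLetter d = case rest of
      (c : _) | isSep c -> FullyQualified
      _                 -> DriveRelative
classify (a : b : _)
  | isSep a && isSep b = FullyQualified  -- crude: treat "\\server\..." as absolute
classify (c : _)
  | isSep c = RootedNoDrive
classify _ = Relative
```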
I *do* think it would be incredibly useful to provide alternatives to
all the standard operations on FilePath which use opaque datatypes
and properly handle filename encoding. I noticed John Millikin's
system-filepath package[1]. Do people have experience with it? It
seems that adding a few functions like getDirectoryContents, plus
adding a version of toString which performs some character decoding,
would get us pretty far.

system-filepath grew out of my frustration with the somewhat bizarre
behavior of some functions in "filepath"; I designed it to match the Python
os.path API pretty closely. I don't think it has any client code outside of
my ~/bin, so changing its API radically shouldn't cause any drama.

I'd prefer filesystem manipulation functions be put in a separate library 
(perhaps "system-directory"?), to match the current filepath/directory 
split.

If it's to contain encoding-aware functions, I think they should be
Text-only. The existing String-based functions are there just to interact
with legacy functions in System.IO, and should be either renamed to
"toChar8/fromChar8" or removed entirely. My vote goes to the second -- if
someone needs Char8 strings, they can convert from the ByteString version
explicitly.

--------------------------------------
-- | Try to decode a FilePath to Text, using the current locale encoding. If
-- the filepath is invalid in the current locale, it is decoded as ASCII and
-- any non-ASCII bytes are replaced with a placeholder.
--
-- The returned text is useful only for display to the user. It might not be
-- possible to convert back to the same or any 'FilePath'.
toText :: FilePath -> Text

-- | Try to encode Text to a FilePath, using the current locale encoding. If
-- the text cannot be represented in the current locale, returns 'Nothing'.
fromText :: Text -> Maybe FilePath
--------------------------------------
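
A rough sketch of how those two signatures might behave, under loud assumptions: RawPath is a hypothetical stand-in for the opaque FilePath type (just the raw bytes), and the locale is passed explicitly as a made-up Locale value instead of being read from the environment. Only UTF-8 and Latin-1 are modelled:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- | Hypothetical stand-in for the opaque path type: the raw bytes.
newtype RawPath = RawPath B.ByteString
  deriving (Eq, Show)

-- | Hypothetical locale tag; a real version would query the environment.
data Locale = Utf8 | Latin1

-- | Lossy decode for display. Invalid UTF-8 falls back to ASCII with
-- U+FFFD standing in for every non-ASCII byte.
toTextIn :: Locale -> RawPath -> T.Text
toTextIn Latin1 (RawPath bytes) = TE.decodeLatin1 bytes  -- every byte is valid Latin-1
toTextIn Utf8 (RawPath bytes) =
  case TE.decodeUtf8' bytes of
    Right text -> text
    Left _     -> T.pack (map asciiOrPlaceholder (B.unpack bytes))
  where
    asciiOrPlaceholder :: Word8 -> Char
    asciiOrPlaceholder w
      | w < 0x80  = toEnum (fromIntegral w)
      | otherwise = '\xFFFD'

-- | Encode text as a path, or Nothing when the locale cannot represent it.
fromTextIn :: Locale -> T.Text -> Maybe RawPath
fromTextIn Utf8 text = Just (RawPath (TE.encodeUtf8 text))
fromTextIn Latin1 text
  | T.all (<= '\xFF') text =
      Just (RawPath (B.pack (map (fromIntegral . fromEnum) (T.unpack text))))
  | otherwise = Nothing
```

With these definitions, a path holding the bytes 0x66 0x6F 0xFF decodes under Utf8 as "fo" followed by the placeholder, and fromTextIn Latin1 returns Nothing for text containing U+4E16, since Latin-1 has no encoding for it. Note that toTextIn followed by fromTextIn does not round-trip once a placeholder has been substituted, which is the caveat the doc comment on toText spells out.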