On Wed, Mar 30, 2011 at 09:26, Jason Dagit <dagitj@gmail.com> wrote:


On Tue, Mar 29, 2011 at 11:52 PM, Michael Snoyman <michael@snoyman.com> wrote:
Hi all,

I think this is a well-known issue: it seems that there is no
character decoding performed on the values returned from the functions
in System.Directory (getDirectoryContents specifically). I could
manually do something like (utf8Decode . S8.pack), but that presumes
that the character encoding on the system in question is UTF8. So two
questions:

* Is there a package out there that handles all the gory details for
me automatically, and simply returns a properly decoded String (or
Text)?
* If not, is there a standard way to determine the character encoding
used by the filesystem, short of hard-coding in character encodings
used by the major ones?
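
For reference, the manual approach mentioned above could look roughly like the sketch below. It assumes, as described in this message, that each Char of the returned String holds one undecoded byte, and it hard-codes UTF-8. The name decodeNameUtf8 is made up for illustration; decodeUtf8' comes from the text package and S8.pack from bytestring.

```haskell
import qualified Data.ByteString.Char8 as S8
import qualified Data.Text as T
import           Data.Text.Encoding (decodeUtf8')

-- | Hypothetical helper: treat each Char of a raw file name as one byte
-- and try to decode the bytes as UTF-8. Returns Nothing when the bytes
-- are not valid UTF-8, i.e. when the UTF-8 assumption fails for this name.
decodeNameUtf8 :: String -> Maybe String
decodeNameUtf8 raw =
  either (const Nothing) (Just . T.unpack) (decodeUtf8' (S8.pack raw))
```

Returning Maybe rather than throwing makes the failure case explicit, which matters precisely because the encoding is only a guess.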

I started to write a thoughtful reply, but I found that the answers here sum up everything I was going to say:
http://unix.stackexchange.com/questions/2089/what-charset-encoding-is-used-for-filenames-and-paths-on-linux

This same issue comes up from time to time for darcs and, if I recall correctly, the solution has been to treat Unix file paths as arbitrary bytes whenever possible and to escape bytes that are not ASCII-compatible when they occur.  Otherwise it can be hard to encode them in textual patch descriptions or XML (where an encoding is required and I believe UTF-8 is a standard default).
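
To illustrate the "arbitrary bytes plus escaping" idea, here is a minimal sketch. The escaping scheme and the name escapePath are invented for this example and are not darcs's actual format: printable ASCII passes through, and everything else (including the backslash itself, so the result stays unambiguous) is written as \xNN.

```haskell
import qualified Data.ByteString as B
import           Data.Word (Word8)
import           Text.Printf (printf)

-- | Hypothetical escaping scheme (not darcs's real one): printable ASCII
-- bytes other than the backslash are kept as-is, and every other byte
-- becomes \xNN with two lowercase hex digits.
escapePath :: B.ByteString -> String
escapePath = concatMap escapeByte . B.unpack
  where
    escapeByte :: Word8 -> String
    escapeByte w
      | w >= 0x20 && w < 0x7f && w /= 0x5c = [toEnum (fromIntegral w)]
      | otherwise                          = printf "\\x%02x" w
```

The output is pure ASCII, so it can be embedded in a patch description or an XML document regardless of what encoding the original file name used.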

I wish you luck.  It's not an easy problem, at least on Unix.  I've heard that Windows has a much easier time here, as MS has provided a standard for it.

All the more reason, it seems, to make this available in a standard package, so people don't have to figure out how to do the conversions each time (for all the different OSes, with which they might not have any experience, etc.).

Most modern Linux distributions use UTF-8 by default anyway, so in the beginning one could assume UTF-8 and later change the system to make more intelligent decisions (like checking environment variables for per-user settings). A way to override the assumptions made would be necessary too, I guess.
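
A rough sketch of that policy, under stated assumptions: an explicit override wins, otherwise the codeset is taken from the first non-empty variable among LC_ALL, LC_CTYPE and LANG (the usual POSIX precedence), and otherwise UTF-8 is assumed. The names guessEncoding and codeset are made up for illustration, and the result is just an encoding name, not a decoder.

```haskell
import Data.Maybe (fromMaybe, listToMaybe)

-- | Hypothetical helper: choose an encoding name for file names.
-- The first argument is an optional explicit override; the second is the
-- process environment as (variable, value) pairs.
guessEncoding :: Maybe String -> [(String, String)] -> String
guessEncoding (Just enc) _ = enc
guessEncoding Nothing env =
  fromMaybe "UTF-8" (listToMaybe localeValues >>= codeset)
  where
    -- Non-empty locale settings, highest precedence first.
    localeValues =
      [ val
      | var <- ["LC_ALL", "LC_CTYPE", "LANG"]
      , Just val <- [lookup var env]
      , not (null val)
      ]

-- | Extract the codeset from a locale name such as "de_DE.ISO-8859-1@euro".
-- Returns Nothing when the name carries no codeset (e.g. "C").
codeset :: String -> Maybe String
codeset locale =
  case break (== '.') locale of
    (_, '.' : rest)
      | cs <- takeWhile (/= '@') rest, not (null cs) -> Just cs
    _ -> Nothing
```

In a real program the environment list could come from System.Environment.getEnvironment. Keeping the function pure makes the override and fallback rules easy to test.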

-Tako