Filename encoding error (was: Perform a research a la Unix 'find')

In fact the encoding problem is more general.
When I simply do 'readFile "bar/fooé"', then I'm told:
*** Exception: bar/fooé: openFile: does not exist (No such file or directory)
How am I supposed to read files whose names contain non-ASCII characters?
(I use GHC 6.12.3 under Ubuntu 10.04 32bits)

On 22.08.2010 21:23, Yves Parès wrote:
In fact the encoding problem is more general.
When I simply do 'readFile "bar/fooé"', then I'm told: *** Exception: bar/fooé: openFile: does not exist (No such file or directory)
How am I supposed to read files whose names contain non-ASCII characters? (I use GHC 6.12.3 under Ubuntu 10.04 32bits)
Unicode handling in GHC has always been a problem; there are corresponding tickets in the bug tracker [1,2]. You have to manually encode/decode strings to/from UTF-8, which obviously works only for UTF-8 locales.

[1] http://hackage.haskell.org/trac/ghc/ticket/3307
[2] http://hackage.haskell.org/trac/ghc/ticket/3309
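For instance, a minimal sketch of that manual workaround, assuming the utf8-string package (the helper name here is illustrative only):

import Codec.Binary.UTF8.String (encodeString)

-- Re-encode each Char of a Unicode path as UTF-8 bytes, one Char per
-- byte, which is the byte-per-Char form GHC 6.12's IO layer passes to
-- the OS. This is only correct under a UTF-8 locale.
readFileUTF8Path :: FilePath -> IO String
readFileUTF8Path path = readFile (encodeString path)

-- e.g. readFileUTF8Path "bar/fooé"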

On Sunday 22 August 2010 19:23:03, Yves Parès wrote:
In fact the encoding problem is more general.
When I simply do 'readFile "bar/fooé"', then I'm told: *** Exception: bar/fooé: openFile: does not exist (No such file or directory)
Try

ghci> readFile (Data.ByteString.Char8.unpack (Data.ByteString.UTF8.fromString "fooé"))

(the same trick works for find). The problem is probably that readFile filePath truncates the characters in filePath to 8 bits, while the filepath on your system is UTF-8 encoded, so you have to give a pseudo-UTF-8-encoded filepath to readFile. At least, that's how it works here, inconvenient though it is.
How am I supposed to read files whose names contain non-ASCII characters? (I use GHC 6.12.3 under Ubuntu 10.04 32bits)
While the inconvenience lasts (people are thinking about how to handle the situation correctly), avoid non-ASCII characters in filepaths if possible.
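Wrapped as a reusable helper, that trick might look like this (a sketch assuming the bytestring and utf8-string packages; the function name is mine):

import qualified Data.ByteString.Char8 as B8
import qualified Data.ByteString.UTF8 as BU  -- from utf8-string

-- UTF-8-encode the path's characters into bytes, then read each byte
-- back as a single Char: the "pseudo-UTF-8" form readFile expects here.
toPseudoUTF8 :: FilePath -> FilePath
toPseudoUTF8 = B8.unpack . BU.fromString

-- e.g. readFile (toPseudoUTF8 "bar/fooé")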
My locale is fr_FR.utf8. For instance, with HSH: I have a 'bar' directory containing a file 'fooé'.
run $ "find bar" :: IO [String] returns: ["bar", "bar/foo\233"]
That one is okay, 'é' is '\233' and the Show instance for Char escapes all characters > '\127'.
and run $ "find bar -name fooé" returns []
Maybe the same issue; try run $ "find bar -name foo\195\169"
When I provoke an error by running run $ "find fooé", it says: find: "foo\351": No such file or directory
On the other hand, if it now says \351, which is ş, there seems to be something else amiss.
So it is not the same encoding!
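A quick ghci check makes the mismatch concrete (assuming utf8-string is installed, and that ghci reads the 'é' at the prompt correctly):

ghci> :m +Data.Char Codec.Binary.UTF8.String
ghci> map ord "é"
[233]
ghci> map ord (encodeString "é")
[195,169]

So a correctly UTF-8-encoded 'é' shows up as the bytes \195\169 (matching the foo\195\169 suggestion above), while \351 (ş) cannot result from UTF-8-encoding 'é', which suggests a different decoding step is involved.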

I've been banging my head on the same issues. To summarise: GHC 6.12 strings are Unicode; unix file paths are slightly restricted byte strings; the former is used to represent the latter, leading to great confusion; and the best way to fix it is unclear. Here's a workaround I wrote this morning:

import qualified Codec.Binary.UTF8.String as UTF8  -- from the utf8-string package
import System.Info (os)

-- | A platform string is a string value from or for the operating system,
-- such as a file path or command-line argument (or environment variable's
-- name or value?). On some platforms (such as unix) these are not real
-- unicode strings but have some encoding such as UTF-8. This alias does
-- no type enforcement but aids code clarity.
type PlatformString = String

-- | Convert a possibly encoded platform string to a real unicode string.
-- We decode the UTF-8 encoding recommended for unix systems
-- (cf http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html)
-- and leave anything else unchanged.
fromPlatformString :: PlatformString -> String
fromPlatformString s = if UTF8.isUTF8Encoded s then UTF8.decodeString s else s

-- | Convert a unicode string to a possibly encoded platform string.
-- On unix we encode with the recommended UTF-8
-- (cf http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html)
-- and elsewhere we leave it unchanged.
toPlatformString :: String -> PlatformString
toPlatformString = case os of
                     "unix"   -> UTF8.encodeString
                     "linux"  -> UTF8.encodeString
                     "darwin" -> UTF8.encodeString
                     _        -> id
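A usage sketch building on the definitions above (getDirectoryContents is from System.Directory; the directory and file name are just the ones from this thread):

import System.Directory (getDirectoryContents)

main :: IO ()
main = do
  -- Directory entries arrive as platform strings; decode them for display.
  entries <- getDirectoryContents "bar"
  mapM_ (putStrLn . fromPlatformString) entries
  -- Encode a real Unicode path before handing it to the IO layer.
  readFile (toPlatformString "bar/fooé") >>= putStr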
participants (4):
- Alexey Khudyakov
- Daniel Fischer
- Simon Michael
- Yves Parès