Filename encoding error (was: Perform a research a la Unix 'find')

In fact the encoding problem is more general.
When I simply do 'readFile "bar/fooé"', then I'm told:
*** Exception: bar/fooé: openFile: does not exist (No such file or directory)
How am I supposed to read files whose names contain non-ASCII characters?
(I use GHC 6.12.3 under Ubuntu 10.04 32bits)

On 22.08.2010 21:23, Yves Parès wrote:
In fact the encoding problem is more general.
When I simply do 'readFile "bar/fooé"', then I'm told: *** Exception: bar/fooé: openFile: does not exist (No such file or directory)
How am I supposed to read files whose names contain non-ASCII characters? (I use GHC 6.12.3 under Ubuntu 10.04 32bits)
Unicode handling in GHC has always been a problem; there are corresponding tickets in the bug tracker [1,2]. You have to manually encode/decode strings to/from UTF-8, which obviously works only for UTF-8 locales.

[1] http://hackage.haskell.org/trac/ghc/ticket/3307
[2] http://hackage.haskell.org/trac/ghc/ticket/3309
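For instance, a minimal sketch of that manual workaround, assuming the utf8-string package (the helper name here is illustrative only):

import Codec.Binary.UTF8.String (encodeString)

-- Re-encode each Char of a Unicode path as UTF-8 bytes, one Char per
-- byte, which is the byte-per-Char form GHC 6.12's IO layer passes to
-- the OS. This is only correct under a UTF-8 locale.
readFileUTF8Path :: FilePath -> IO String
readFileUTF8Path path = readFile (encodeString path)

-- e.g. readFileUTF8Path "bar/fooé"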

On Sunday 22 August 2010 19:23:03, Yves Parès wrote:
In fact the encoding problem is more general.
When I simply do 'readFile "bar/fooé"', then I'm told: *** Exception: bar/fooé: openFile: does not exist (No such file or directory)
Try

ghci> readFile (Data.ByteString.Char8.unpack (Data.ByteString.UTF8.fromString "fooé"))

(the same trick works for find). The problem is probably that readFile filePath truncates the characters in filePath to 8 bits, while the filepath on your system is UTF-8 encoded, so you have to give a pseudo-UTF-8-encoded filepath to readFile. At least, that's how it works here, inconvenient though it is.
How am I supposed to read files whose names contain non-ASCII characters? (I use GHC 6.12.3 under Ubuntu 10.04 32bits)
While the inconvenience lasts (people are thinking about how to handle the situation correctly), avoid non-ASCII characters in filepaths if possible.
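Wrapped as a reusable helper, that trick might look like this (a sketch assuming the bytestring and utf8-string packages; the function name is mine):

import qualified Data.ByteString.Char8 as B8
import qualified Data.ByteString.UTF8 as BU  -- from utf8-string

-- UTF-8-encode the path's characters into bytes, then read each byte
-- back as a single Char: the "pseudo-UTF-8" form readFile expects here.
toPseudoUTF8 :: FilePath -> FilePath
toPseudoUTF8 = B8.unpack . BU.fromString

-- e.g. readFile (toPseudoUTF8 "bar/fooé")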
My locale is fr_FR.utf8. For instance, with HSH: I have a 'bar' directory containing a file 'fooé'.
run $ "find bar" :: IO [String] returns: ["bar", "bar/foo\233"]
That one is okay, 'é' is '\233' and the Show instance for Char escapes all characters > '\127'.
and run $ "find bar -name fooé" returns []
Maybe the same issue; try run $ "find bar -name foo\195\169"
When I provoke an error by running run $ "find fooé", it says: find: "foo\351": No such file or directory
On the other hand, if it now says \351, which is ş, there seems to be something else amiss.
So it is not the same encoding!
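A quick ghci check makes the mismatch concrete (assuming utf8-string is installed, and that ghci reads the 'é' at the prompt correctly):

ghci> :m +Data.Char Codec.Binary.UTF8.String
ghci> map ord "é"
[233]
ghci> map ord (encodeString "é")
[195,169]

So a correctly UTF-8-encoded 'é' shows up as the bytes \195\169 (matching the foo\195\169 suggestion above), while \351 (ş) cannot result from UTF-8-encoding 'é', which suggests a different decoding step is involved.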

I've been banging my head on the same issues. To summarise: GHC 6.12 strings are Unicode; unix file paths are slightly restricted byte strings; the former is used to represent the latter, leading to great confusion; and the best way to fix it is unclear. Here's a workaround I wrote this morning:

import qualified Codec.Binary.UTF8.String as UTF8  -- from the utf8-string package
import System.Info (os)

-- | A platform string is a string value from or for the operating system,
-- such as a file path or command-line argument (or environment variable's
-- name or value?). On some platforms (such as unix) these are not real
-- unicode strings but have some encoding such as UTF-8. This alias does
-- no type enforcement but aids code clarity.
type PlatformString = String

-- | Convert a possibly encoded platform string to a real unicode string.
-- We decode the UTF-8 encoding recommended for unix systems
-- (cf http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html)
-- and leave anything else unchanged.
fromPlatformString :: PlatformString -> String
fromPlatformString s = if UTF8.isUTF8Encoded s then UTF8.decodeString s else s

-- | Convert a unicode string to a possibly encoded platform string.
-- On unix we encode with the recommended UTF-8
-- (cf http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html)
-- and elsewhere we leave it unchanged.
toPlatformString :: String -> PlatformString
toPlatformString = case os of
                     "unix"   -> UTF8.encodeString
                     "linux"  -> UTF8.encodeString
                     "darwin" -> UTF8.encodeString
                     _        -> id
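A usage sketch building on the definitions above (getDirectoryContents is from System.Directory; the directory and file name are just the ones from this thread):

import System.Directory (getDirectoryContents)

main :: IO ()
main = do
  -- Directory entries arrive as platform strings; decode them for display.
  entries <- getDirectoryContents "bar"
  mapM_ (putStrLn . fromPlatformString) entries
  -- Encode a real Unicode path before handing it to the IO layer.
  readFile (toPlatformString "bar/fooé") >>= putStr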
participants (4):
- Alexey Khudyakov
- Daniel Fischer
- Simon Michael
- Yves Parès