Unicode workaround for getDirectoryContents under Windows?

Shu-yu Guo

14 Jun 2009

3:41 a.m.

Hello all,

It seems like getDirectoryContents applies codepage conversion based on the default program locale under Windows. What this means is that if my default codepage is some kind of Latin, Asian glyphs get returned as '?' in the filename. By '?' I don't mean that the font is lacking the glyph and rendering it as '?', but I mean 'show (head (getDirectoryContents "C:\\Music"))' returns something that looks like "?? ????".

This is a problem as I can't get the filenames of my music directory, some of which are in Japanese and Chinese, some of which have accents. If I change the default codepage to Japanese, say, then I get the Japanese filenames in Shift-JIS and I lose all the accented letters.

I have filed this as a bug already, but is there a workaround in the meantime (I don't know the Win32 API, but didn't see anything that looked like it would help under System.Win32 anyways) that lets me get the list of files in a directory, encoded in some kind of Unicode?

Cheers,
-- shu


Judah Jacobson

14 Jun 2009

4:56 a.m.

On Sat, Jun 13, 2009 at 8:41 PM, Shu-yu Guo wrote:

...

Hello all,

It seems like getDirectoryContents applies codepage conversion based on the default program locale under Windows. What this means is that if my default codepage is some kind of Latin, Asian glyphs get returned as '?' in the filename. By '?' I don't mean that the font is lacking the glyph and rendering it as '?', but I mean 'show (head (getDirectoryContents "C:\\Music"))' returns something that looks like like "?? ????".

This is a problem as I can't get the filenames of my music directory, some of which are in Japanese and Chinese, some of which have accents. If I change the default codepage to Japanese, say, then I get the Japanese filenames in Shift-JIS and I lose all the accented letters.

I have filed this as a bug already, but is there a workaround in the meantime (I don't know the Win32 API, but didn't see anything that looked like it would help under System.Win32 anyways) that lets me gets the list of files in a directory that's encoded in some kind of Unicode?

Try taking a look at the code in the following module, which uses FFI to access the Unicode-aware Win32 APIs:

http://code.haskell.org/haskeline/System/Console/Haskeline/Directory.hsc

Hope that helps,
-Judah

Simon Marlow

16 Jun 2009

11:30 a.m.

New subject: Unicode workaround for getDirectoryContents under Windows?

On 14/06/2009 05:56, Judah Jacobson wrote:

...

On Sat, Jun 13, 2009 at 8:41 PM, Shu-yu Guo wrote:

...
Hello all,

It seems like getDirectoryContents applies codepage conversion based on the default program locale under Windows. What this means is that if my default codepage is some kind of Latin, Asian glyphs get returned as '?' in the filename. By '?' I don't mean that the font is lacking the glyph and rendering it as '?', but I mean 'show (head (getDirectoryContents "C:\\Music"))' returns something that looks like like "?? ????".

This is a problem as I can't get the filenames of my music directory, some of which are in Japanese and Chinese, some of which have accents. If I change the default codepage to Japanese, say, then I get the Japanese filenames in Shift-JIS and I lose all the accented letters.

I have filed this as a bug already, but is there a workaround in the meantime (I don't know the Win32 API, but didn't see anything that looked like it would help under System.Win32 anyways) that lets me gets the list of files in a directory that's encoded in some kind of Unicode?

Try taking a look at the code in the following module, which uses FFI to access the Unicode-aware Win32 APIs:

http://code.haskell.org/haskeline/System/Console/Haskeline/Directory.hsc

Care to submit a patch to put this in System.Directory, or better still put the relevant functionality in System.Win32 and use it in System.Directory? Cheers, Simon

Bulat Ziganshin

11:42 a.m.

New subject: Unicode workaround for getDirectoryContents under Windows?

Hello Simon, Tuesday, June 16, 2009, 3:30:31 PM, you wrote:

...

Care to submit a patch to put this in System.Directory, or better still put the relevant functionality in System.Win32 and use it in System.Directory?

Simon, it will somewhat break openFile. let's see. there are 3 types of filenames:

1) english (latin-1) only
2) including local (ansi code page) chars
3) including any other unicode chars

now getDirectoryContents returns ACP (ansi code page) names, so openFile works for files 1) and 2)

with such a change getDirectoryContents will return correct unicode names, so openFile will work only with names in the first group

the right way is to fix ALL string-related calls in System.IO, System.Posix.Internals, System.Environment

--
Best regards,
Bulat mailto:Bulat.Ziganshin@gmail.com

Simon Marlow

12:34 p.m.

New subject: Unicode workaround for getDirectoryContents under Windows?

On 16/06/2009 12:42, Bulat Ziganshin wrote:

...

Hello Simon,

Tuesday, June 16, 2009, 3:30:31 PM, you wrote:

...
Care to submit a patch to put this in System.Directory, or better still put the relevant functionality in System.Win32 and use it in System.Directory?

Simon, it will somewhat broke openFile. let's see. there are 3 types of filenames -

1) english (latin-1) only 2) including local (ansi code page) chars 3) including any other unicode chars

now getDirectoryContents return ACP (ansi code page) names so openFile works for files 1) and 2)

with such change getDirectoryContents will return correct unicode names, so openFile will work only with names in first group

the right way is to fix ALL string-related calls in System.IO, System.Posix.Internals, System.Environment

You're right in that we really ought to fix everything. However, I'm happy to just fix some of these things, even if it introduces some inconsistencies in the meantime. We already have much of System.Directory working with Unicode FilePaths, so there are already inconsistencies here. Thanks for reminding me that openFile is also broken. It's easily fixed, so I'll look into that. Cheers, Simon

Yitzchak Gale

12:46 p.m.

New subject: Unicode workaround for getDirectoryContents under Windows?

Simon Marlow wrote:

...

...
...
Care to submit a patch to put this in System.Directory, or better still put the relevant functionality in System.Win32 and use it in System.Directory?

Bulat Ziganshin wrote:

...

...
now getDirectoryContents return ACP (ansi code page) names so openFile works for files 1) and 2). With such change getDirectoryContents will return correct unicode names, so openFile will work only with names in first group. The right way is to fix ALL string-related calls in System.IO, System.Posix.Internals, System.Environment.

...

You're right in that we really ought to fix everything. However, I'm happy to just fix some of these things, even if it introduces some inconsistencies in the meantime. We already have much of System.Directory working with Unicode FilePaths, so there are already inconsistencies here.

+1 for integrating Unicode file paths. Thanks, Bulat!

I think the most important use cases that should not break are:

o open/read/write a FilePath from getArgs
o open/read/write a FilePath from getDirectoryContents

There's not much we can do about non-Latin-1 ACP file paths hard coded in Strings. I hope there aren't too many of those in the wild.

Regards,
Yitz

Simon Marlow

1:02 p.m.

New subject: Unicode workaround for getDirectoryContents under Windows?

On 16/06/2009 13:46, Yitzchak Gale wrote:

...

Simon Marlow wrote:

...
...
...
Care to submit a patch to put this in System.Directory, or better still put the relevant functionality in System.Win32 and use it in System.Directory?

Bulat Ziganshin wrote:

...
...
now getDirectoryContents return ACP (ansi code page) names so openFile works for files 1) and 2). With such change getDirectoryContents will return correct unicode names, so openFile will work only with names in first group. The right way is to fix ALL string-related calls in System.IO, System.Posix.Internals, System.Environment.

...
You're right in that we really ought to fix everything. However, I'm happy to just fix some of these things, even if it introduces some inconsistencies in the meantime. We already have much of System.Directory working with Unicode FilePaths, so there are already inconsistencies here.

+1 for integrating Unicode file paths. Thanks, Bulat!

Excuse my ignorance, but... what Unicode file paths?

...

I think the most important use cases that should not break are:

o open/read/write a FilePath from getArgs o open/read/write a FilePath from getDirectoryContents

There's not much we can do about non-Latin-1 ACP file paths hard coded in Strings. I hope there aren't too many of those in the wild.

The following cases are currently broken:

* Calling openFile on a literal Unicode FilePath (note, not ACP-encoded, just Unicode).
* Reading a Unicode FilePath from a text file and then calling openFile on it

I propose to fix these (on Windows). It will mean that your second case above will be broken, until someone fixes getDirectoryContents.

Also currently broken:

* calling removeFile on a FilePath you get from getDirectoryContents, amongst other System.Directory operations

Fixing getDirectoryContents will fix these.

I don't know how getArgs fits in here - should we be decoding argv using the ACP?

Cheers,
Simon

Bulat Ziganshin

1:56 p.m.

New subject: Re[2]: Re: Unicode workaround for getDirectoryContents under Windows?

Hello Simon, Tuesday, June 16, 2009, 5:02:43 PM, you wrote:

...

Also currently broken:

...

* calling removeFile on a FilePath you get from getDirectoryContents, amongst other System.Directory operations

...

Fixing getDirectoryContents will fix these.

no. removeFile like anything else also uses ACP-based api

...

I don't know how getArgs fits in here - should we be decoding argv using the ACP?

well, the whole story:

windows internally uses Unicode for handling strings. externally, it provides 2 API families:

1) A-family (such as CreateFileA) uses 8-bit char-based strings. these strings are encoded using the current locale. the first 128 chars are common for all codepages, providing the ASCII char set; the higher 128 chars are locale-specific. say, for the German locale, it provides chars with umlauts, for the Russian locale - cyrillic chars

2) W-family (such as CreateFileW) uses UTF-16 encoded 16-bit wchar-based strings, which are locale-independent

Windows libraries emulate the POSIX API (open, opendir, stat and so on) by translating these (char-based) calls into the A-family. GHC libs are written the Unix way, so these are effectively tied to the A-family of the Win API

Windows libraries also provide a w* variant of the POSIX API (wopen, wopendir, wstat...) that uses UTF-16 encoded 16-bit wchar-based strings, so for proper handling of Unicode strings (filenames, cmdline arguments) we should use these APIs

my old proposal: http://haskell.org/haskellwiki/Library/IO

--
Best regards,
Bulat mailto:Bulat.Ziganshin@gmail.com
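To make the difference between the two families concrete, here is a minimal pure-Haskell sketch. It is not the conversion Windows actually performs: it assumes a plain Latin-1 code page (real ANSI code pages such as CP1252 or Shift-JIS behave differently in detail), and the names toLatin1Lossy and toUtf16 are hypothetical helpers made up for illustration.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8, Word16)

-- | Stand-in for an A-family style conversion, assuming a Latin-1 code page:
-- any character outside the code page collapses to '?'.
toLatin1Lossy :: String -> [Word8]
toLatin1Lossy = map conv
  where
    conv c
      | ord c < 256 = fromIntegral (ord c)
      | otherwise   = fromIntegral (ord '?')

-- | UTF-16 code units, the representation the W-family expects.
-- Characters above U+FFFF become a surrogate pair.
toUtf16 :: String -> [Word16]
toUtf16 = concatMap enc
  where
    enc c
      | n < 0x10000 = [fromIntegral n]
      | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))
                      , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]
      where
        n = ord c
        m = n - 0x10000
```

With these definitions an accented Latin name survives toLatin1Lossy while a Japanese name turns into a run of '?' bytes, which matches the symptom in the original report. toUtf16 keeps both.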

Simon Marlow

3:30 p.m.

New subject: Unicode workaround for getDirectoryContents under Windows?

On 16/06/2009 14:56, Bulat Ziganshin wrote:

...

Hello Simon,

Tuesday, June 16, 2009, 5:02:43 PM, you wrote:

...
Also currently broken:

...
* calling removeFile on a FilePath you get from getDirectoryContents, amongst other System.Directory operations

...
Fixing getDirectoryContents will fix these.

no. removeFile like anything else also uses ACP-based api

What code are you looking at? Here is System.Directory.removeFile:

removeFile :: FilePath -> IO ()
removeFile path =
#if mingw32_HOST_OS
  System.Win32.deleteFile path
#else
  System.Posix.removeLink path
#endif

and System.Win32.deleteFile:

deleteFile :: String -> IO ()
deleteFile name =
  withTString name $ \ c_name ->
    failIfFalse_ "DeleteFile" $ c_DeleteFile c_name

foreign import stdcall unsafe "windows.h DeleteFileW"
  c_DeleteFile :: LPCTSTR -> IO Bool

note it's calling DeleteFileW, and using wide-char strings.

...

Windows libraries emulates POSIX API (open, opendir, stat and so on) by translating these (char-based) calls into A-family. GHC libs are written Unix way, so these are effectively bundled to A-family of Win API

Actually we use a mixture of CRT functions and native Windows API, gradually moving in the direction of the latter. Cheers, Simon

Bulat Ziganshin

3:44 p.m.

New subject: Re[2]: Re: Unicode workaround for getDirectoryContents under Windows?

Hello Simon, Tuesday, June 16, 2009, 7:30:55 PM, you wrote:

...

Actually we use a mixture of CRT functions and native Windows API, gradually moving in the direction of the latter.

so file-related APIs are already unpredictable, and will remain in this state for unknown amount of ghc versions -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Simon Marlow

3:54 p.m.

New subject: Unicode workaround for getDirectoryContents under Windows?

On 16/06/2009 16:44, Bulat Ziganshin wrote:

...

Hello Simon,

Tuesday, June 16, 2009, 7:30:55 PM, you wrote:

...
Actually we use a mixture of CRT functions and native Windows API, gradually moving in the direction of the latter.

so file-related APIs are already unpredictable, and will remain in this state for unknown amount of ghc versions

Sometimes fixing everything at the same time is too hard :-) In fact there's not a lot left to convert in System.Directory, as you'll see if you look at the code. Feel like helping? Cheers, Simon

Bulat Ziganshin

4:06 p.m.

New subject: Re[2]: Re: Unicode workaround for getDirectoryContents under Windows?

Hello Simon, Tuesday, June 16, 2009, 7:54:02 PM, you wrote:

...

In fact there's not a lot left to convert in System.Directory, as you'll see if you look at the code. Feel like helping?

these functions used there are ACP-only:

c_stat
c_chmod
System.Win32.getFullPathName
c_SearchPath
c_SHGetFolderPath

plus maybe some more functions from the System.Win32 package - i haven't looked into it

--
Best regards,
Bulat mailto:Bulat.Ziganshin@gmail.com

Simon Marlow

17 Jun 2009

8:01 a.m.

New subject: Unicode workaround for getDirectoryContents under Windows?

On 16/06/2009 17:06, Bulat Ziganshin wrote:

...

Hello Simon,

Tuesday, June 16, 2009, 7:54:02 PM, you wrote:

...
In fact there's not a lot left to convert in System.Directory, as you'll see if you look at the code. Feel like helping?

these functions used there are ACP-only:

c_stat c_chmod System.Win32.getFullPathName c_SearchPath c_SHGetFolderPath

Yes, except for getFullPathName:

foreign import stdcall unsafe "GetFullPathNameW"
  c_GetFullPathName :: LPCTSTR -> DWORD -> LPTSTR -> Ptr LPTSTR -> IO DWORD

...

plus may be some more functions from System.Win32 package - i don't looked into it

System.Win32 is using the wide-char APIs exclusively (ok, I haven't checked, but I don't know of any System.Win32 functions still using narrow strings). So as you can see, there's not much left to do. I'll fix openFile. Cheers, Simon

Bulat Ziganshin

8:43 a.m.

New subject: Re[2]: Re: Unicode workaround for getDirectoryContents under Windows?

Hello Simon, Wednesday, June 17, 2009, 12:01:11 PM, you wrote:

...

foreign import stdcall unsafe "GetFullPathNameW" c_GetFullPathName :: LPCTSTR -> DWORD -> LPTSTR -> Ptr LPTSTR -> IO DWORD

you are right, i was troubled by unused GetFullPathNameA import in System.Directory:

#if defined(mingw32_HOST_OS)
foreign import stdcall unsafe "GetFullPathNameA"
  c_GetFullPathName :: CString -> CInt -> CString -> Ptr CString -> IO CInt
#else
foreign import ccall unsafe "realpath"
  c_realpath :: CString -> CString -> IO CString
#endif

...

So as you can see, there's not much left to do. I'll fix openFile.

c_stat is widely used here and there. it may be that half of System.Directory functions is broken due to direct or indirect calls to this function -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin

16 Jun 2009

8:19 p.m.

New subject: Re[2]: Re: Unicode workaround for getDirectoryContents under Windows?

Hello Simon, Tuesday, June 16, 2009, 5:02:43 PM, you wrote:

...

I don't know how getArgs fits in here - should we be decoding argv using the ACP?

myGetArgs = do
  alloca $ \p_argc -> do
    p_argv_w <- commandLineToArgvW getCommandLineW p_argc
    argc     <- peek p_argc
    argv_w   <- peekArray (i argc) p_argv_w
    mapM peekTString argv_w >>= return.tail

foreign import stdcall unsafe "windows.h GetCommandLineW"
  getCommandLineW :: LPTSTR

foreign import stdcall unsafe "windows.h CommandLineToArgvW"
  commandLineToArgvW :: LPCWSTR -> Ptr CInt -> IO (Ptr LPWSTR)

--
Best regards,
Bulat mailto:Bulat.Ziganshin@gmail.com

Simon Marlow

17 Jun 2009

7:55 a.m.

New subject: Unicode workaround for getDirectoryContents under Windows?

On 16/06/2009 21:19, Bulat Ziganshin wrote:

...

Hello Simon,

Tuesday, June 16, 2009, 5:02:43 PM, you wrote:

...
I don't know how getArgs fits in here - should we be decoding argv using the ACP?

myGetArgs = do
  alloca $ \p_argc -> do
    p_argv_w <- commandLineToArgvW getCommandLineW p_argc
    argc     <- peek p_argc
    argv_w   <- peekArray (i argc) p_argv_w
    mapM peekTString argv_w >>= return.tail

foreign import stdcall unsafe "windows.h GetCommandLineW" getCommandLineW :: LPTSTR

foreign import stdcall unsafe "windows.h CommandLineToArgvW" commandLineToArgvW :: LPCWSTR -> Ptr CInt -> IO (Ptr LPWSTR)

Right, so getArgs is already fine. Cheers, Simon

Bulat Ziganshin

8:38 a.m.

New subject: Re[2]: Re: Unicode workaround for getDirectoryContents under Windows?

Hello Simon, Wednesday, June 17, 2009, 11:55:15 AM, you wrote:

...

Right, so getArgs is already fine.

it's what i've found in Jun15 sources:

#ifdef __GLASGOW_HASKELL__
getArgs :: IO [String]
getArgs =
  alloca $ \ p_argc ->
  alloca $ \ p_argv -> do
    getProgArgv p_argc p_argv
    p <- fromIntegral `liftM` peek p_argc
    argv <- peek p_argv
    peekArray (p - 1) (advancePtr argv 1) >>= mapM peekCString

foreign import ccall unsafe "getProgArgv"
  getProgArgv :: Ptr CInt -> Ptr (Ptr CString) -> IO ()

it uses peekCString so by any means it cannot produce unicode chars

--
Best regards,
Bulat mailto:Bulat.Ziganshin@gmail.com

Simon Marlow

8:46 a.m.

New subject: Unicode workaround for getDirectoryContents under Windows?

On 17/06/2009 09:38, Bulat Ziganshin wrote:

...

Hello Simon,

Wednesday, June 17, 2009, 11:55:15 AM, you wrote:

...
Right, so getArgs is already fine.

it's what i've found in Jun15 sources:

#ifdef __GLASGOW_HASKELL__
getArgs :: IO [String]
getArgs =
  alloca $ \ p_argc ->
  alloca $ \ p_argv -> do
    getProgArgv p_argc p_argv
    p <- fromIntegral `liftM` peek p_argc
    argv <- peek p_argv
    peekArray (p - 1) (advancePtr argv 1) >>= mapM peekCString

foreign import ccall unsafe "getProgArgv" getProgArgv :: Ptr CInt -> Ptr (Ptr CString) -> IO ()

it uses peekCString so by any means it cannot produce unicode chars

I see, so you were previously quoting code from some other source. Where did the GetCommandLineW version come from? Do you know of any issues that would prevent us using it in GHC? Cheers, Simon

Bulat Ziganshin

8:52 a.m.

New subject: Re[2]: Re: Unicode workaround for getDirectoryContents under Windows?

Hello Simon, Wednesday, June 17, 2009, 12:46:49 PM, you wrote:

...

I see, so you were previously quoting code from some other source.

from my program

...

Where did the GetCommandLineW version come from? Do you know of any issues that would prevent us using it in GHC?

it should be as fine as any other *W calls. the only thing is that we may prefer to include in into Win32 package as other routines and then call from there -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Simon Marlow

18 Jun 2009

9:22 a.m.

New subject: Unicode workaround for getDirectoryContents under Windows?

On 16/06/2009 21:19, Bulat Ziganshin wrote:

...

Hello Simon,

Tuesday, June 16, 2009, 5:02:43 PM, you wrote:

...
I don't know how getArgs fits in here - should we be decoding argv using the ACP?

myGetArgs = do
  alloca $ \p_argc -> do
    p_argv_w <- commandLineToArgvW getCommandLineW p_argc
    argc     <- peek p_argc
    argv_w   <- peekArray (i argc) p_argv_w
    mapM peekTString argv_w >>= return.tail

foreign import stdcall unsafe "windows.h GetCommandLineW" getCommandLineW :: LPTSTR

foreign import stdcall unsafe "windows.h CommandLineToArgvW" commandLineToArgvW :: LPCWSTR -> Ptr CInt -> IO (Ptr LPWSTR)

Presumably we'd also have to remove the +RTS ... -RTS in Haskell if we did this, correct? Cheers, Simon
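As a rough illustration of what removing the RTS options in Haskell could involve, here is a simplified sketch. It only models the basic +RTS ... -RTS bracketing plus --RTS as an "everything after this belongs to the program" marker. The real RTS does more than this, and stripRtsArgs is a hypothetical name, not an existing library function.

```haskell
-- | Drop RTS options from a raw argument list (simplified sketch).
stripRtsArgs :: [String] -> [String]
stripRtsArgs = outside
  where
    -- Outside an RTS section: keep arguments until +RTS opens a section.
    outside []             = []
    outside ("--RTS" : xs) = xs          -- the rest goes to the program verbatim
    outside ("+RTS" : xs)  = inside xs
    outside (x : xs)       = x : outside xs

    -- Inside an RTS section: discard arguments until -RTS closes it.
    inside []              = []
    inside ("-RTS" : xs)   = outside xs
    inside ("--RTS" : xs)  = xs
    inside (_ : xs)        = inside xs
```

A function like this would run on the list obtained from CommandLineToArgvW, after dropping the program name.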

Bulat Ziganshin

9:33 a.m.

New subject: Re[2]: Unicode workaround for getDirectoryContents under Windows?

Hello Simon, Thursday, June 18, 2009, 1:22:30 PM, you wrote:

...

...
myGetArgs = do

...

Presumably we'd also have to remove the +RTS ... -RTS in Haskell if we did this, correct?

yes, it's long-standing in my own to-do list :) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Yitzchak Gale

17 Jun 2009

12:21 p.m.

New subject: Unicode workaround for getDirectoryContents under Windows?

I wrote:

...

...
I think the most important use cases that should not break are:

o open/read/write a FilePath from getArgs o open/read/write a FilePath from getDirectoryContents

Simon Marlow wrote:

...

The following cases are currently broken:

* Calling openFile on a literal Unicode FilePath (note, not ACP-encoded, just Unicode).

* Reading a Unicode FilePath from a text file and then calling openFile on it

I propose to fix these (on Windows). It will mean that your second case above will be broken, until someone fixes getDirectoryContents.

Why only on Windows?

...

I don't know how getArgs fits in here - should we be decoding argv using the ACP?

And why not also on Unix? On any platform, the expected behavior should be that you type a file path at the command line, read it using getArgs, and open the file using that. For comparison, Python works that way, even though the variable is called "argv" there.

The current behavior on Unix of returning, say, the bytes of a UTF-8 encoding in a String as if they were individual Unicode characters, is queer. Given your fantastic work so far to rid System.IO of those kinds of oddities, perhaps now is the time to finish the job.

If you think we really need to provide access to the raw argv bytes, we could add another platform-independent function that does that.

Thanks,
Yitz

Simon Marlow

12:46 p.m.

New subject: Unicode workaround for getDirectoryContents under Windows?

On 17/06/2009 13:21, Yitzchak Gale wrote:

...

I wrote:

...
...
I think the most important use cases that should not break are:

o open/read/write a FilePath from getArgs o open/read/write a FilePath from getDirectoryContents

Simon Marlow wrote:

...
The following cases are currently broken:

* Calling openFile on a literal Unicode FilePath (note, not ACP-encoded, just Unicode).

* Reading a Unicode FilePath from a text file and then calling openFile on it

I propose to fix these (on Windows). It will mean that your second case above will be broken, until someone fixes getDirectoryContents.

Why only on Windows?

Just because it's a lot easier on Windows - all the OS APIs take Unicode file paths, so it's obvious what to do. In contrast on Unix I don't have a clear idea of how to proceed.

On Unix, all file APIs take [Word8] rather than [Char]. By convention, the [Word8] is usually assumed to be a string in the locale encoding, but that's only a user-space convention. So we should probably be converting from FilePath to [Word8] by encoding using the current locale. This raises various complications (what about encoding errors, and what if encode.decode is not the identity due to normalisation, etc.).

But you don't have to wait for me to fix this stuff (I'm feeling a bit Unicoded-out right now :) If someone else has a good understanding of what needs done, please wade in.

...

...
I don't know how getArgs fits in here - should we be decoding argv using the ACP?

And why not also on Unix? On any platform, the expected behavior should be that you type a file path at the command line, read it using getArgs, and open the file using that.

Right. On Unix it works at the moment because we neither decode argv nor encode FilePaths, so the bytes get passed through unchanged. Same with getDirectoryContents. But I agree it's broken and needs to be fixed. Cheers, Simon
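The pass-through behaviour described here can be sketched in a few lines. This is only a model of "one Char per byte, no decoding", not the actual library code, and bytesAsChars, charsAsBytes and sampleName are names invented for the example.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | What a non-decoding read amounts to: one Char per byte.
bytesAsChars :: [Word8] -> String
bytesAsChars = map (chr . fromIntegral)

-- | The reverse direction, keeping only the low 8 bits of each Char.
charsAsBytes :: String -> [Word8]
charsAsBytes = map (fromIntegral . ord)

-- | UTF-8 bytes of the file name "é.txt" (U+00E9 encodes as C3 A9).
sampleName :: [Word8]
sampleName = [0xC3, 0xA9, 0x2E, 0x74, 0x78, 0x74]
```

Round-tripping sampleName through both functions gives back the same bytes, which is why opening a file named by getArgs or getDirectoryContents works. The intermediate String has six Chars rather than five, though, and begins with "Ã©" instead of "é", so it is wrong as text.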

Ketil Malde

1:36 p.m.

New subject: Unicode workaround for getDirectoryContents under Windows?

Simon Marlow writes:

...

...
Why only on Windows?

...

Just because it's a lot easier on Windows - all the OS APIs take Unicode file paths, so it's obvious what to do. In contrast on Unix I don't have a clear idea of how to proceed.

...

On Unix, all file APIs take [Word8] rather than [Char]. By convention, the [Word8] is usually assumed to be a string in the locale encoding, but that's only a user-space convention.

If we want to incorporate a translation layer, I think it's fair to only support UTF-8 (ignoring locales), but provide a workaround for invalid characters.

...

From http://en.wikipedia.org/wiki/UTF-8:

| Therefore many modern UTF-8 converters translate errors to
| something "safe". Only one byte is changed into the error
| replacement and parsing starts again at the next byte, otherwise
| concatenating strings could change good characters into
| errors. Popular replacements for each byte are:
|
| * nothing (the bytes vanish)
| * '?' or '�'
| * The replacement character (U+FFFD)
| * The byte from ISO-8859-1 or CP1252
| * An invalid Unicode code point, usually U+DCxx where xx is the byte's value

How about using the last one? This would allow 'readFile' to work on FilePaths provided by 'getDirectoryContents', while allowing for real Unicode string literals.

-k
--
If I haven't seen further, it is by standing in the footprints of giants
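A minimal sketch of the U+DCxx idea, to show that it round-trips arbitrary bytes. It assumes UTF-8 only (no locale handling), and decodeEscaping and encodeEscaping are hypothetical names, not functions from any existing library.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Decode UTF-8; each byte that is not part of a valid sequence becomes U+DCxx.
decodeEscaping :: [Word8] -> String
decodeEscaping [] = []
decodeEscaping (b : bs)
  | b < 0x80 = chr (fromIntegral b) : decodeEscaping bs
  | otherwise =
      case multi of
        Just (c, rest) -> c : decodeEscaping rest
        Nothing        -> chr (0xDC00 + fromIntegral b) : decodeEscaping bs
  where
    multi
      | b .&. 0xE0 == 0xC0 = sequenceOf 1 (fromIntegral (b .&. 0x1F)) 0x80
      | b .&. 0xF0 == 0xE0 = sequenceOf 2 (fromIntegral (b .&. 0x0F)) 0x800
      | b .&. 0xF8 == 0xF0 = sequenceOf 3 (fromIntegral (b .&. 0x07)) 0x10000
      | otherwise          = Nothing
    -- Read n continuation bytes; reject overlong forms, surrogates and
    -- code points above U+10FFFF.
    sequenceOf :: Int -> Int -> Int -> Maybe (Char, [Word8])
    sequenceOf n lead minCp
      | length conts == n && all isCont conts && valid = Just (chr cp, rest)
      | otherwise = Nothing
      where
        (conts, rest) = splitAt n bs
        isCont x = x .&. 0xC0 == 0x80
        cp = foldl (\acc x -> (acc `shiftL` 6) .|. fromIntegral (x .&. 0x3F)) lead conts
        valid = cp >= minCp && cp <= 0x10FFFF && not (cp >= 0xD800 && cp <= 0xDFFF)

-- | Encode to UTF-8, turning U+DC80..U+DCFF back into the original raw byte.
encodeEscaping :: String -> [Word8]
encodeEscaping = concatMap enc
  where
    enc c
      | n >= 0xDC80 && n <= 0xDCFF = [fromIntegral (n - 0xDC00)]
      | n < 0x80    = [fromIntegral n]
      | n < 0x800   = [0xC0 .|. hi 6, cont 0]
      | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
      | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
      where
        n = ord c
        hi s = fromIntegral (n `shiftR` s)
        cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)
```

Because the decoder never produces U+DC80..U+DCFF from valid input, encodeEscaping (decodeEscaping bs) gives back bs for any byte list, so a name obtained from the file system can always be handed back unchanged. The price is that such Strings contain lone surrogates, which other code may not expect.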

Yitzchak Gale

2:07 p.m.

New subject: Unicode workaround for getDirectoryContents under Windows?

Ketil Malde wrote:

...

If we want to incorporate a translation layer, I think it's fair to only support UTF-8 (ignoring locales), but provide a workaround for invalid characters.

I disagree. Shells and GUI dialogs use the current locale. I think most other modern programming languages do too, but correct me if I am wrong. Still, your ideas about dealing with decoding errors sound useful. Regards, Yitz

Yitzchak Gale

2:03 p.m.

New subject: Unicode workaround for getDirectoryContents under Windows?

Simon Marlow wrote:

...

...
...
The following cases are currently broken... I propose to fix these (on Windows). It will mean that your second case above will be broken, until someone fixes getDirectoryContents... ...it's a lot easier on Windows... on Unix I don't have a clear idea of how to proceed... If someone else has a good understanding of what needs done, please wade in. I don't know how getArgs fits in here... I agree it's broken and needs to be fixed.

OK, would you like me to reflect this discussion in tickets? Let's see, so far we have #3300, I don't see anything else. Do you want two tickets, one each for Windows/Unix? Or four, separating the FilePath and getArgs issues?

...

On Unix, all file APIs take [Word8]... So we should probably be converting from FilePath to [Word8] by encoding using the current locale... what about encoding errors,

Where relevant, we should emulate what the common shells do. In general, I don't see why they should be different than any other file operation error.

...

and what if encode.decode is not the identity due to normalisation

Well, is it common for people using typical input methods and common shells to create file paths containing text that decodes to non-normalized Unicode?

I'm guessing not. If that's the case, then we don't really have to worry about it. People who went out of their way to create a weird file name will have the same troubles they have always had with that in Unix.

But perhaps a better solution would be to make the underlying type of FilePath platform-dependent - e.g., String on Windows and [Word8] on Unix - and let it support platform-independent methods such as to/from String, to/from Bytes, setEncoding (defaulting to the current locale). That way, pass-through file paths will always work flawlessly on any platform, and applications have complete flexibility to deal with any other scenario however they choose. It's a breaking change though.

Thanks,
Yitz

Simon Marlow

2:18 p.m.

New subject: Unicode workaround for getDirectoryContents under Windows?

On 17/06/2009 15:03, Yitzchak Gale wrote:

...

Simon Marlow wrote:

...
...
...
The following cases are currently broken... I propose to fix these (on Windows). It will mean that your second case above will be broken, until someone fixes getDirectoryContents... ...it's a lot easier on Windows... on Unix I don't have a clear idea of how to proceed... If someone else has a good understanding of what needs done, please wade in. I don't know how getArgs fits in here... I agree it's broken and needs to be fixed.

OK, would you like me to reflect this discussion in tickets? Let's see, so far we have #3300, I don't see anything else.

Do you want two tickets, one each for Windows/Unix? Or four, separating the FilePath and getArgs issues?

One for each issue is usually better, so four. Thanks!

...

...
On Unix, all file APIs take [Word8]... So we should probably be converting from FilePath to [Word8] by encoding using the current locale... what about encoding errors,

Where relevant, we should emulate what the common shells do. In general, I don't see why they should be different than any other file operation error.

...
and what if encode.decode is not the identity due to normalisation

Well, is it common for people using typical input methods and common shells to create file paths containing text that decodes to non-normalized Unicode?

I'm guessing not. If that's the case, then we don't really have to worry about it. People who went out of their way to create a weird file name will have the same troubles they have always had with that in Unix.

But perhaps a better solution would be to make the underlying type of FilePath platform-dependent - e.g., String on Windows and [Word8] on Unix - and let it support platform-independent methods such as to/from String, to/from Bytes, setEncoding (defaulting to the current locale). That way, pass-through file paths will always work flawlessly on any platform, and applications have complete flexibility to deal with any other scenario however they choose. It's a breaking change though.

Yes, we could do a lot better if FilePath was an abstract type, but sadly it is not, and we can't change that without breaking Haskell 98 compatibility, not to mention tons of existing code.

Cheers,
Simon

Yitzchak Gale

18 Jun 2009

1:47 a.m.

New subject: Unicode workaround for getDirectoryContents under Windows?

I wrote:

...

...
OK, would you like me to reflect this discussion in tickets? Let's see, so far we have #3300, I don't see anything else.

Do you want two tickets, one each for Windows/Unix? Or four, separating the FilePath and getArgs issues?

Simon Marlow wrote:

...

One for each issue is usually better, so four.

OK, they are: #3300, #3307, #3308, #3309. Regards, Yitz

Duncan Coutts

2:18 p.m.

New subject: Unicode workaround for getDirectoryContents under Windows?

On Thu, 2009-06-18 at 04:47 +0300, Yitzchak Gale wrote:

...

I wrote:

...
...
OK, would you like me to reflect this discussion in tickets? Let's see, so far we have #3300, I don't see anything else.

Do you want two tickets, one each for Windows/Unix? Or four, separating the FilePath and getArgs issues?

Simon Marlow wrote:

...
One for each issue is usually better, so four.

OK, they are: #3300, #3307, #3308, #3309.

Could we please make clear in those tickets that they only affect Windows.

I do hope we are only proposing that FilePath be interpreted as Unicode on Windows and OS X. It would break things to decode to Unicode on Unix systems. On Unix filepaths really are strings of bytes, not an encoding of Unicode code points.

It's true that this is not reflected accurately in the type FilePath = String. The FilePath should be an opaque type that allows decoding into a human readable Unicode String.

I wonder how much code would actually break if FilePath became an opaque type, e.g. if we make it an instance of IsString. It only need change in System.IO and System.FilePath, not in the old H98 modules.

Duncan
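For what an opaque path type with an IsString instance might look like, here is a small sketch under stated assumptions. RawPath and displayPath are hypothetical names, the representation is the Unix-style byte string, and to stay short the instance only accepts ASCII literals where a real design would encode using the locale.

```haskell
import Data.Char (chr, ord)
import Data.String (IsString (..))
import Data.Word (Word8)

-- | Hypothetical opaque path type holding raw bytes, as on Unix.
newtype RawPath = RawPath [Word8]
  deriving (Eq, Show)

-- | Lets string literals be used as paths when OverloadedStrings is on.
-- This sketch rejects non-ASCII characters instead of encoding them.
instance IsString RawPath where
  fromString = RawPath . map toByte
    where
      toByte c
        | ord c < 0x80 = fromIntegral (ord c)
        | otherwise    = error "RawPath sketch: non-ASCII literal"

-- | Human-readable rendering; every non-ASCII byte is shown as U+FFFD.
displayPath :: RawPath -> String
displayPath (RawPath bs) = map render bs
  where
    render b
      | b < 0x80  = chr (fromIntegral b)
      | otherwise = '\xFFFD'
```

With the constructor hidden behind a module boundary, existing code that passes literals would keep compiling under OverloadedStrings, while code that pattern-matches on the String would break. That is the compatibility cost being weighed in this thread.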

Bulat Ziganshin

16 Jun 2009

12:54 p.m.

New subject: Re[2]: Re: Unicode workaround for getDirectoryContents under Windows?

Hello Simon, Tuesday, June 16, 2009, 4:34:29 PM, you wrote:

...

Thanks for reminding me that openFile is also broken. It's easily fixed, so I'll look into that.

i fear that it will leave GHC libs in an inconsistent state that can drive users mad. now at least there are some rules of brokenness. when some functions are unicode-aware and some ansi codepaged, and this may change in every version, this "unicode" support will become completely useless. it will be like the floating Base situation, when it's impossible to write programs against Base since it's different each time

also, i think that the best way to fix windows compatibility is to provide something like this:

#if WINDOWS
type CWFilePath   = LPCTSTR      -- filename in C land
type CWFileOffset = Int64        -- filesize or filepos in C land
withCWFilePath = withTString     -- FilePath->CWFilePath conversion
peekCWFilePath = peekTString     -- CWFilePath->FilePath conversion
#else
type CWFilePath   = CString
type CWFileOffset = COff
withCWFilePath = withCString
peekCWFilePath = peekCString
#endif

and then systematically rewrite all string-related OS API calls using these definitions

how much sense does it make to have openFile and getDirContents unicode-aware, if deleteFile and even getFileStat aren't unicode-aware?

i've attached my own internal module that does this job for my own program - just for reference

--
Best regards,
Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin

14 Jun 2009

6:26 a.m.

Hello Shu-yu, Sunday, June 14, 2009, 7:41:46 AM, you wrote:

...

It seems like getDirectoryContents applies codepage conversion based

it's not a bug, but old-fashioned architecture of entire file apis

you may find my Win32Files.hs module useful - it adopts UTF-16 versions of file operations

http://downloads.sourceforge.net/freearc/FreeArc-0.51-sources.tar.bz2

--
Best regards,
Bulat mailto:Bulat.Ziganshin@gmail.com

