Unicode workaround for getDirectoryContents under Windows?
Hello all,
It seems like getDirectoryContents applies codepage conversion based on the default program locale under Windows. This means that if my default codepage is some kind of Latin, Asian glyphs get returned as '?' in the filename. By '?' I don't mean that the font lacks the glyph and renders it as '?'; I mean that 'show (head (getDirectoryContents "C:\\Music"))' returns something that looks like "?? ????".
This is a problem as I can't get the filenames of my music directory, some of which are in Japanese and Chinese, some of which have accents. If I change the default codepage to Japanese, say, then I get the Japanese filenames in Shift-JIS and I lose all the accented letters.
I have filed this as a bug already, but is there a workaround in the meantime (I don't know the Win32 API, but didn't see anything that looked like it would help under System.Win32 anyway) that lets me get the list of files in a directory, encoded in some kind of Unicode?
Cheers,
-- shu
On Sat, Jun 13, 2009 at 8:41 PM, Shu-yu Guo wrote:
Hello all,
It seems like getDirectoryContents applies codepage conversion based on the default program locale under Windows. This means that if my default codepage is some kind of Latin, Asian glyphs get returned as '?' in the filename. By '?' I don't mean that the font lacks the glyph and renders it as '?'; I mean that 'show (head (getDirectoryContents "C:\\Music"))' returns something that looks like "?? ????".
This is a problem as I can't get the filenames of my music directory, some of which are in Japanese and Chinese, some of which have accents. If I change the default codepage to Japanese, say, then I get the Japanese filenames in Shift-JIS and I lose all the accented letters.
I have filed this as a bug already, but is there a workaround in the meantime (I don't know the Win32 API, but didn't see anything that looked like it would help under System.Win32 anyway) that lets me get the list of files in a directory, encoded in some kind of Unicode?
Try taking a look at the code in the following module, which uses FFI to access the Unicode-aware Win32 APIs:
http://code.haskell.org/haskeline/System/Console/Haskeline/Directory.hsc
Hope that helps,
-Judah
On 14/06/2009 05:56, Judah Jacobson wrote:
Try taking a look at the code in the following module, which uses FFI to access the Unicode-aware Win32 APIs:
http://code.haskell.org/haskeline/System/Console/Haskeline/Directory.hsc
Care to submit a patch to put this in System.Directory, or better still put the relevant functionality in System.Win32 and use it in System.Directory? Cheers, Simon
Hello Simon, Tuesday, June 16, 2009, 3:30:31 PM, you wrote:
Care to submit a patch to put this in System.Directory, or better still put the relevant functionality in System.Win32 and use it in System.Directory?
Simon, it will somewhat break openFile. let's see. there are 3 types of filenames:
1) english (latin-1) only
2) including local (ansi code page) chars
3) including any other unicode chars
now getDirectoryContents returns ACP (ansi code page) names, so openFile works for files of types 1) and 2).
with such a change getDirectoryContents will return correct unicode names, so openFile will work only with names in the first group.
the right way is to fix ALL string-related calls in System.IO, System.Posix.Internals, System.Environment
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
On 16/06/2009 12:42, Bulat Ziganshin wrote:
Hello Simon,
Tuesday, June 16, 2009, 3:30:31 PM, you wrote:
Care to submit a patch to put this in System.Directory, or better still put the relevant functionality in System.Win32 and use it in System.Directory?
Simon, it will somewhat break openFile. let's see. there are 3 types of filenames:
1) english (latin-1) only
2) including local (ansi code page) chars
3) including any other unicode chars
now getDirectoryContents returns ACP (ansi code page) names, so openFile works for files of types 1) and 2).
with such a change getDirectoryContents will return correct unicode names, so openFile will work only with names in the first group.
the right way is to fix ALL string-related calls in System.IO, System.Posix.Internals, System.Environment
You're right in that we really ought to fix everything. However, I'm happy to just fix some of these things, even if it introduces some inconsistencies in the meantime. We already have much of System.Directory working with Unicode FilePaths, so there are already inconsistencies here. Thanks for reminding me that openFile is also broken. It's easily fixed, so I'll look into that. Cheers, Simon
Simon Marlow wrote:
Care to submit a patch to put this in System.Directory, or better still put the relevant functionality in System.Win32 and use it in System.Directory?
Bulat Ziganshin wrote:
now getDirectoryContents return ACP (ansi code page) names so openFile works for files 1) and 2). With such change getDirectoryContents will return correct unicode names, so openFile will work only with names in first group. The right way is to fix ALL string-related calls in System.IO, System.Posix.Internals, System.Environment.
You're right in that we really ought to fix everything. However, I'm happy to just fix some of these things, even if it introduces some inconsistencies in the meantime. We already have much of System.Directory working with Unicode FilePaths, so there are already inconsistencies here.
+1 for integrating Unicode file paths. Thanks, Bulat!
I think the most important use cases that should not break are:
o open/read/write a FilePath from getArgs
o open/read/write a FilePath from getDirectoryContents
There's not much we can do about non-Latin-1 ACP file paths hard coded in Strings. I hope there aren't too many of those in the wild.
Regards, Yitz
On 16/06/2009 13:46, Yitzchak Gale wrote:
Simon Marlow wrote:
Care to submit a patch to put this in System.Directory, or better still put the relevant functionality in System.Win32 and use it in System.Directory?
Bulat Ziganshin wrote:
now getDirectoryContents return ACP (ansi code page) names so openFile works for files 1) and 2). With such change getDirectoryContents will return correct unicode names, so openFile will work only with names in first group. The right way is to fix ALL string-related calls in System.IO, System.Posix.Internals, System.Environment.
You're right in that we really ought to fix everything. However, I'm happy to just fix some of these things, even if it introduces some inconsistencies in the meantime. We already have much of System.Directory working with Unicode FilePaths, so there are already inconsistencies here.
+1 for integrating Unicode file paths. Thanks, Bulat!
Excuse my ignorance, but... what Unicode file paths?
I think the most important use cases that should not break are:
o open/read/write a FilePath from getArgs
o open/read/write a FilePath from getDirectoryContents
There's not much we can do about non-Latin-1 ACP file paths hard coded in Strings. I hope there aren't too many of those in the wild.
The following cases are currently broken:
* Calling openFile on a literal Unicode FilePath (note, not ACP-encoded, just Unicode).
* Reading a Unicode FilePath from a text file and then calling openFile on it
I propose to fix these (on Windows). It will mean that your second case above will be broken, until someone fixes getDirectoryContents.
Also currently broken:
* calling removeFile on a FilePath you get from getDirectoryContents, amongst other System.Directory operations
Fixing getDirectoryContents will fix these.
I don't know how getArgs fits in here - should we be decoding argv using the ACP?
Cheers, Simon
Hello Simon, Tuesday, June 16, 2009, 5:02:43 PM, you wrote:
Also currently broken:
* calling removeFile on a FilePath you get from getDirectoryContents, amongst other System.Directory operations
Fixing getDirectoryContents will fix these.
no. removeFile like anything else also uses ACP-based api
I don't know how getArgs fits in here - should we be decoding argv using the ACP?
well, the whole story:
windows internally uses Unicode for handling strings. externally, it provides 2 API families:
1) A-family (such as CreateFileA) uses 8-bit char-based strings. these strings are encoded using the current locale. the first 128 chars are common to all codepages, providing the ASCII char set; the higher 128 chars are locale-specific. say, for the German locale they provide chars with umlauts, for the Russian locale cyrillic chars
2) W-family (such as CreateFileW) uses UTF-16 encoded 16-bit wchar-based strings, which are locale-independent
Windows libraries emulate the POSIX API (open, opendir, stat and so on) by translating these (char-based) calls into the A-family. GHC libs are written the Unix way, so they are effectively tied to the A-family of the Win API
Windows libraries also provide a w* variant of the POSIX API (wopen, wopendir, wstat...) that uses UTF-16 encoded 16-bit wchar-based strings, so for proper handling of Unicode strings (filenames, cmdline arguments) we should use these APIs
my old proposal: http://haskell.org/haskellwiki/Library/IO
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
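A minimal sketch of the difference between the two families, in plain Haskell rather than real Win32 calls. It assumes Latin-1 as a stand-in for the ANSI code page and models the '?' substitution reported at the top of the thread; the function names (encodeUtf16, toNarrowLossy, fromNarrow) are made up for illustration and are not part of any library.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (chr, ord)
import Data.Word (Word16, Word8)

-- | Encode a String as UTF-16 code units, the form a W-family call takes.
-- Characters above U+FFFF become a surrogate pair.
encodeUtf16 :: String -> [Word16]
encodeUtf16 = concatMap unit
  where
    unit c
      | n < 0x10000 = [fromIntegral n]
      | otherwise =
          let m = n - 0x10000
          in [ fromIntegral (0xD800 + (m `shiftR` 10))
             , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]
      where
        n = ord c

-- | Toy model of an A-family conversion. ASSUMPTION: Latin-1 plays the
-- role of the ANSI code page; anything outside it is replaced by '?'.
toNarrowLossy :: String -> [Word8]
toNarrowLossy = map narrow
  where
    narrow c
      | ord c < 0x100 = fromIntegral (ord c)
      | otherwise     = fromIntegral (ord '?')

-- | Read the narrow bytes back, one Char per byte.
fromNarrow :: [Word8] -> String
fromNarrow = map (chr . fromIntegral)
```

Passing a name with Japanese and accented characters through toNarrowLossy and back keeps the accents but turns each Japanese character into '?', while encodeUtf16 keeps every character.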
On 16/06/2009 14:56, Bulat Ziganshin wrote:
Hello Simon,
Tuesday, June 16, 2009, 5:02:43 PM, you wrote:
Also currently broken:
* calling removeFile on a FilePath you get from getDirectoryContents, amongst other System.Directory operations
Fixing getDirectoryContents will fix these.
no. removeFile like anything else also uses ACP-based api
What code are you looking at? Here is System.Directory.removeFile:

removeFile :: FilePath -> IO ()
removeFile path =
#if mingw32_HOST_OS
  System.Win32.deleteFile path
#else
  System.Posix.removeLink path
#endif

and System.Win32.deleteFile:

deleteFile :: String -> IO ()
deleteFile name =
  withTString name $ \ c_name ->
  failIfFalse_ "DeleteFile" $
    c_DeleteFile c_name
foreign import stdcall unsafe "windows.h DeleteFileW"
  c_DeleteFile :: LPCTSTR -> IO Bool

note it's calling DeleteFileW, and using wide-char strings.
Windows libraries emulates POSIX API (open, opendir, stat and so on) by translating these (char-based) calls into A-family. GHC libs are written Unix way, so these are effectively bundled to A-family of Win API
Actually we use a mixture of CRT functions and native Windows API, gradually moving in the direction of the latter. Cheers, Simon
Hello Simon, Tuesday, June 16, 2009, 7:30:55 PM, you wrote:
Actually we use a mixture of CRT functions and native Windows API, gradually moving in the direction of the latter.
so file-related APIs are already unpredictable, and will remain in this state for unknown amount of ghc versions -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
On 16/06/2009 16:44, Bulat Ziganshin wrote:
Hello Simon,
Tuesday, June 16, 2009, 7:30:55 PM, you wrote:
Actually we use a mixture of CRT functions and native Windows API, gradually moving in the direction of the latter.
so file-related APIs are already unpredictable, and will remain in this state for unknown amount of ghc versions
Sometimes fixing everything at the same time is too hard :-) In fact there's not a lot left to convert in System.Directory, as you'll see if you look at the code. Feel like helping? Cheers, Simon
Hello Simon, Tuesday, June 16, 2009, 7:54:02 PM, you wrote:
In fact there's not a lot left to convert in System.Directory, as you'll see if you look at the code. Feel like helping?
these functions used there are ACP-only:
c_stat
c_chmod
System.Win32.getFullPathName
c_SearchPath
c_SHGetFolderPath
plus maybe some more functions from the System.Win32 package - i haven't looked into it
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
On 16/06/2009 17:06, Bulat Ziganshin wrote:
Hello Simon,
Tuesday, June 16, 2009, 7:54:02 PM, you wrote:
In fact there's not a lot left to convert in System.Directory, as you'll see if you look at the code. Feel like helping?
these functions used there are ACP-only:
c_stat c_chmod System.Win32.getFullPathName c_SearchPath c_SHGetFolderPath
Yes, except for getFullPathName:

foreign import stdcall unsafe "GetFullPathNameW"
  c_GetFullPathName :: LPCTSTR -> DWORD -> LPTSTR -> Ptr LPTSTR -> IO DWORD
plus may be some more functions from System.Win32 package - i don't looked into it
System.Win32 is using the wide-char APIs exclusively (ok, I haven't checked, but I don't know of any System.Win32 functions still using narrow strings). So as you can see, there's not much left to do. I'll fix openFile. Cheers, Simon
Hello Simon, Wednesday, June 17, 2009, 12:01:11 PM, you wrote:
foreign import stdcall unsafe "GetFullPathNameW" c_GetFullPathName :: LPCTSTR -> DWORD -> LPTSTR -> Ptr LPTSTR -> IO DWORD
you are right, i was troubled by unused GetFullPathNameA import in System.Directory:

#if defined(mingw32_HOST_OS)
foreign import stdcall unsafe "GetFullPathNameA"
  c_GetFullPathName :: CString -> CInt -> CString -> Ptr CString -> IO CInt
#else
foreign import ccall unsafe "realpath"
  c_realpath :: CString -> CString -> IO CString
#endif
So as you can see, there's not much left to do. I'll fix openFile.
c_stat is widely used here and there. it may be that half of System.Directory functions is broken due to direct or indirect calls to this function -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
Hello Simon, Tuesday, June 16, 2009, 5:02:43 PM, you wrote:
I don't know how getArgs fits in here - should we be decoding argv using the ACP?
myGetArgs = do
  alloca $ \p_argc -> do
    p_argv_w <- commandLineToArgvW getCommandLineW p_argc
    argc     <- peek p_argc
    argv_w   <- peekArray (i argc) p_argv_w
    mapM peekTString argv_w >>= return.tail

foreign import stdcall unsafe "windows.h GetCommandLineW"
  getCommandLineW :: LPTSTR

foreign import stdcall unsafe "windows.h CommandLineToArgvW"
  commandLineToArgvW :: LPCWSTR -> Ptr CInt -> IO (Ptr LPWSTR)

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
On 16/06/2009 21:19, Bulat Ziganshin wrote:
Hello Simon,
Tuesday, June 16, 2009, 5:02:43 PM, you wrote:
I don't know how getArgs fits in here - should we be decoding argv using the ACP?
myGetArgs = do
  alloca $ \p_argc -> do
    p_argv_w <- commandLineToArgvW getCommandLineW p_argc
    argc     <- peek p_argc
    argv_w   <- peekArray (i argc) p_argv_w
    mapM peekTString argv_w >>= return.tail
foreign import stdcall unsafe "windows.h GetCommandLineW" getCommandLineW :: LPTSTR
foreign import stdcall unsafe "windows.h CommandLineToArgvW" commandLineToArgvW :: LPCWSTR -> Ptr CInt -> IO (Ptr LPWSTR)
Right, so getArgs is already fine. Cheers, Simon
Hello Simon, Wednesday, June 17, 2009, 11:55:15 AM, you wrote:
Right, so getArgs is already fine.
it's what i've found in Jun15 sources:

#ifdef __GLASGOW_HASKELL__
getArgs :: IO [String]
getArgs =
  alloca $ \ p_argc ->
  alloca $ \ p_argv -> do
    getProgArgv p_argc p_argv
    p    <- fromIntegral `liftM` peek p_argc
    argv <- peek p_argv
    peekArray (p - 1) (advancePtr argv 1) >>= mapM peekCString

foreign import ccall unsafe "getProgArgv"
  getProgArgv :: Ptr CInt -> Ptr (Ptr CString) -> IO ()

it uses peekCString so by any means it cannot produce unicode chars
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
On 17/06/2009 09:38, Bulat Ziganshin wrote:
Hello Simon,
Wednesday, June 17, 2009, 11:55:15 AM, you wrote:
Right, so getArgs is already fine.
it's what i've found in Jun15 sources:
#ifdef __GLASGOW_HASKELL__
getArgs :: IO [String]
getArgs =
  alloca $ \ p_argc ->
  alloca $ \ p_argv -> do
    getProgArgv p_argc p_argv
    p    <- fromIntegral `liftM` peek p_argc
    argv <- peek p_argv
    peekArray (p - 1) (advancePtr argv 1) >>= mapM peekCString
foreign import ccall unsafe "getProgArgv" getProgArgv :: Ptr CInt -> Ptr (Ptr CString) -> IO ()
it uses peekCString so by any means it cannot produce unicode chars
I see, so you were previously quoting code from some other source. Where did the GetCommandLineW version come from? Do you know of any issues that would prevent us using it in GHC? Cheers, Simon
Hello Simon, Wednesday, June 17, 2009, 12:46:49 PM, you wrote:
I see, so you were previously quoting code from some other source.
from my program
Where did the GetCommandLineW version come from? Do you know of any issues that would prevent us using it in GHC?
it should be as fine as any other *W calls. the only thing is that we may prefer to include it in the Win32 package like the other routines and then call it from there
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
On 16/06/2009 21:19, Bulat Ziganshin wrote:
Hello Simon,
Tuesday, June 16, 2009, 5:02:43 PM, you wrote:
I don't know how getArgs fits in here - should we be decoding argv using the ACP?
myGetArgs = do
  alloca $ \p_argc -> do
    p_argv_w <- commandLineToArgvW getCommandLineW p_argc
    argc     <- peek p_argc
    argv_w   <- peekArray (i argc) p_argv_w
    mapM peekTString argv_w >>= return.tail
foreign import stdcall unsafe "windows.h GetCommandLineW" getCommandLineW :: LPTSTR
foreign import stdcall unsafe "windows.h CommandLineToArgvW" commandLineToArgvW :: LPCWSTR -> Ptr CInt -> IO (Ptr LPWSTR)
Presumably we'd also have to remove the +RTS ... -RTS in Haskell if we did this, correct? Cheers, Simon
Hello Simon, Thursday, June 18, 2009, 1:22:30 PM, you wrote:
myGetArgs = do
Presumably we'd also have to remove the +RTS ... -RTS in Haskell if we did this, correct?
yes, it's long-standing in my own to-do list :) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
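A rough sketch of what removing the +RTS ... -RTS options in Haskell could look like if the argument list were taken from the raw command line. It follows the documented GHC convention that +RTS and -RTS bracket runtime options and that --RTS ends RTS option processing; the function name stripRtsArgs is hypothetical, and the real runtime system has further rules that this does not attempt to cover.

```haskell
-- | Drop everything between "+RTS" and "-RTS" from an argument list.
-- "--RTS" stops RTS processing: all remaining arguments belong to the
-- program, even if they look like RTS options.
stripRtsArgs :: [String] -> [String]
stripRtsArgs = outside
  where
    -- Currently reading ordinary program arguments.
    outside []               = []
    outside ("--RTS" : rest) = rest
    outside ("+RTS" : rest)  = inside rest
    outside (arg : rest)     = arg : outside rest

    -- Currently inside a "+RTS ..." group; these arguments are dropped.
    inside []               = []
    inside ("-RTS" : rest)  = outside rest
    inside ("--RTS" : rest) = rest
    inside (_ : rest)       = inside rest
```

In a myGetArgs-style function this would be applied to the decoded argument list after dropping the program name.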
I wrote:
I think the most important use cases that should not break are:
o open/read/write a FilePath from getArgs
o open/read/write a FilePath from getDirectoryContents
Simon Marlow wrote:
The following cases are currently broken:
* Calling openFile on a literal Unicode FilePath (note, not ACP-encoded, just Unicode).
* Reading a Unicode FilePath from a text file and then calling openFile on it
I propose to fix these (on Windows). It will mean that your second case above will be broken, until someone fixes getDirectoryContents.
Why only on Windows?
I don't know how getArgs fits in here - should we be decoding argv using the ACP?
And why not also on Unix? On any platform, the expected behavior should be that you type a file path at the command line, read it using getArgs, and open the file using that. For comparison, Python works that way, even though the variable is called "argv" there.
The current behavior on Unix of returning, say, UTF-8 encoding characters in a String as if they were individual Unicode characters, is queer. Given your fantastic work so far to rid System.IO of those kinds of oddities, perhaps now is the time to finish the job.
If you think we really need to provide access to the raw argv bytes, we could add another platform-independent function that does that.
Thanks, Yitz
On 17/06/2009 13:21, Yitzchak Gale wrote:
I wrote:
I think the most important use cases that should not break are:
o open/read/write a FilePath from getArgs
o open/read/write a FilePath from getDirectoryContents
Simon Marlow wrote:
The following cases are currently broken:
* Calling openFile on a literal Unicode FilePath (note, not ACP-encoded, just Unicode).
* Reading a Unicode FilePath from a text file and then calling openFile on it
I propose to fix these (on Windows). It will mean that your second case above will be broken, until someone fixes getDirectoryContents.
Why only on Windows?
Just because it's a lot easier on Windows - all the OS APIs take Unicode file paths, so it's obvious what to do. In contrast on Unix I don't have a clear idea of how to proceed.
On Unix, all file APIs take [Word8] rather than [Char]. By convention, the [Word8] is usually assumed to be a string in the locale encoding, but that's only a user-space convention. So we should probably be converting from FilePath to [Word8] by encoding using the current locale. This raises various complications (what about encoding errors, and what if encode.decode is not the identity due to normalisation, etc.).
But you don't have to wait for me to fix this stuff (I'm feeling a bit Unicoded-out right now :) If someone else has a good understanding of what needs done, please wade in.
I don't know how getArgs fits in here - should we be decoding argv using the ACP?
And why not also on Unix? On any platform, the expected behavior should be that you type a file path at the command line, read it using getArgs, and open the file using that.
Right. On Unix it works at the moment because we neither decode argv nor encode FilePaths, so the bytes get passed through unchanged. Same with getDirectoryContents. But I agree it's broken and needs to be fixed. Cheers, Simon
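To make the pass-through behaviour concrete, here is a small model of it in pure Haskell. It assumes, as described in this thread, that each byte becomes one Char on the way in and each Char is truncated to one byte on the way out; bytesToString, stringToBytes and eAcuteTxt are illustrative names, not library functions.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Model of reading a path with no decoding: one Char per byte.
bytesToString :: [Word8] -> String
bytesToString = map (chr . fromIntegral)

-- | Model of writing a path with no encoding: each Char is truncated
-- to its low 8 bits.
stringToBytes :: String -> [Word8]
stringToBytes = map (fromIntegral . ord)

-- | The UTF-8 bytes of the file name "é.txt" (U+00E9 is C3 A9).
eAcuteTxt :: [Word8]
eAcuteTxt = [0xC3, 0xA9, 0x2E, 0x74, 0x78, 0x74]
```

The bytes survive a round trip through String, which is why opening a name obtained from getArgs or getDirectoryContents works, but the intermediate String has six Chars rather than five and does not contain U+00E9. In the other direction, a literal such as "\x65E5" is truncated to the single byte 0xE5, which is why real Unicode literals do not name the intended file.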
Simon Marlow
Why only on Windows?
Just because it's a lot easier on Windows - all the OS APIs take Unicode file paths, so it's obvious what to do. In contrast on Unix I don't have a clear idea of how to proceed.
On Unix, all file APIs take [Word8] rather than [Char]. By convention, the [Word8] is usually assumed to be a string in the locale encoding, but that's only a user-space convention.
If we want to incorporate a translation layer, I think it's fair to only support UTF-8 (ignoring locales), but provide a workaround for invalid characters.
| Therefore many modern UTF-8 converters translate errors to
| something "safe". Only one byte is changed into the error
| replacement and parsing starts again at the next byte, otherwise
| concatenating strings could change good characters into
| errors. Popular replacements for each byte are:
|
| * nothing (the bytes vanish)
| * '?' or '�'
| * The replacement character (U+FFFD)
| * The byte from ISO-8859-1 or CP1252
| * An invalid Unicode code point, usually U+DCxx where xx is the byte's value
How about using the last one? This would allow 'readFile' to work on FilePaths provided by 'getDirectoryContents', while allowing for real Unicode string literals.
-k
-- If I haven't seen further, it is by standing in the footprints of giants
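A sketch of how the last option might work, written as a pure decoder and encoder over byte lists. This is only an illustration of the idea, not code from any library: decodeEscaping and encodeEscaping are hypothetical names, and the choice to treat UTF-8-encoded surrogates as invalid (so that the escape stays reversible) is an assumption of this sketch.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Decode UTF-8. A byte that does not start a valid sequence becomes the
-- lone surrogate U+DC00 + byte, and decoding resumes at the next byte.
decodeEscaping :: [Word8] -> String
decodeEscaping [] = []
decodeEscaping (b : bs)
  | b < 0x80               = chr (fromIntegral b) : decodeEscaping bs
  | b >= 0xC2 && b <= 0xDF = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b >= 0xE0 && b <= 0xEF = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b >= 0xF0 && b <= 0xF4 = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise              = escape
  where
    escape = chr (0xDC00 + fromIntegral b) : decodeEscaping bs
    -- n continuation bytes expected; minCp rejects overlong encodings.
    multi n lead minCp =
      let (cont, rest) = splitAt n bs
          cp = foldl (\acc c -> (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F))
                     lead cont
          ok = length cont == n
               && all (\c -> c .&. 0xC0 == 0x80) cont
               && cp >= minCp && cp <= 0x10FFFF
               && not (cp >= 0xD800 && cp <= 0xDFFF)
      in if ok then chr cp : decodeEscaping rest else escape

-- | Encode to UTF-8, turning U+DC80..U+DCFF back into the original byte.
encodeEscaping :: String -> [Word8]
encodeEscaping = concatMap enc
  where
    enc c
      | n >= 0xDC80 && n <= 0xDCFF = [fromIntegral (n - 0xDC00)]
      | n < 0x80    = [fromIntegral n]
      | n < 0x800   = [0xC0 .|. hi 6, cont 0]
      | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
      | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
      where
        n = ord c
        hi s = fromIntegral (n `shiftR` s)
        cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)
```

With this pair, encodeEscaping . decodeEscaping is the identity on any byte list, so a name read from a directory listing can be handed back to the file system unchanged, while valid UTF-8 names decode to ordinary Unicode Strings.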
Ketil Malde wrote:
If we want to incorporate a translation layer, I think it's fair to only support UTF-8 (ignoring locales), but provide a workaround for invalid characters.
I disagree. Shells and GUI dialogs use the current locale. I think most other modern programming languages do too, but correct me if I am wrong. Still, your ideas about dealing with decoding errors sound useful. Regards, Yitz
Simon Marlow wrote:
The following cases are currently broken... I propose to fix these (on Windows). It will mean that your second case above will be broken, until someone fixes getDirectoryContents... ...it's a lot easier on Windows... on Unix I don't have a clear idea of how to proceed... If someone else has a good understanding of what needs done, please wade in. I don't know how getArgs fits in here... I agree it's broken and needs to be fixed.
OK, would you like me to reflect this discussion in tickets? Let's see, so far we have #3300, I don't see anything else. Do you want two tickets, one each for Windows/Unix? Or four, separating the FilePath and getArgs issues?
On Unix, all file APIs take [Word8]... So we should probably be converting from FilePath to [Word8] by encoding using the current locale... what about encoding errors,
Where relevant, we should emulate what the common shells do. In general, I don't see why they should be different than any other file operation error.
and what if encode.decode is not the identity due to normalisation
Well, is it common for people using typical input methods and common shells to create file paths containing text that decodes to non-normalized Unicode?
I'm guessing not. If that's the case, then we don't really have to worry about it. People who went out of their way to create a weird file name will have the same troubles they have always had with that in Unix.
But perhaps a better solution would be to make the underlying type of FilePath platform-dependent - e.g., String on Windows and [Word8] on Unix - and let it support platform-independent methods such as to/from String, to/from Bytes, setEncoding (defaulting to the current locale). That way, pass-through file paths will always work flawlessly on any platform, and applications have complete flexibility to deal with any other scenario however they choose. It's a breaking change though.
Thanks, Yitz
On 17/06/2009 15:03, Yitzchak Gale wrote:
Simon Marlow wrote:
The following cases are currently broken... I propose to fix these (on Windows). It will mean that your second case above will be broken, until someone fixes getDirectoryContents... ...it's a lot easier on Windows... on Unix I don't have a clear idea of how to proceed... If someone else has a good understanding of what needs done, please wade in. I don't know how getArgs fits in here... I agree it's broken and needs to be fixed.
OK, would you like me to reflect this discussion in tickets? Let's see, so far we have #3300, I don't see anything else.
Do you want two tickets, one each for Windows/Unix? Or four, separating the FilePath and getArgs issues?
One for each issue is usually better, so four. Thanks!
On Unix, all file APIs take [Word8]... So we should probably be converting from FilePath to [Word8] by encoding using the current locale... what about encoding errors,
Where relevant, we should emulate what the common shells do. In general, I don't see why they should be different than any other file operation error.
and what if encode.decode is not the identity due to normalisation
Well, is it common for people using typical input methods and common shells to create file paths containing text that decodes to non-normalized Unicode?
I'm guessing not. If that's the case, then we don't really have to worry about it. People who went out of their way to create a weird file name will have the same troubles they have always had with that in Unix.
But perhaps a better solution would be to make the underlying type of FilePath platform-dependent - e.g., String on Windows and [Word8] on Unix - and let it support platform-independent methods such as to/from String, to/from Bytes, setEncoding (defaulting to the current locale). That way, pass-through file paths will always work flawlessly on any platform, and applications have complete flexibility to deal with any other scenario however they choose. It's a breaking change though.
Yes, we could do a lot better if FilePath was an abstract type, but sadly it is not, and we can't change that without breaking Haskell 98 compatibility, not to mention tons of existing code. Cheers, Simon
I wrote:
OK, would you like me to reflect this discussion in tickets? Let's see, so far we have #3300, I don't see anything else.
Do you want two tickets, one each for Windows/Unix? Or four, separating the FilePath and getArgs issues?
Simon Marlow wrote:
One for each issue is usually better, so four.
OK, they are: #3300, #3307, #3308, #3309. Regards, Yitz
On Thu, 2009-06-18 at 04:47 +0300, Yitzchak Gale wrote:
I wrote:
OK, would you like me to reflect this discussion in tickets? Let's see, so far we have #3300, I don't see anything else.
Do you want two tickets, one each for Windows/Unix? Or four, separating the FilePath and getArgs issues?
Simon Marlow wrote:
One for each issue is usually better, so four.
OK, they are: #3300, #3307, #3308, #3309.
Could we please make clear in those tickets that they only affect Windows? I do hope we are only proposing that FilePath be interpreted as Unicode on Windows and OS X. It would break things to decode to Unicode on Unix systems. On Unix, file paths really are strings of bytes, not an encoding of Unicode code points.
It's true that this is not reflected accurately in the type FilePath = String. FilePath should be an opaque type that allows decoding into a human-readable Unicode String.
I wonder how much code would actually break if FilePath became an opaque type, e.g. if we made it an instance of IsString. It need only change in System.IO and System.FilePath, not in the old H98 modules.
Duncan
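As a thought experiment, an opaque path type with an IsString instance might look roughly like the following. Everything here is hypothetical: the names RawPath, rawBytes, fromRawBytes and displayPath are invented for the sketch, literals are assumed to be encoded as UTF-8 (a real design would have to decide between that and the locale encoding), and displayPath is only a placeholder for a proper decoder.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.String (IsString (..))
import Data.Word (Word8)

-- | Hypothetical opaque path: exactly the bytes handed to the OS.
newtype RawPath = RawPath [Word8]
  deriving (Eq, Show)

-- | ASSUMPTION: string literals are encoded as UTF-8.
instance IsString RawPath where
  fromString = RawPath . concatMap utf8

-- | Bytes obtained from the OS pass through untouched in both directions.
rawBytes :: RawPath -> [Word8]
rawBytes (RawPath bs) = bs

fromRawBytes :: [Word8] -> RawPath
fromRawBytes = RawPath

-- | Placeholder for "decode for humans": ASCII bytes are shown as-is,
-- every other byte as U+FFFD. A real version would decode properly.
displayPath :: RawPath -> String
displayPath (RawPath bs) = map shown bs
  where
    shown b
      | b < 0x80  = chr (fromIntegral b)
      | otherwise = '\xFFFD'

-- | UTF-8 encoding of a single character.
utf8 :: Char -> [Word8]
utf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    hi s = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

-- | With OverloadedStrings, an ordinary literal can be used as a path.
examplePath :: RawPath
examplePath = "caf\xE9.txt"
```

Existing code that writes paths as literals would keep compiling under OverloadedStrings, whereas code that pattern-matches on or concatenates FilePath as a String would need changes, which is the breakage being weighed here.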
Hello Simon, Tuesday, June 16, 2009, 4:34:29 PM, you wrote:
Thanks for reminding me that openFile is also broken. It's easily fixed, so I'll look into that.
i fear that it will leave GHC libs in an inconsistent state that can drive users mad. now at least there are some rules of brokenness. when some functions are unicode-aware and some are ansi-codepaged, and this may change in every version, this "unicode" support will become completely useless. it will be like the floating Base situation, when it's impossible to write programs against Base since it's different each time
also, i think that the best way to fix windows compatibility is to provide something like this:

#if WINDOWS
type CWFilePath   = LPCTSTR      -- filename in C land
type CWFileOffset = Int64        -- filesize or filepos in C land
withCWFilePath    = withTString  -- FilePath->CWFilePath conversion
peekCWFilePath    = peekTString  -- CWFilePath->FilePath conversion
#else
type CWFilePath   = CString
type CWFileOffset = COff
withCWFilePath    = withCString
peekCWFilePath    = peekCString
#endif

and then systematically rewrite all string-related OS API calls using these definitions
how much sense will it make to have openFile and getDirContents unicode-aware, if deleteFile and even getFileStat aren't unicode-aware?
i've attached my own internal module that does this job for my own program - just for reference
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
Hello Shu-yu, Sunday, June 14, 2009, 7:41:46 AM, you wrote:
It seems like getDirectoryContents applies codepage conversion based
it's not a bug, but the old-fashioned architecture of the entire file apis
you may find my Win32Files.hs module useful - it adopts UTF-16 versions of file operations
http://downloads.sourceforge.net/freearc/FreeArc-0.51-sources.tar.bz2
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
participants (7)
- Bulat Ziganshin
- Duncan Coutts
- Judah Jacobson
- Ketil Malde
- Shu-yu Guo
- Simon Marlow
- Yitzchak Gale