RE: adding to GHC/win32 Handle operations support of Unicode filenames and files larger than 4 GB

This sounds like a good idea to me.

As far as possible, we should keep the platform-dependence restricted to the implementation of one module (System.Posix.Internals will do, even though this isn't really POSIX any more). So System.Posix.Internals exports the CFilePath/CFileOffset types, and the foreign functions that operate on them.

Alternatively (and perhaps this is better), we could hide the difference even further, and provide functions like

rmDir :: FilePath -> IO CInt

in System.Posix.Internals. Similarly for functions that operate on COff, they would take/return Integer (eg. we already have System.Posix.fdFileSize).

As regards whether to use feature tests or just #ifdef mingw32_HOST_OS, in general feature tests are the right thing, but sometimes it doesn't buy you very much when there is (and always will be) only one platform that has some particular quirk. Writing a bunch of autoconf code that would, if we're lucky, handle properly the case when some future version of Windows removes the quirk, is not a good use of developer time. Furthermore, Windows hardly ever changes APIs, they just add new ones. So I don't see occasional use of #ifdef mingw32_HOST_OS as a big deal. It's more important to organise the codebase and make sure all the #ifdefs are behind suitable abstractions.

Cheers,
Simon

On 21 November 2005 12:01, Bulat Ziganshin wrote:
Simon, what would you say about the following plan?
ghc/win32 currently doesn't support operations on files with Unicode filenames, nor can it tell/seek to positions beyond 4 GB. This is because the Unix-compatible functions open/fstat/tell/... that Mingw32 supports work only with "char[]" for filenames and off_t (which is 32 bits) for file sizes/positions.
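As an illustration of the offset limit (an editorial sketch, not code from the thread), the snippet below models a 32-bit offset field with Int32 and a 64-bit one with Int64. The names truncatedOffset, wideOffset and fiveGiB are hypothetical and exist only for this example.

```haskell
import Data.Int (Int32, Int64)

-- Hypothetical helper: what a signed 32-bit offset field keeps of a
-- requested file position (fromIntegral wraps modulo 2^32).
truncatedOffset :: Integer -> Int32
truncatedOffset = fromIntegral

-- The same position stored in a 64-bit offset field.
wideOffset :: Integer -> Int64
wideOffset = fromIntegral

-- A position 5 GiB into a file.
fiveGiB :: Integer
fiveGiB = 5 * 1024 ^ (3 :: Int)
```

In GHCi, truncatedOffset fiveGiB evaluates to 1073741824 (the position silently wraps around to 1 GiB), while wideOffset fiveGiB keeps the full 5368709120.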
Half a year ago I discussed with Simon Marlow how support for Unicode names and large files could be added to GHC. Now I have implemented my own library for such files, and got an idea of how this can be incorporated into GHC with minimal effort:
GHC currently uses the CString type to represent C-land filenames and the COff type to represent C-land file sizes/positions. We need to systematically change these usages to CFilePath and CFileOffset, respectively, defined as follows:

#ifdef mingw32_HOST_OS
type CFilePath   = LPCTSTR
type CFileOffset = Int64
withCFilePath = withTString
peekCFilePath = peekTString
#else
type CFilePath   = CString
type CFileOffset = COff
withCFilePath = withCString
peekCFilePath = peekCString
#endif

And of course change the uses of withCString/peekCString, where they are applied to filenames, to withCFilePath/peekCFilePath (this will touch the modules System.Posix.Internals, System.Directory and GHC.Handle).
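A minimal sketch of how calling code might look once it goes through withCFilePath/peekCFilePath (an editorial illustration, not the code proposed for GHC). It assumes that CWString, withCWString and peekCWString from Foreign.C.String can stand in for LPCTSTR, withTString and peekTString on Windows; roundTrip is a hypothetical helper.

```haskell
{-# LANGUAGE CPP #-}
import Foreign.C.String

#ifdef mingw32_HOST_OS
-- Assumption: CWString (wchar_t*) stands in for LPCTSTR here.
type CFilePath = CWString

withCFilePath :: FilePath -> (CFilePath -> IO a) -> IO a
withCFilePath = withCWString

peekCFilePath :: CFilePath -> IO FilePath
peekCFilePath = peekCWString
#else
type CFilePath = CString

withCFilePath :: FilePath -> (CFilePath -> IO a) -> IO a
withCFilePath = withCString

peekCFilePath :: CFilePath -> IO FilePath
peekCFilePath = peekCString
#endif

-- Hypothetical helper: marshal a name to its C representation and read
-- it straight back. The caller never sees which representation is used.
roundTrip :: FilePath -> IO FilePath
roundTrip path = withCFilePath path peekCFilePath
```

The point of the design is that only the #ifdef block knows about the platform; roundTrip and every other caller are written once.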
The last change needed is to conditionally define all "c_*" functions in System.Posix.Internals whose types contain references to filenames or offsets:

#ifdef mingw32_HOST_OS
foreign import ccall unsafe "HsBase.h _wrmdir"
   c_rmdir :: CFilePath -> IO CInt
....
#else
foreign import ccall unsafe "HsBase.h rmdir"
   c_rmdir :: CFilePath -> IO CInt
....
#endif

(Note that the actual C function used is _wrmdir on Windows and rmdir on Unix.) Of course, all such functions defined in HsBase.h also need to be defined conditionally, like:

#ifdef mingw32_HOST_OS
INLINE time_t __hscore_st_mtime ( struct _stati64* st ) { return st->st_mtime; }
#else
INLINE time_t __hscore_st_mtime ( struct stat* st ) { return st->st_mtime; }
#endif

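The conditional foreign import can also be hidden behind a wrapper of the shape rmDir :: FilePath -> IO CInt mentioned in Simon's reply. The following is a rough sketch under stated assumptions, not the actual System.Posix.Internals code: it imports the C symbols directly instead of going through HsBase.h, and it uses CWString where the proposal uses LPCTSTR.

```haskell
{-# LANGUAGE CPP, ForeignFunctionInterface #-}
import Foreign.C.String
import Foreign.C.Types (CInt(..))

#ifdef mingw32_HOST_OS
-- Windows: wide-character variant, so Unicode names survive.
foreign import ccall unsafe "_wrmdir"
  c_rmdir :: CWString -> IO CInt

rmDir :: FilePath -> IO CInt
rmDir path = withCWString path c_rmdir
#else
-- Unix: plain rmdir on a locale-encoded C string.
foreign import ccall unsafe "rmdir"
  c_rmdir :: CString -> IO CInt

rmDir :: FilePath -> IO CInt
rmDir path = withCString path c_rmdir
#endif
```

rmDir still reports failure the C way: it returns -1 and leaves the reason in errno, for example when the directory does not exist.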
That's all! Of course, this will break compatibility with current programs that directly use these c_* functions (c_open, c_lseek, c_stat and so on). This may be an issue for some libs. Does anyone really use these functions? Of course, we can go another, fully backward-compatible way, by adding some "f_*" functions and changing the high-level modules to work with these functions.

Hello Simon,

Wednesday, November 23, 2005, 2:22:02 PM, you wrote:

SM> This sounds like a good idea to me.
SM> As far as possible, we should keep the platform-dependence restricted to
SM> the implementation of one module (System.Posix.Internals will do, even
SM> though this isn't really POSIX any more). So System.Posix.Internals
SM> exports the CFilePath/CFileOffset types, and the foreign functions that
SM> operate on them.
SM> Alternatively (and perhaps this is better), we could hide the difference
SM> even further, and provide functions like
SM> rmDir :: FilePath -> IO CInt
SM> in System.Posix.Internals. Similarly for functions that operate on
SM> COff, they would take/return Integer (eg. we already have
SM> System.Posix.fdFileSize).

Well... but not well :) Let's consider the function c_open as a more informative example. Between the functions c_open and openFile there are several levels of "translation":

1) convert C types to Haskell types
2) check errno and raise an exception on error
3) convert interfaces (translate IOMode to CMode in this example)
4) convert file descriptors to Handles

Your suggestion is to build a middle-level library whose functions lie between steps 1 and 2 in this scheme:

c_open :: CFilePath -> CInt -> CMode -> IO CInt
   1) convert C types to Haskell types
open :: String -> Int -> CMode -> IO Int
   2) check for errno
   3) convert interfaces
   4) convert file descriptors to Handles

This has one obvious benefit - these functions will look very much like their C counterparts.
But on the other side, the resulting functions will belong neither to the C world nor to the Haskell world - they will use Haskell types but C-specific error signalling. Moreover, adding such middle-level functions will not help make the implementation simpler - all differences between platforms are already covered by the definitions of CFilePath/CFileOffset/withCFilePath/peekCFilePath.

Instead, I propose to make these middle-level functions after stage 2 or even 3 in this scheme - so that they will be fully in the Haskell world, only working with file descriptors instead of Handles. For example:

lseek :: Integral int => FD -> SeekMode -> int -> IO ()
lseek h direction offset = do
  -- translate the Haskell SeekMode into the C "whence" constant
  let whence :: CInt
      whence = case direction of
                 AbsoluteSeek -> sEEK_SET
                 RelativeSeek -> sEEK_CUR
                 SeekFromEnd  -> sEEK_END
  -- raise an exception if c_lseek returns -1
  throwErrnoIfMinus1Retry_ "lseek"
    $ c_lseek (fromIntegral h) (fromIntegral offset) whence

Profits:

1) The current GHC.Handle code is monolithic; it performs all 4 steps of translation in one function. This change will simplify the module and concentrate it on solving only one, the most complex, task - implementing operations on Handles via operations on FDs.

2) The part of the code in GHC.Handle that is not really GHC-specific will be moved to the standard hierarchical libraries, where it will be ready for use by other Haskell implementations.

3) Alternative Handle implementations can use these middle-level functions and not reinvent the wheel. Just as an example - in http://haskell.org/~simonmar/new-io.tar.gz the openFile code is mostly copied from the existing GHC.Handle.

4) We will get a full-fledged FD library on GHC, Hugs and NHC for free.

5) If this FD library has a Handle-like interface, it can be used as a "poor man's" drop-in replacement for the Handle library in situations where we don't need its buffering and other advanced features.

So, as a first step I propose to move the middle-level code from GHC.Handle to Posix.Internals, join the FD type definitions, replace CString with CFilePath where appropriate, and so on.
And only after this - make the changes specific to Windows. I can do it all. What would you say?
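The two candidate levels discussed in this message can be put side by side in a small self-contained sketch (an editorial illustration, not the proposed library code). Assumptions: a POSIX system where the C off_t matches GHC's COff (e.g. 64-bit Linux), the usual values 0/1/2 for SEEK_SET/SEEK_CUR/SEEK_END, and a plain Int standing in for the FD type. The names whenceOf, lseekRaw, lseekFD and demo are hypothetical.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Control.Exception (IOException, try)
import Foreign.C.Error (throwErrnoIfMinus1Retry_)
import Foreign.C.Types (CInt(..))
import System.IO (SeekMode(..))
import System.Posix.Types (COff(..))

-- Assumption: off_t has the same width as COff on this platform.
foreign import ccall unsafe "lseek"
  c_lseek :: CInt -> COff -> CInt -> IO COff

-- Assumption: the usual POSIX values of SEEK_SET, SEEK_CUR, SEEK_END.
whenceOf :: SeekMode -> CInt
whenceOf AbsoluteSeek = 0
whenceOf RelativeSeek = 1
whenceOf SeekFromEnd  = 2

-- Between steps 1 and 2: Haskell types, but failure is still a -1 result.
lseekRaw :: Int -> Integer -> SeekMode -> IO Integer
lseekRaw fd offset mode =
  fmap toInteger (c_lseek (fromIntegral fd) (fromIntegral offset) (whenceOf mode))

-- After step 2/3: Haskell types, and failure is raised as an IOError.
lseekFD :: Int -> SeekMode -> Integer -> IO ()
lseekFD fd mode offset =
  throwErrnoIfMinus1Retry_ "lseek" $
    c_lseek (fromIntegral fd) (fromIntegral offset) (whenceOf mode)

-- Seek on an invalid descriptor through both levels and report what
-- each one did: the raw result, and whether an exception was raised.
demo :: IO (Integer, Bool)
demo = do
  raw <- lseekRaw (-1) 0 AbsoluteSeek
  res <- try (lseekFD (-1) AbsoluteSeek 0) :: IO (Either IOException ())
  return (raw, either (const True) (const False) res)
```

With the invalid descriptor -1, lseekRaw hands back -1 and leaves the caller to inspect errno, whereas lseekFD raises an exception that ordinary Haskell error handling can catch.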

BZ> That's all! Of course, this will break compatibility with current
BZ> programs that directly use these c_* functions (c_open, c_lseek, c_stat
BZ> and so on). This may be an issue for some libs. Does anyone really use
BZ> these functions? Of course, we can go another, fully backward-compatible
BZ> way, by adding some "f_*" functions and changing the high-level modules
BZ> to work with these functions.

If my changes are committed only to the GHC 6.6 (HEAD) branch, the changed types of the c_* functions will not be a big problem - you change some interfaces between major releases anyway. But now I have realized that Posix.Internals is part of the libraries shared by several Haskell compilers. Can such changes break them?

Moreover, I plan to move "throwErrnoIfMinus1RetryOnBlock" to Foreign.C.Error, and sEEK_CUR/sEEK_SET/sEEK_END to Posix.Internals. Can it be done?

SM> As regards whether to use feature tests or just #ifdef mingw32_HOST_OS,
SM> in general feature tests are the right thing, but sometimes it doesn't
SM> buy you very much when there is (and always will be) only one platform
SM> that has some particular quirk. Writing a bunch of autoconf code that
SM> would, if we're lucky, handle properly the case when some future version
SM> of Windows removes the quirk, is not a good use of developer time.
SM> Furthermore, Windows hardly ever changes APIs, they just add new ones.
SM> So I don't see occasional use of #ifdef mingw32_HOST_OS as a big deal.
SM> It's more important to organise the codebase and make sure all the
SM> #ifdefs are behind suitable abstractions.

So I will write the following:

-- Support for Unicode filenames and files>4GB
#ifdef mingw32_HOST_OS

in ALL the places where this feature test must take place. It will document the code and make it easy to find/edit all these places if that is ever needed in the future.

Can I also ask several questions about the "new i/o" library? As I see it, this library solves 3 problems:

1) Having several streams in one file. Why is this better than just using hDuplicate?

2) Using different Char encodings on the streams. I think this can be done better by renaming the current hGetChar/hPutChar to hGetByte/hPutByte and adding different encodings just as different "hGetByte->hGetChar" strategies.
In this way memory buffers will always hold untranslated chars, and the Handle structure will contain the following fields:

data Handle__ = Handle__ {
  ...
  haPutChar :: (Word8 -> IO ()) -> Char -> IO (),
  haGetChar :: (IO Word8) -> IO Char
  ...
}

These fields will be modified by the hSetEncoding operation.

3) Using Handles to access memory/sockets/pipes and so on. Can this be solved in the same way as the previous problem - by defining a class Stream:

class Stream s where
  sPutBuf, sGetBuf, sSeek, ....

and incorporating an instance of this class in Handle instead of haFD:

data Handle__ = Handle__ {
  haStream :: forall s . Stream s => s

?

-- 
Best regards,
 Bulat                          mailto:bulatz@HotPOP.com
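The "encoding as a hGetByte->hGetChar strategy" idea from question 2 can be sketched as follows (an editorial illustration, not code from the new-io library or from GHC). An encoding is a function of the same type as the haGetChar field, IO Word8 -> IO Char. The names GetChar, latin1GetChar, utf8GetChar, byteSource and demoEncodings are hypothetical, and the UTF-8 decoder does no validation of malformed input.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.IORef (newIORef, readIORef, writeIORef)
import Data.Word (Word8)

-- Same shape as the haGetChar field: build a character reader from a byte reader.
type GetChar = IO Word8 -> IO Char

-- Latin-1: every byte is one character.
latin1GetChar :: GetChar
latin1GetChar getByte = fmap (chr . fromIntegral) getByte

-- UTF-8: the lead byte says how many continuation bytes follow.
-- Sketch only: malformed sequences are not detected.
utf8GetChar :: GetChar
utf8GetChar getByte = do
  b0 <- getByte
  decode (fromIntegral b0)
  where
    decode :: Int -> IO Char
    decode lead
      | lead < 0x80           = return (chr lead)
      | lead .&. 0xE0 == 0xC0 = continue 1 (lead .&. 0x1F)
      | lead .&. 0xF0 == 0xE0 = continue 2 (lead .&. 0x0F)
      | otherwise             = continue 3 (lead .&. 0x07)

    -- Read n continuation bytes, adding six payload bits each time.
    continue :: Int -> Int -> IO Char
    continue 0 acc = return (chr acc)
    continue n acc = do
      b <- getByte
      continue (n - 1) ((acc `shiftL` 6) .|. (fromIntegral b .&. 0x3F))

-- Hypothetical stand-in for hGetByte: hands out the given bytes one at a time.
byteSource :: [Word8] -> IO (IO Word8)
byteSource bytes = do
  ref <- newIORef bytes
  return $ do
    remaining <- readIORef ref
    case remaining of
      []       -> ioError (userError "byteSource: end of input")
      (b : bs) -> writeIORef ref bs >> return b

-- Decode the same character from its UTF-8 and its Latin-1 byte form.
demoEncodings :: IO (Char, Char)
demoEncodings = do
  utf8Src   <- byteSource [0xC3, 0xA9]
  latin1Src <- byteSource [0xE9]
  c1 <- utf8GetChar utf8Src
  c2 <- latin1GetChar latin1Src
  return (c1, c2)
```

Both readers yield the same character here ('\233', i.e. é); swapping the strategy changes how bytes are interpreted without touching the byte-level buffer, which is what an hSetEncoding operation would rely on.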

Hello Bulat,

Thursday, November 24, 2005, 4:17:24 AM, you wrote:

BZ> but i propose to make these middle-level functions after stage 2 or
BZ> even 3 in this scheme - so that they will be fully in Haskell world,
BZ> only work with file descriptors instead of Handles. for example:

"It's better to see once than to hear 100 times", so I attached my current Win32Files.hs to this letter. It's close to the FD library I propose; only the open/seek functions need to be rewritten. Of course, I will adapt this code to the programming style used in the hierarchical libraries by implementing all the things I wrote about in the previous letter.

We will get an I/O library defined as a stack of APIs:

1) c_* functions
2) FD API (and other Streams - memory buffer, String, socket, pipe, channel, mvar and so on)
3) Stream buffering API (Handle)
4) Char encoding for Handles API

-- 
Best regards,
 Bulat                          mailto:bulatz@HotPOP.com
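Layer 2 of this stack, a Stream class with several back ends behind one interface, might look roughly like the sketch below (an editorial illustration, not the attached Win32Files.hs). Only sPutBuf and sGetBuf are modelled, byte lists stand in for real buffers, and MemStream, AnyStream, putBytes and getBytes are hypothetical names. The existential wrapper plays the role of the haStream field suggested earlier in the thread.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.IORef (IORef, atomicModifyIORef', modifyIORef', newIORef)
import Data.Word (Word8)

-- A much-reduced Stream class: write bytes, read up to n bytes.
class Stream s where
  sPutBuf :: s -> [Word8] -> IO ()
  sGetBuf :: s -> Int -> IO [Word8]

-- One possible back end: an in-memory FIFO buffer.
newtype MemStream = MemStream (IORef [Word8])

newMemStream :: IO MemStream
newMemStream = fmap MemStream (newIORef [])

instance Stream MemStream where
  -- Append the new bytes to the end of the buffer.
  sPutBuf (MemStream ref) bytes = modifyIORef' ref (++ bytes)
  -- Take up to n bytes from the front and keep the rest.
  sGetBuf (MemStream ref) n =
    atomicModifyIORef' ref (\buf -> let (taken, rest) = splitAt n buf
                                    in (rest, taken))

-- Hides which back end is in use, as a Handle field would.
data AnyStream = forall s. Stream s => AnyStream s

putBytes :: AnyStream -> [Word8] -> IO ()
putBytes (AnyStream s) = sPutBuf s

getBytes :: AnyStream -> Int -> IO [Word8]
getBytes (AnyStream s) = sGetBuf s
```

A file-descriptor or socket back end would be another instance of the same class, and code written against AnyStream would not need to change.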