
Hello Simon,

I'm currently writing a new I/O library of sorts. One area where it still falls short of the existing Handle implementation in GHC is asynchronous I/O. Can you briefly describe how this is done in GHC, and in particular why multiple buffers are used? I currently use just one buffer, which can hold either read or write data but not both; this buffer is simply flushed before switching the "mode" of operation. Am I losing something with this simplified algorithm?

Moreover, I have an idea for implementing async I/O without complex bureaucracy: use memory-mapped files, possibly together with multiple buffers. For example, we could allocate four 16kb buffers. When one buffer is filled with written data, the program unmaps it and switches to the next buffer. I haven't tested it, but the OS can guess that an unmapped buffer should now be written to disk asynchronously. The same goes for reading: when we have completely read one buffer, we unmap it, switch to the second buffer and map the third, so that the OS can fill the third buffer asynchronously while we are reading the second. Should this work, at least on the main desktop OSes? mmap/VirtualAlloc are, afaik, available on all GHC-supported platforms, so this should work anywhere. Of course, this scheme leaves out async I/O on sockets on Windows.

--
Best regards,
 Bulat                            mailto:bulatz@HotPOP.com

On Fri, 2006-01-27 at 13:10 +0300, Bulat Ziganshin wrote:
Moreover, I have an idea for implementing async I/O without complex bureaucracy: use memory-mapped files, possibly together with multiple buffers. For example, we could allocate four 16kb buffers. When one buffer is filled with written data, the program unmaps it and switches to the next buffer. I haven't tested it, but the OS can guess that an unmapped buffer should now be written to disk asynchronously. The same goes for reading: when we have completely read one buffer, we unmap it, switch to the second buffer and map the third, so that the OS can fill the third buffer asynchronously while we are reading the second. Should this work, at least on the main desktop OSes?
On Linux and probably other unix-like OSes I don't think this would be any different from using read/write.

On Linux, read and mmap use the same underlying mechanism - the page cache. The only difference is that with mmap you get zero-copy access to the page cache. However, frequent mapping and unmapping may eliminate that advantage. Either way there is no difference in how asynchronous the operations are.

For the write case, write() is already asynchronous. The data is just copied into the page cache and the OS flushes it some time later, at a time of its own choosing.

For the read case, mmap()ing and then reading will still be synchronous, because the pages are not actually read from disk until the memory is accessed, at which point the thread that touches that memory will block to perform the I/O. You can use posix_madvise() hints to ask the OS to pre-fault the mmapped pages, however the actual implementation of this is still synchronous.

The best you can do is to use hints to get the OS to do plenty of read-ahead (if you are going to access the file sequentially). This will populate the page cache, and so hopefully the data will be immediately available when you ask for it shortly afterwards. However, even this will be of limited additional benefit, since the OS can automatically detect linear access patterns and increase the read-ahead accordingly. On a related note, if you know that a file will be accessed linearly and then not re-read after that, you can give a hint not to retain the data in the page cache after it has been read by the process.

Duncan
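For concreteness, here is a minimal sketch of what issuing such hints could look like from Haskell via the FFI, using posix_fadvise (the fd-level counterpart of the posix_madvise hints mentioned above). This is only an illustration: the advice constants are the Linux values, and a real binding would take them from the C headers via hsc2hs.

{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.Types    (CInt(..))
import System.Posix.Types (COff(..), Fd(..))

foreign import ccall unsafe "posix_fadvise"
    c_posix_fadvise :: CInt -> COff -> COff -> CInt -> IO CInt

posixFadvSequential, posixFadvDontNeed :: CInt
posixFadvSequential = 2   -- POSIX_FADV_SEQUENTIAL (Linux value; an assumption)
posixFadvDontNeed   = 4   -- POSIX_FADV_DONTNEED   (Linux value; an assumption)

-- Ask for aggressive read-ahead over the whole file
-- (offset 0, length 0 means "to the end of the file").
adviseSequential :: Fd -> IO ()
adviseSequential (Fd fd) = do
    _ <- c_posix_fadvise fd 0 0 posixFadvSequential
    return ()

-- Tell the OS we will not re-read the file, so it need not keep
-- the pages in the page cache.
adviseDontNeed :: Fd -> IO ()
adviseDontNeed (Fd fd) = do
    _ <- c_posix_fadvise fd 0 0 posixFadvDontNeed
    return ()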

Hello Duncan, Friday, January 27, 2006, 4:00:28 PM, you wrote:
Moreover, I have an idea for implementing async I/O without complex bureaucracy: use memory-mapped files, possibly together with multiple buffers. For example, we could allocate four 16kb buffers. When one buffer is filled with written data, the program unmaps it and switches to the next buffer. I haven't tested it, but the OS can guess that an unmapped buffer should now be written to disk asynchronously. The same goes for reading: when we have completely read one buffer, we unmap it, switch to the second buffer and map the third, so that the OS can fill the third buffer asynchronously while we are reading the second. Should this work, at least on the main desktop OSes?
DC> On Linux and probably other unix-like OSes I don't think this would be
DC> any different from using read/write.

DC> On Linux, read and mmap use the same underlying mechanism - the page
DC> cache. The only difference is that with mmap you get zero-copy access to
DC> the page cache. However frequent mapping and unmapping may eliminate
DC> that advantage. Either way there is no difference in how asynchronous
DC> the operations are.

Yes, it is exactly this bit of performance I want to save, now that I have optimised all the other overheads on the text I/O path.

In other words, I'm interested in having zero-wait operation for both reading and writing, i.e. in a sequence of getChar or putChar actions there should be no waiting on any individual action - provided, of course, that the disk is fast enough. Put differently, the speed of such I/O programs should be the same as if we were just writing the data to memory.

GHC's current Handle implementation uses rather complex machinery for async reading and writing, and I would even say that most of the Handle implementation's complexity is due to this async machinery. So I want to know what exactly this implementation accomplishes, and whether we can implement async operation much more simply by using mmap.

The word "async" is overloaded here - I'm mostly interested in having zero overhead in single-threaded operation, while GHC's optimisation, afair, is more about overlapping I/O in one thread with computation in another. So I'm looking for the fastest and easiest-to-implement scheme. What do you propose?

--
Best regards,
 Bulat                            mailto:bulatz@HotPOP.com

On Fri, 2006-01-27 at 18:01 +0300, Bulat Ziganshin wrote:
Hello Duncan,
Friday, January 27, 2006, 4:00:28 PM, you wrote:
Moreover, I have an idea for implementing async I/O without complex bureaucracy: use memory-mapped files, possibly together with multiple buffers. For example, we could allocate four 16kb buffers. When one buffer is filled with written data, the program unmaps it and switches to the next buffer. I haven't tested it, but the OS can guess that an unmapped buffer should now be written to disk asynchronously. The same goes for reading: when we have completely read one buffer, we unmap it, switch to the second buffer and map the third, so that the OS can fill the third buffer asynchronously while we are reading the second. Should this work, at least on the main desktop OSes?
DC> On Linux and probably other unix-like OSes I don't think this would be
DC> any different from using read/write.
DC> On Linux, read and mmap use the same underlying mechanism - the page
DC> cache. The only difference is that with mmap you get zero-copy access to
DC> the page cache. However frequent mapping and unmapping may eliminate
DC> that advantage. Either way there is no difference in how asynchronous
DC> the operations are.
Yes, it is exactly this bit of performance I want to save, now that I have optimised all the other overheads on the text I/O path.
There is a trade-off: using mmap gives you zero-copy access to the page cache, however there is a not-insignificant performance overhead in setting up and tearing down memory mappings. This is true on unix and win32. So for small writes (e.g. 4k blocks) it is likely to be cheaper to just use read()/write() on page-aligned buffers rather than use mmap. You would need to do benchmarks on each platform to see which method is quicker. Given the code complexity that other people have mentioned, I do not think it would be worth it.

Using page-aligned and page-sized buffers can help read()/write() performance on some OSes, like some of the BSDs.
In other words, I'm interested in having zero-wait operation for both reading and writing,
As I said, that is not possible with either read() or mmapped reads. Conversely, it works automatically with write() and mmapped writes. Zero-copy and zero-wait are not the same thing.
i.e. in a sequence of getChar or putChar actions there should be no waiting on any individual action - provided, of course, that the disk is fast enough. Put differently, the speed of such I/O programs should be the same as if we were just writing the data to memory.
GHC's current Handle implementation uses rather complex machinery for async reading and writing, and I would even say that most of the Handle implementation's complexity is due to this async machinery. So I want to know what exactly this implementation accomplishes, and whether we can implement async operation much more simply by using mmap.
the word "async" is overloaded here - i'm most interested in having zero-overhead in single-threaded operation, while GHC's optimization, afair, is more about overlapping I/O in one thread with computations in another. so i'm searching for fastest and easiest-to-implement scheme. what you propose?
An important factor for optimising IO performance is using sufficiently large block sizes to avoid making frequent kernel calls. That includes read()/write() calls and mmap()/munmap() calls.

Perhaps it is possible to move the complexity needed for the lazy hPutStr case into the hPutStr implementation rather than the Handle implementation. For example, perhaps it'd be possible for the Handle to have just one buffer but also a method for writing out an external buffer that is passed to it. Then hPutStr would allocate its own buffer, evaluate the string, copying it into the buffer. Then it would call on the Handle to write out the buffer. The Handle would flush its existing internal buffer and write out the extra buffer.

Perhaps a better solution for your single-threaded operation case is to have a handle type that is a bit specialised and does not have to deal with the general case. If we're going to get an I/O system that supports various layers and implementations, then perhaps you could have one that implements only the minimal possible I/O class. That could not use any thread locks (i.e. it'd not work predictably for multiple Haskell threads) and use mmap on the entire file. So you wouldn't get the normal feature that a file extends at the end as it's written to; it'd need a method for determining the size at the beginning, or it could extend the file in large chunks. On the other hand it would not need to manage any buffers, since reads and writes would just be reads to/from memory.

So it'd depend on what the API of the low-level layers of the new I/O system is like as to whether such a simple and limited implementation would be possible.

Duncan
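As a rough sketch of that second suggestion - the class, the method name and the one-byte-per-Char truncation below are all inventions for the example, not an actual GHC or library API:

import Control.Monad (when)
import Data.Char (ord)
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr)
import Foreign.Storable (pokeByteOff)

-- Hypothetical interface: a handle that can write out a buffer supplied by
-- the caller (flushing its own internal buffer first to preserve ordering).
class BufferedOutput h where
    vWriteExtBuffer :: h -> Ptr Word8 -> Int -> IO ()

-- hPutStr evaluates the string into its *own* buffer, without holding any
-- lock on the handle, and only then hands the finished chunks over.
hPutStrSketch :: BufferedOutput h => h -> String -> IO ()
hPutStrSketch h s = allocaBytes bufSize $ \buf -> go buf s 0
  where
    bufSize = 8192

    go buf cs n
      | n == bufSize = vWriteExtBuffer h buf n >> go buf cs 0
    go buf []     n  = when (n > 0) (vWriteExtBuffer h buf n)
    go buf (c:cs) n  = do
        -- truncate each Char to one byte, purely to keep the sketch short
        pokeByteOff buf n (fromIntegral (ord c) :: Word8)
        go buf cs (n + 1)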

Hello Duncan, Saturday, January 28, 2006, 3:08:04 PM, you wrote:
Yes, it is exactly this bit of performance I want to save, now that I have optimised all the other overheads on the text I/O path.
DC> There is a trade off, using mmap gives you zero-copy access to the page
DC> cache however there is a not-insignificant performance overhead in
DC> setting up and tearing down memory mappings. This is true on unix and
DC> win32. So for small writes (eg 4k blocks) it is likely to be cheaper to
DC> just use read()/write() on page aligned buffers rather than use mmap.
DC> You would need to do benchmarks on each platform to see which method is
DC> quicker. Given the code complexity that other people have mentioned I do
DC> not think it would be worth it.

I use 64k buffers, and I tried memory-mapped files last night. It is not easy to implement this properly and then to ensure good speed. At the very least, Windows flushes buffers filled via mmap very lazily: when I wrote a 1 GB file in this mode, Windows preferred to swap out all the running programs and even itself, while still delaying the writing of already-unmapped data!

DC> Using page aligned and sized buffers can help read()/write() performance
DC> on some OSes like some of the BSDs.

I will try to carve an aligned 64k buffer out of a 128k block, and I will publish that code here so anyone can test it on their OS.
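A minimal sketch of one way to do that carving out - purely illustrative, not the code referred to above; 4096 as the page size and fdReadBuf as the read call are assumptions:

import Data.Word (Word8)
import Foreign.Marshal.Alloc (free, mallocBytes)
import Foreign.Ptr (Ptr, alignPtr)
import System.Posix.IO (fdReadBuf)
import System.Posix.Types (Fd)

-- Over-allocate by one page, then round the pointer up to the next page
-- boundary.  (No exception handling; a real version would use bracket.)
withAlignedBuffer :: Int -> (Ptr Word8 -> IO a) -> IO a
withAlignedBuffer size action = do
    raw    <- mallocBytes (size + pageSize)
    result <- action (raw `alignPtr` pageSize)
    free raw
    return result
  where
    pageSize = 4096   -- assumed page size

-- Read one 64k chunk through a page-aligned buffer with a plain read().
readChunk :: Fd -> IO Int
readChunk fd =
    withAlignedBuffer 65536 $ \buf ->
        fmap fromIntegral (fdReadBuf fd buf 65536)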
In other words, I'm interested in having zero-wait operation for both reading and writing,
DC> As I said that is not possible with either read() or mmaped read.
DC> Conversely it works automatically with write() and mmaped writes.
DC> Zero-copy and zero-wait are not the same thing.

I mean that mmap guarantees us zero-copy operation, and I would like to use mmap in such a way that zero-wait operation can be ensured as well.

DC> An important factor for optimising IO performance is using sufficiently
DC> large block sizes to avoid making frequent kernel calls. That includes
DC> read()/write() calls and mmap()/munmap() calls.

That's true, and easy to implement.

DC> Perhaps it is possible to move the complexity needed for the lazy
DC> hPutStr case into the hPutStr implementation rather than the Handle
DC> implementation. For example perhaps it'd be possible for the Handle to
DC> just have one buffer but to have a method for writing out an external
DC> buffer that is passed to it. Then hPutStr would allocate its own
DC> buffer, evaluate the string, copying it into the buffer. Then it would
DC> call on the Handle to write out the buffer. The Handle would flush its
DC> existing internal buffer and write out the extra buffer.

1) The "lazy hPutStr" case is not some rare corner case. With current GHC we cannot distinguish strict from lazy strings, so for any hPutStr invocation we have to assume that evaluating its argument can perform side effects. That is the whole problem: we want to optimise hPutStr for fast operation on strict strings, but we must ensure it still works correctly with slow lazy strings that may have arbitrary side effects.

2) The scheme above can be implemented using hPutBuf to write this additional buffer. It is just a little less efficient (though not by much - memcpy runs about 10 times faster than traversing a [Char]). On the other hand, Simon did not take into account that locking itself is rather slow, and using two locks instead of one makes his scheme somewhat slower, especially for small strings.

DC> Perhaps a better solution for your single-threaded operation case is to
DC> have a handle type that is a bit specialised and does not have to deal
DC> with the general case. If we're going to get an I/O system that supports
DC> various layers and implementations then perhaps you could have one
DC> that implements only the minimal possible I/O class. That could not use
DC> any thread locks (ie it'd not work predictably for multiple Haskell
DC> threads)

Moreover, we can implement locking as a special "converter" type that can be applied to any mutable object - a stream, a collection, a counter. That allows us to simplify the implementations and to add locking only to those Streams that really need it. Like this:

  h <- openFD "test" >>= addUsingOfSelect >>= addBuffering 65536 >>= addCharEncoding utf8 >>= attachUserData dictionary >>= addLocking

DC> and use mmap on the entire file. So you wouldn't get the normal
DC> feature that a file extends at the end as it's written to, it'd need a
DC> method for determining the size at the beginning or extending it in
DC> large chunks. On the other hand it would not need to manage any buffers
DC> since reads and writes would just be reads to/from memory.

Yes, I have done that. But a simple MapViewOfFile/UnmapViewOfFile scheme does not work well enough, at least for writing: Windows is in no hurry to flush these buffers, even after they are unmapped, and using FlushViewOfFile results in a synchronous flush of the buffer to the cache.
So I need to try calling FlushViewOfFile in a separate thread, as GHC does for its I/O.

DC> So it'd depend on what the API of the low level layers of the new I/O
DC> system are like as to whether such a simple and limited implementation
DC> would be possible.

That is no problem. The memory-mapped file simply implements an API that tells the user the address and size of the next buffer to fill or read. The same interface is also used for plain memory buffers and for interprocess communication via shared memory:

-- | Receive the next buffer which contains data / must be filled with data
vReceiveBuf :: (Integral size) => h -> ReadWrite -> m (Ptr a, size)

-- | Release the buffer after reading `len` bytes / send the buffer filled with `len` bytes
vSendBuf :: (Integral size, Integral len) => h -> Ptr a -> size -> len -> m ()

--
Best regards,
 Bulat                            mailto:bulatz@HotPOP.com
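To illustrate how that buffer-handing interface composes, here is a small self-contained sketch built around just those two operations. The class below is cut down to monomorphic Int sizes, and "a zero-length buffer means end of stream" is an assumption that the signatures above do not actually specify.

import Data.Word (Word8)
import Foreign.Marshal.Utils (copyBytes)
import Foreign.Ptr (Ptr)

data ReadWrite = ReadMode | WriteMode

-- A cut-down, monomorphic version of the two operations above.
class BufferStream h where
    vReceiveBuf :: h -> ReadWrite -> IO (Ptr Word8, Int)
    vSendBuf    :: h -> Ptr Word8 -> Int -> Int -> IO ()

-- Copy one stream to another without any intermediate buffer of our own,
-- working directly on the buffers the two streams expose.  Assumes the
-- source re-delivers any unconsumed tail on the next vReceiveBuf.
copyStream :: (BufferStream src, BufferStream dst) => src -> dst -> IO ()
copyStream src dst = loop
  where
    loop = do
        (inBuf, inLen) <- vReceiveBuf src ReadMode
        if inLen == 0
          then return ()                        -- zero bytes: end of stream (assumed)
          else do
            (outBuf, outLen) <- vReceiveBuf dst WriteMode
            let n = min inLen outLen
            copyBytes outBuf inBuf n
            vSendBuf dst outBuf outLen n        -- release output buffer: n bytes filled
            vSendBuf src inBuf  inLen  n        -- release input buffer: n bytes consumed
            loop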

Bulat Ziganshin wrote:
Moreover, we can implement locking as a special "converter" type that can be applied to any mutable object - a stream, a collection, a counter. That allows us to simplify the implementations and to add locking only to those Streams that really need it. Like this:
h <- openFD "test" >>= addUsingOfSelect >>= addBuffering 65536 >>= addCharEncoding utf8 >>= attachUserData dictionary >>= addLocking
This is really nice - exactly what I'd like to see in the I/O library. The trick is making it perform well, though... but I'm sure that's your main focus too.

Still, I'm not sure that putting both input and output streams in the same type is the right thing; I seem to remember a lot of things being simpler with them separate.

Cheers,
Simon

Hello Simon,

Wednesday, February 01, 2006, 2:26:45 PM, you wrote:

SM> Bulat Ziganshin wrote:

I can now report on memory-mapped files on Windows. It seems that their support is intended not for implementing sequential file access, but rather for database-like files that should be held in memory as much as possible and that require random access. At least, I used 64k buffers and unmapped each one as soon as it was filled. Nevertheless, Windows delayed writing the already-filled and unmapped buffers to disk for as long as possible - it even preferred to swap out the OS and the running programs rather than flush this data (once memory was 80% filled with the data of the memory-mapped file)! When I tried to read this file sequentially, the results were slower than for simple read() calls, and I think that is because the data was not prefetched despite the sequential access. I will implement this stream type eventually, but it will be more for special purposes than for everyday use or as an FD replacement.
Moreover, we can implement locking as a special "converter" type that can be applied to any mutable object - a stream, a collection, a counter. That allows us to simplify the implementations and to add locking only to those Streams that really need it. Like this:
h <- openFD "test" >>= addUsingOfSelect >>= addBuffering 65536 >>= addCharEncoding utf8 >>= attachUserData dictionary >>= addLocking
SM> This is really nice - exactly what I'd like to see in the I/O library.
SM> The trick is making it perform well, though... but I'm sure that's your
SM> main focus too.

The basic idea is very simple: every stream implements the Stream interface, which is a clone of the Handle interface. Stream transformers are just types that take Stream parameters and in turn implement the same interface; all Stream operations are translated into calls on the inner Stream. A typical example:

data WithLocking h = WithLocking h (MVar ())

-- | Add a lock to an object to ensure its proper use in concurrent threads
withLocking h = do
    mvar <- newMVar ()
    return (WithLocking h mvar)

instance (Stream IO h) => Stream IO (WithLocking h) where
    vMkIOError (WithLocking h _)        = vMkIOError h
    vClose     (WithLocking h mvar)     = withMVar mvar (\_ -> vClose h)
    vIsEOF     (WithLocking h mvar)     = withMVar mvar (\_ -> vIsEOF h)
    vReady     (WithLocking h mvar)     = withMVar mvar (\_ -> vReady h)
    vSeek      (WithLocking h mvar) a b = withMVar mvar (\_ -> vSeek h a b)
    ...............

As a result, these stream transformers can be applied recursively, forming a complex type that is still a proper Stream! For example, the type of 'h' in

  h <- openRawFD "test" WriteMode >>= bufferBlockStream >>= withEncoding utf8 >>= withLocking

will be "WithLocking (WithEncoding (BufferedBlockStream FD))".

Moreover, this locking transformer is universal - it can be applied, for example, to an IOArray, and the resulting value will still implement the MArray interface! That is one of the reasons I advocate using class interfaces for all standard libraries, and especially for data structures.

SM> Still, I'm not sure that putting both input and output streams in the
SM> same type is the right thing, I seem to remember a lot of things being
SM> simpler with them separate.

I'm interested to hear which things were simpler. In my design it seems to be the other way around - I would need to write far more code to separate read, write and read-write FDs, for example. Maybe the difference is that I have one large Stream class that implements all the functions, while your design had a lot of classes with a few functions in each.

--
Best regards,
 Bulat                            mailto:bulatz@HotPOP.com

On 27.01 13:10, Bulat Ziganshin wrote:
I'm currently writing a new I/O library of sorts. One area where it still falls short of the existing Handle implementation in GHC is asynchronous I/O. Can you briefly describe how this is done in GHC, and in particular why multiple buffers are used?
One simple optimization is that you can omit all buffering with unbuffered operation. Then simply add the buffer (which is ok because Handles are mutable) if the user ever calls hLookAhead.
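A one-character version of that idea, as a sketch (the wrapper type and the use of an ordinary Handle underneath are inventions for the illustration):

import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import System.IO (Handle, hGetChar)

-- An "unbuffered" stream carries only a pushback slot, which stays Nothing
-- until the first look-ahead.
data Unbuffered = Unbuffered Handle (IORef (Maybe Char))

mkUnbuffered :: Handle -> IO Unbuffered
mkUnbuffered h = Unbuffered h `fmap` newIORef Nothing

lookAheadU :: Unbuffered -> IO Char
lookAheadU (Unbuffered h ref) = do
    cached <- readIORef ref
    case cached of
      Just c  -> return c
      Nothing -> do
          c <- hGetChar h          -- the first look-ahead creates the "buffer"
          writeIORef ref (Just c)
          return c

getCharU :: Unbuffered -> IO Char
getCharU (Unbuffered h ref) = do
    cached <- readIORef ref
    case cached of
      Just c  -> writeIORef ref Nothing >> return c
      Nothing -> hGetChar h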
Moreover, I have an idea for implementing async I/O without complex bureaucracy: use memory-mapped files, possibly together with multiple buffers. For example, we could allocate four 16kb buffers. When one buffer is filled with written data, the program unmaps it and switches to the next buffer. I haven't tested it, but the OS can guess that an unmapped buffer should now be written to disk asynchronously. The same goes for reading: when we have completely read one buffer, we unmap it, switch to the second buffer and map the third, so that the OS can fill the third buffer asynchronously while we are reading the second. Should this work, at least on the main desktop OSes?
Please no. There are multiple reasons to avoid mmapped files:

1) They make very few performance guarantees for reading - i.e. if a Haskell thread touches memory which has not yet been read from the file, that causes I/O and all the other Haskell threads are blocked too.

2) The time of writes is unpredictable, which may make implementing hFlush harder (not sure about this).

3) Not all file descriptors will support it - i.e. we will need the read/write path in any case.

4) Mmap cannot be used for random access to arbitrary files, since they may be larger than the address space. This means some kind of window needs to be implemented - and that is easily done with read/write.

- Einar Karttunen

Hello Einar,

Friday, January 27, 2006, 4:19:55 PM, you wrote:

EK> One simple optimization is that you can omit all buffering with
EK> unbuffered operation. Then simply add the buffer (which is ok
EK> because Handles are mutable) if the user ever calls hLookAhead.

Yes, that is what I do.
Moreover, I have an idea for implementing async I/O without complex bureaucracy: use memory-mapped files, possibly together with multiple buffers. For example, we could allocate four 16kb buffers. When one buffer is filled with written data, the program unmaps it and switches to the next buffer. I haven't tested it, but the OS can guess that an unmapped buffer should now be written to disk asynchronously. The same goes for reading: when we have completely read one buffer, we unmap it, switch to the second buffer and map the third, so that the OS can fill the third buffer asynchronously while we are reading the second. Should this work, at least on the main desktop OSes?
EK> Please no. There are multiple reasons to avoid mmapped files.

EK> 1) They make very few performance guarantees for reading
EK>    (i.e. a Haskell thread touches memory which has not yet
EK>    been read from the file causing IO and all the other
EK>    Haskell threads are blocked too)

Yes, it seems that using memory-mapped files may slow such a program down.

EK> 2) The time of writes is unpredictable making implementing a
EK>    hFlush harder? (not sure about this)

I can only speak for Windows - there, FlushViewOfFile() does it.

EK> 3) Not all file descriptors will support it - i.e. we will
EK>    need the read/write path in any case.

I don't understand what you mean - can you please explain further?

EK> 4) Mmap cannot be used for random access for arbitrary files
EK>    since they may be larger than the address space. This means
EK>    some kind of window needs to be implemented - and this is
EK>    easily done with read/write.

That's not true, at least for Windows - see MapViewOfFile().

--
Best regards,
 Bulat                            mailto:bulatz@HotPOP.com

Bulat Ziganshin wrote:
I'm currently writing a new I/O library of sorts. One area where it still falls short of the existing Handle implementation in GHC is asynchronous I/O. Can you briefly describe how this is done in GHC, and in particular why multiple buffers are used?
Multiple buffers were introduced to cope with the semantics we wanted for hPutStr. The problem is that you don't want hPutStr to hold a lock on the Handle while it evaluates its argument list, because that could take arbitrary time. Furthermore, things like this:

  putStr (trace "foo" "bar")

used to cause deadlocks, because putStr holds the lock, evaluates its argument list, which causes trace to also attempt to acquire the lock on stdout, leading to deadlock.

So, putStr first grabs a buffer from the Handle, then unlocks the Handle while it fills up the buffer, then it takes the lock again to write the buffer. Since another thread might try to putStr while the lock is released, we need multiple buffers.

For async IO on Unix, we use non-blocking read() calls, and if read() indicates that we need to block, we send a request to the IO Manager thread (see GHC.Conc) which calls select() on behalf of all the threads waiting for I/O. For async IO on Windows, we either use the threaded RTS's blocking foreign call mechanism to invoke read(), or the non-threaded RTS has a similar mechanism internally.

We ought to be using the various alternatives to select(), but we haven't got around to that yet.
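A minimal sketch of that Unix path - not GHC's actual code: the descriptor is assumed to already be in non-blocking mode, and when read() reports that it would block, the calling Haskell thread parks itself with threadWaitRead, which is serviced by the IO manager's select() loop.

{-# LANGUAGE ForeignFunctionInterface #-}
import Control.Concurrent (threadWaitRead)
import Data.Word (Word8)
import Foreign.C.Error (eAGAIN, eWOULDBLOCK, getErrno, throwErrno)
import Foreign.C.Types (CInt(..), CSize(..))
import Foreign.Ptr (Ptr)
import System.Posix.Types (CSsize(..), Fd(..))

foreign import ccall unsafe "read"
    c_read :: CInt -> Ptr Word8 -> CSize -> IO CSsize

-- Read up to 'len' bytes; only this Haskell thread blocks while waiting.
readNonBlocking :: Fd -> Ptr Word8 -> Int -> IO Int
readNonBlocking fd@(Fd cfd) buf len = loop
  where
    loop = do
        n <- c_read cfd buf (fromIntegral len)
        if n >= 0
          then return (fromIntegral n)
          else do
              err <- getErrno
              if err == eAGAIN || err == eWOULDBLOCK
                then threadWaitRead fd >> loop   -- wait via the IO manager, then retry
                else throwErrno "readNonBlocking"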
Moreover, I have an idea for implementing async I/O without complex bureaucracy: use memory-mapped files, possibly together with multiple buffers.
I don't think we should restrict the implementation to mmap'd files, for all the reasons that Einar gave. Lots of things aren't mmapable, mainly.

My vision for an I/O library is this:

- a single class supporting binary input (resp. output) that is implemented by various transports: files, sockets, mmap'd files, memory and arrays. Windowed mmap is an option here too.

- layers of binary filters on top of this: you could add buffering, and compression/decompression.

- a layer of text translation at the top.

This is more or less how the Stream-based I/O library that I was working on is structured. The binary I/O library would talk to a binary transport, perhaps with a layer of buffering, whereas text-based applications talk to the text layer.

Cheers,
Simon
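As a sketch of how those three layers could hang together as type classes - all names here are invented for the illustration, not the actual Streams library:

import Data.Word (Word8)
import Foreign.Ptr (Ptr)

-- One class for the binary transport layer; files, sockets, mmap windows
-- and in-memory arrays would all be instances.
class BinaryTransport t where
    tReadBuf  :: t -> Ptr Word8 -> Int -> IO Int   -- returns bytes actually read
    tWriteBuf :: t -> Ptr Word8 -> Int -> IO ()

-- Binary filters wrap one transport and present the same interface
-- (buffering, compression, ...).  Buffer state is omitted to keep the
-- sketch short, so this particular "filter" just passes calls through.
newtype Buffered t = Buffered t

instance BinaryTransport t => BinaryTransport (Buffered t) where
    tReadBuf  (Buffered t) = tReadBuf  t
    tWriteBuf (Buffered t) = tWriteBuf t

-- The text layer sits on top of any transport and does the character
-- encoding/decoding; text-based applications talk only to this class.
class TextStream s where
    tGetChar :: s -> IO Char
    tPutChar :: s -> Char -> IO ()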

Hello Simon, Friday, January 27, 2006, 7:25:44 PM, you wrote:
I'm currently writing a new I/O library of sorts. One area where it still falls short of the existing Handle implementation in GHC is asynchronous I/O. Can you briefly describe how this is done in GHC, and in particular why multiple buffers are used?
SM> Multiple buffers were introduced to cope with the semantics we wanted
SM> for hPutStr.

Thank you. I had read the hPutStr comments, but had not understood that this problem is the only reason for introducing multiple buffers.

SM> The problem is that you don't want hPutStr to hold a lock
SM> on the Handle while it evaluates its argument list, because that could
SM> take arbitrary time. Furthermore, things like this:
SM>   putStr (trace "foo" "bar")
SM> used to cause deadlocks, because putStr holds the lock, evaluates its
SM> argument list, which causes trace to also attempt to acquire the lock on
SM> stdout, leading to deadlock.

SM> So, putStr first grabs a buffer from the Handle, then unlocks the Handle
SM> while it fills up the buffer, then it takes the lock again to write the
SM> buffer. Since another thread might try to putStr while the lock is
SM> released, we need multiple buffers.

I don't understand the last sentence. Weren't you talking about problems with performing I/O inside the computation of putStr's argument, not about another thread? I understand that locks are basically needed because multiple threads can try to do I/O on the same Handle simultaneously.

SM> For async IO on Unix, we use non-blocking read() calls, and if read()
SM> indicates that we need to block, we send a request to the IO Manager
SM> thread (see GHC.Conc) which calls select() on behalf of all the threads
SM> waiting for I/O. For async IO on Windows, we either use the threaded
SM> RTS's blocking foreign call mechanism to invoke read(), or the
SM> non-threaded RTS has a similar mechanism internally.

So async I/O in GHC has nothing in common with "zero-wait operation" in a single-threaded setting, and only helps to overlap I/O in one thread with execution of other threads?

SM> We ought to be using the various alternatives to select(), but we
SM> haven't got around to that yet.

Yes, I have read those threads and even remember the Trac ticket about it. Btw, in the typeclass-based I/O library this facility can be added as an additional middle layer, in the same way as buffering and Char encoding. I even think it could be done as a third-party library, without any changes to the main library itself.
Moreover, I have an idea for implementing async I/O without complex bureaucracy: use memory-mapped files, possibly together with multiple buffers.
SM> I don't think we should restrict the implementation to mmap'd files, for
SM> all the reasons that Einar gave. Lots of things aren't mmapable, mainly.

I'm interested because mmap can be used to speed up I/O-bound programs. But it seems that memory-mapped files can't be used to overlap I/O in multi-threaded applications. Anyway, I use a class-based design, so at the very least we can provide memory-mapped files as one of the Stream instances.

SM> My vision for an I/O library is this:
SM> - a single class supporting binary input (resp. output) that is
SM>   implemented by various transports: files, sockets, mmap'd files,
SM>   memory and arrays. Windowed mmap is an option here too.

I don't consider fully-mapped files a separate instance, because they can be simulated by using window-mapped files with a large window.

SM> - layers of binary filters on top of this: you could add buffering,
SM>   and compression/decompression.
SM> - a layer of text translation at the top.

SM> This is more or less how the Stream-based I/O library that I was working
SM> on is structured.

SM> The binary I/O library would talk to a binary transport, perhaps with a
SM> layer of buffering, whereas text-based applications talk to the text layer.

That is more or less what I do. It is no wonder - I was substantially influenced by the design of your "new i/o" library. The only difference is that I use one Stream class for all streams.

--
Best regards,
 Bulat                            mailto:bulatz@HotPOP.com
participants (4)

- Bulat Ziganshin
- Duncan Coutts
- Einar Karttunen
- Simon Marlow