Re: Streams: the extensible I/O library

Bulat Ziganshin wrote:
I have developed a new I/O library that IMHO is so sharp that it can eventually replace the current I/O facilities based on using Handles. The main advantage of the new library is its strong modular design using typeclasses.
I've taken a brief look, and I must say it's nice (but you knew that anyway :-)). My initial thoughts: I would prefer to see more type structure, rather than putting everything in the Stream class. You have classes ByteStream, BlockStream etc, but these are just renamings of the Stream class. There are many compositions that are illegal, but we don't find out until runtime; it would make a lot more sense to me to expose this structure in the type system. My view is that the most basic level of stream is a byte stream, supporting only two operations: read an array of bytes and write an array of bytes, i.e. vGetBuf/vPutBuf. This makes implementing a stream, or transformer, much easier and shorter. Also I'd like to see separate input/output streams for even more type safety, and I believe simplicity, but this is less important than separating byte streams from text streams. I believe certain other operations would benefit from being moved into separate classes: eg. vSeek into a Seekable class, vSetBuffering into a Buffered class, and so on. This will improve performance too - your Stream class has dictionaries with 20+ elements. I see that buffering works on vPutChar/vGetChar, and yet you seem to be buffering bytes - which is it? Are you supposed to buffer before or after doing character encoding? It seems before, because otherwise buffering will strip out all but the low 8 bits of each character. Using a more explicit type structure would help a lot here. Incedentally, I'm suprised that you can use list-based character encoding/decoding and still get good performance, I expected to need to do encoding directly between buffers. Still, as I said, I think the general approach is excellent, and is definitely heading in the right direction. Oh, and some of the code is GPL'd, which is a problem for incorporation in standard libraries. This is just something to bear in mind if the aim is for this to be a candidate for a/the standard IO library. Cheers, Simon

Hello Simon, Wednesday, February 08, 2006, 2:58:30 PM, you wrote: SM> I would prefer to see more type structure, rather than putting SM> everything in the Stream class. You have classes ByteStream, SM> BlockStream etc, but these are just renamings of the Stream class. There SM> are many compositions that are illegal, but we don't find out until SM> runtime; it would make a lot more sense to me to expose this structure SM> in the type system. i initially used normal splitted classes (vGetBuf was in BlockStream) and so on, but come accross problems with the type classes system and decided to simplify the design. now i feel himself more confident with the classes, feel that i know source of my previous problems and therefore slowly migrate back to the splitted classes design. the library as published is just on the half of this way. but i know some limitations. that is the one problem: data BinHandle = forall h . (Stream IO h) => BinH h with such definition, i cannot use for BinHandles any operations that is aside of Stream interface. BinHandles, like in the NewBinary library, can be constructed from files or memory buffers, and memory buffers should support "saveToFile" operation. this operation require the "MemoryStream" interface, what implement by memory buffers, but not by files. in the old implementation, all operations was in the Stream interface, so i can implement "saveToFile" and this operation generated run-time error when used not with memory buffers. now it's inpossible to use it, i need to add second field of type (Maybe MemoryStream) the same problem will appear for all "forall h . Stream h" datatypes - if they need some operations from additional interfaces, then additional fields should be introduced, quantified by these interfaces moreover, splitting the Streams interface will require from the library users to give more classes in defining context for their functions, like the: process :: (Stream IO h, Seekable IO h, Buffered h) => h -> IO () that is not so good, especially if adding new interfaces means slowdown of calls to this function SM> My view is that the most basic level of stream is a byte stream, SM> supporting only two operations: read an array of bytes and write an SM> array of bytes, i.e. vGetBuf/vPutBuf. This makes implementing a stream, SM> or transformer, much easier and shorter. it is exact what is implemented, except for: there are two types of low-level streams. for memory-resident streams it is inefficient to work through getbuf/putbuf operations. MemoryStream interface impelements instead vReceiveBuf operation which just returns address and size of next data block in memory. accordingly, the buffering implementation slightly changes - it is the reason why i implemented FileBuffering.hs (working through GetBuf/PutBuf) and MemoryBuffering.hs (working through ReceiveBuf/SendBuf) moreover, there are third type of streams - based on the getchar/putchar or getbyte/putbyte operations. example of former is StringBuffer, later - UArray Int Word8 (not implemented, but possible in future) SM> Also I'd like to see separate SM> input/output streams for even more type safety, and I believe SM> simplicity, it will be great! but it is very uneasy and even seems impossible: 1) this will prevent dividing streams into the MemoryStream/BlockStream/ByteStream, what i like you consider as more important. it is impossible to say what InputStream BlockStream implements only vGetBuf, while OutputStream BlockStream implements only vPutBuf operation 2) such division will require to implement 2 or 3 (+ReadWrite) times more Stream types than now. Say, instead of FD we will get InputFD and OutputFD, instead of CharEncoding transformer - two transformers and so on. most of the functionality in Input and Ouput variants will be repeated (because this functionality don't depend on input/output mode) and in addition to the current large lists of passed calls like the: vIsEOF (WithEncoding h _) = vIsEOF h vMkIOError (WithEncoding h _) = vMkIOError h vReady (WithEncoding h _) = vReady h vIsReadable (WithEncoding h _) = vIsReadable h we will get the same lists in 2 or 3 repetitions!!! 3) i don't think that we can completely throw away the r/w streams, they can be required for example for database-style access. and if we need to implement this type of streams, our win in separating implementations of Input and Output streams will become a loss moreover, difference between input and output streams are well-known and errors in this area can be easily spotted by the users. so i think that such division would be great, but it requires too much work and will essentially compilcate the library. when the Haskell class system will be essenially improved, it will have sense. differences between MemoryStream, ByteStream and so on is not so obvious (because it's specific to this library), so dividing them should help users to spot errors earlier SM> but this is less important than separating byte streams from SM> text streams. I believe certain other operations would benefit from SM> being moved into separate classes: eg. vSeek into a Seekable class, SM> vSetBuffering into a Buffered class, and so on. at this moment the following classes exists: Stream BlockStream (implements vGetBuf/vPutBuf, used for FD) MemoryStream (implements vReceiveBuf/vSendBuf, used for MMFile and MemBuf) ByteStream (implements vGetByte/vPutByte, vGetChar/vPutChar and other text i/o operations) Stream class implements all other operations (just now it implements everything, but i will move the methods). i agree about moving seek/tell operations into the separate class, but not sure about buffering - its absence don't bother anyone and using the separate class will just create problems. moreover, vSetBuffering is not available for buffered MemoryStream. i prefer to distinguish buffered streams by implementation of ByteStream interface - this interface tells that byte- and text-oriented i/o is available, irrespective of concrete implementation why ByteStream implements both the byte and text i/o? i think that in most cases people are still using latin-1 text i/o - i.e. each char is just 8 bits without any encoding. because that type of text i/o don't need any complex implememtation, each time when byte i/o is implemented, text i/o springs automatically. on the other side, utf-8 encoding is rare and therefore separate transformer is used to implement it - it transforms each char i/o call into several byte i/o calls. i definitely against implementing text i/o only through the encoding transformer because it will slowdown the i/o while in 90% cases encoding will not be used. moreover, there are a 3 Stream types, developed by John Goerzen, where Char is used as minimal unit of stream data. this implementation allows us to use these streams to carry full unicode char in each Char or use each Char as the 8-bit-only container. if you, knowing all these, still recommend to change something, i'm all ears :) SM> This will improve SM> performance too - your Stream class has dictionaries with 20+ elements. here you are king - i don't know whether it's better to have one class with 20 methods or 2 classes with 5 methods each in the function context? SM> I see that buffering works on vPutChar/vGetChar, and yet you seem to be SM> buffering bytes - which is it? Are you supposed to buffer before or SM> after doing character encoding? It seems before, because otherwise SM> buffering will strip out all but the low 8 bits of each character. SM> Using a more explicit type structure would help a lot here. buffer contains bytes, which are read/written by the getbyte/putbyte operations as well as the all text i/o. this is latin1-only solution, of course. if one need utf-8 encoding, he need to apply CharEncoding transformer SM> Incedentally, I'm suprised that you can use list-based character SM> encoding/decoding and still get good performance, I expected to need to SM> do encoding directly between buffers. i reported only the speed of the buffering transformers. this don't include speed of char encoding that should be very low at this time. i want to try monadic operations here, and expect to get reasonable performance, 20-30 mb/s: -- | Read UTF-8 encoded char using `action` to get each byte utf8Decode :: (Monad m) => m Word8 -> m Char utf8Decode action = do c <- action if c < 0x80 then return (chr c) .... -- | Write UTF-8 encoded char using `action` to put each byte utf8Encode :: (Monad m) => Char -> (Word8 -> m ()) -> m () utf8Encode c action | ord c < 0x80 = action (ord c) .... vGetChar h = utf8Decode (vGetByte h) vPutChar h c = utf8Encode c (vPutByte h) what you will say about this plan? implementing this is much easier than using iconv (although you are already implemented iconv usage), and moreover it don't need presence of iconv. moreover, these monadic converters is anyway required for the "instance Binary Char" implementation SM> Still, as I said, I think the general approach is excellent, and is SM> definitely heading in the right direction. SM> Oh, and some of the code is GPL'd, which is a problem for incorporation SM> in standard libraries. This is just something to bear in mind if the SM> aim is for this to be a candidate for a/the standard IO library. i will ask John Goerzen about this -- Best regards, Bulat mailto:bulatz@HotPOP.com

Bulat Ziganshin
i reported only the speed of the buffering transformers. this don't include speed of char encoding that should be very low at this time.
Recoding will be slow if it's done on top of buffering and if encoding itself has heavy startup. Buffering should be on the very top, so it amortizes the cost of starting the recoder. It should be possible to use iconv for recoding. Iconv works on blocks and it should not be applied to one character at a time. Byte streams and character streams should be distinguished in types, preferably by class-constrained parametric polymorphism. In particular byte buffers and char buffers should be reperesented differently, so block copying between byte streams moves whole blocks of memory. I have designed and implemented these issues for my language Kogut, and now I'm trying to port them to Haskell. Static typing gets in the way in various places (hiding the type behind an existentially qualified type, passing optional named arguments), but it looks doable. The design uses buffers internally. A buffer is a queue of characters or bytes, with amortized O(1) cost of operating at an end, and fast block operations. A buffer itself is a stream too (reading eats its contents from the beginning, writing appends to the end). -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

Hello Marcin, Sunday, February 12, 2006, 8:04:26 PM, you wrote:
i reported only the speed of the buffering transformers. this don't include speed of char encoding that should be very low at this time.
MQK> Recoding will be slow if it's done on top of buffering and if encoding MQK> itself has heavy startup. Buffering should be on the very top, so it MQK> amortizes the cost of starting the recoder. MQK> It should be possible to use iconv for recoding. Iconv works on blocks MQK> and it should not be applied to one character at a time. recoding don't need any startup. each vGetChar or vPutChar just executes one or more vGetByte/vPutByte calls, according to encoding rules. this should be fast enough MQK> Byte streams and character streams should be distinguished in types, MQK> preferably by class-constrained parametric polymorphism. In particular so that vGetBuf, vGetChar, and getWord32 can't be used at the same stream? MQK> byte buffers and char buffers should be reperesented differently, MQK> so block copying between byte streams moves whole blocks of memory. in my lib, block copying between streams can be performed only at the byte level. if you need to copy N chars with non-trivial encoding, or need to copy chars with recoding, you should use "vGetChar h1 >>= vPutChar h2" operation repeated MQK> I have designed and implemented these issues for my language Kogut, MQK> and now I'm trying to port them to Haskell. MQK> Static typing gets in the way in various places (hiding the type MQK> behind an existentially qualified type, passing optional named MQK> arguments), but it looks doable. MQK> The design uses buffers internally. A buffer is a queue of characters MQK> or bytes, with amortized O(1) cost of operating at an end, and fast MQK> block operations. A buffer itself is a stream too (reading eats its MQK> contents from the beginning, writing appends to the end). i use different design. we can compare speed/features/robustness/easy of adding new features of our libraries. you can see my lib at: You can find further information about the library at the page http://haskell.org/haskellwiki/Library/Streams and download it as http://freearc.narod.ru/Streams.tar.gz -- Best regards, Bulat mailto:bulatz@HotPOP.com

Bulat Ziganshin
MQK> It should be possible to use iconv for recoding. Iconv works on MQK> blocks and it should not be applied to one character at a time.
recoding don't need any startup.
Calling iconv (or other similar routine) does need startup. And you really don't want to reimplement all encoders/decoders by hand in Haskell. Processing a stateful encoding needs the time to pick up the state and convert the materialized state into a form used during recoding Dispatching to the encoding function (usually not known statically) takes time. When we generically convert an encoder which fails for invalid data, to an encoder which replaces invalid data with U+FFFD or question marks, setting up exception handlers takes time. These are all little times, but they can be avoided. Converting newlines takes time, and it's very similar to character recoding. It should be done transparently; network protocols often use CR-LF newlines, and it's painful to remember to output a '\r' before every newline by hand. It should be done on top of character recoding; consider UTF-16, where newline conversion works in terms of characters rather than bytes. Some conversions can be implemented with tight loops which keep data in machine registers. The tightness matters when there are many iterations; loop startup is amortized by buffering. Buffering can provide arbitrarily far lookahead, arbitrarily long putback, and checking for end of stream while logically not moving the current position. But this works only if buffering is the last stage which changes stream contents.
MQK> Byte streams and character streams should be distinguished in types, MQK> preferably by class-constrained parametric polymorphism. In particular
so that vGetBuf, vGetChar, and getWord32 can't be used at the same stream?
You can get bytes from a given byte stream, and get bytes from a character stream put on top of that byte stream. Buf if the protocol mixes bytes with characters and is specified in terms of bytes, it's probably better to work in terms of bytes, and convert byte strings to character strings after determining where they end. -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

Bulat Ziganshin
recoding don't need any startup. each vGetChar or vPutChar just executes one or more vGetByte/vPutByte calls, according to encoding rules. this should be fast enough
Hmm, your interface for the encoder (String -> [Word8]) doesn't support stateful encodings at all, like ISO-2022-JP. -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

Bulat Ziganshin wrote:
Wednesday, February 08, 2006, 2:58:30 PM, you wrote: SM> I would prefer to see more type structure, rather than putting SM> everything in the Stream class. You have classes ByteStream, SM> BlockStream etc, but these are just renamings of the Stream class. There SM> are many compositions that are illegal, but we don't find out until SM> runtime; it would make a lot more sense to me to expose this structure SM> in the type system.
i initially used normal splitted classes (vGetBuf was in BlockStream) and so on, but come accross problems with the type classes system and decided to simplify the design. now i feel himself more confident with the classes, feel that i know source of my previous problems and therefore slowly migrate back to the splitted classes design. the library as published is just on the half of this way. but i know some limitations. that is the one problem:
data BinHandle = forall h . (Stream IO h) => BinH h
One possibility is something like this: data BinHandle = forall h . (Stream IO h, Typeable h) => BinH h then you can recover the original stream type (by guessing what it is). Or there are other solutions - adding an extra field to BinH, or separating the BinH constructor into two, one with a MemoryStream and one without. What's interesting is that you're saying you really want dynamic typing here - you don't want to distinguish different types of BinHandle, instead you want the saveToFile operation to fail at runtime if the wrong kind of stream is used. It's slightly strange to use dynamic typing here when the rest of the library would be using static typing, so it might be worthwhile considering static typing solutions instead: don't use an existential here, just add h as a parameter of BinHandle. This does mean you have to add Stream predicates a lot of places, though. Someday (soon, I hope), GHC will let you say data BinHandle h where BinH :: Stream IO h => BinH h but it doesn't work right now (or at least, it doesn't do what you want).
moreover, splitting the Streams interface will require from the library users to give more classes in defining context for their functions, like the:
process :: (Stream IO h, Seekable IO h, Buffered h) => h -> IO ()
that is not so good, especially if adding new interfaces means slowdown of calls to this function
you can combine multiple classes with a dummy superclass. If we get class synonyms (see Haskell' proposal) this will get easier. Performance of the example above might actually be better than having a single Stream class, depending on how much dictionary *building* needs to happen. In your library, every time an overloaded Stream function is called, it must be passed a dictionary for Stream, which is a tuple with 20+ elements. These dictionaries will probably be built at runtime, because of the superclass structure (the compiler usually won't be able to predict what layering of stream transformers will be used, and hence what dictionaries will be needed). You can provide some specialisations to help - SPECIALISE INSTANCE should be useful here. The point is that the performance implications aren't obvious, it depends a lot on how much sharing of dictionaries happens.
SM> Also I'd like to see separate SM> input/output streams for even more type safety, and I believe SM> simplicity,
it will be great! but it is very uneasy and even seems impossible:
1) this will prevent dividing streams into the MemoryStream/BlockStream/ByteStream, what i like you consider as more important. it is impossible to say what InputStream BlockStream implements only vGetBuf, while OutputStream BlockStream implements only vPutBuf operation
I don't think so - you just have InputByteStream/OutputByteStream classes, and similarly for the others.
2) such division will require to implement 2 or 3 (+ReadWrite) times more Stream types than now. Say, instead of FD we will get InputFD and OutputFD, instead of CharEncoding transformer - two transformers and so on. most of the functionality in Input and Ouput variants will be repeated (because this functionality don't depend on input/output mode) and in addition to the current large lists of passed calls like the:
vIsEOF (WithEncoding h _) = vIsEOF h vMkIOError (WithEncoding h _) = vMkIOError h vReady (WithEncoding h _) = vReady h vIsReadable (WithEncoding h _) = vIsReadable h
we will get the same lists in 2 or 3 repetitions!!!
the common operations should be members of a separate superclass. I have in mind a structure like this: class Stream h where streamEOF :: h -> IO Bool streamReady :: h -> IO Bool streamClose :: h -> IO () class InputByteStream h where streamGet :: h -> IO Word8 ... class InputBlockStream h where streamGetBuf :: h -> Int -> Ptr Word8 -> IO Int ... class InputMemoryStream h where streamGetMem :: h -> IO (Ptr Word8) ... there's no duplication, just more structure. By the way, I like your idea of exposing the difference between MemoryStream and ByteStream. I'm not so sure about the difference between BlockStream and ByteStream - I think BlockStream should be the lowest level, and all the ByteStream operations can be provided on BlockStreams (or MemoryStremas) by reading/writing one byte at a time.
3) i don't think that we can completely throw away the r/w streams, they can be required for example for database-style access.
This doesn't stop you from having read/write streams, but it means that you could implement read-only and write-only buffering without the complication that comes with the possibility of read/write. You have to implement read/write buffering separately from read-only and write-only buffering. For example, you could have an InOutBufferedStream transformer that layers on top of two underlying buffered streams, and remembers which one was used last. If an operation occurs on the other one, then the in-use buffer is flushed first. This means you only pay the penalty of checking & flushing when you need to use this read/write transformer, rather than in every buffered stream.
why ByteStream implements both the byte and text i/o? i think that in most cases people are still using latin-1 text i/o - i.e. each char is just 8 bits without any encoding. because that type of text i/o don't need any complex implememtation, each time when byte i/o is implemented, text i/o springs automatically. on the other side, utf-8 encoding is rare and therefore separate transformer is used to implement it - it transforms each char i/o call into several byte i/o calls.
i definitely against implementing text i/o only through the encoding transformer because it will slowdown the i/o while in 90% cases encoding will not be used.
I don't think this is the way to go. We should assume that text encoding/decoding is the norm, rather than optimising (heavily) for Latin-1. By all means have a specialised Latin-1 stream transformer, which can be very efficient, but don't build it into the byte stream class. Byte streams should deal in bytes, not Chars. UTF-8 might be rare at the moment, but it will be the norm soon. The library should provide fast text encoding/decoding, which means doing it via buffers, and probably using iconv, since we don't want to re-implement the encodings ourselves - for example what happens with UTF-8 encoding errors in your implementation? This gives us a dilemna - since encoding and decoding needs to operate directly on buffers, it needs direct access to the buffer. So it looks like encoding/decoding and buffering must be combined (this is what I did in my library). But, you don't want to add this buffering to a memory stream, so perhaps buffered streams should be instances of MemoryStreams, and text encoding/decoding should layer on MemoryStreams only? (is this what you did?)
SM> This will improve SM> performance too - your Stream class has dictionaries with 20+ elements.
here you are king - i don't know whether it's better to have one class with 20 methods or 2 classes with 5 methods each in the function context?
see above - I'd like to see some measurements here.
SM> I see that buffering works on vPutChar/vGetChar, and yet you seem to be SM> buffering bytes - which is it? Are you supposed to buffer before or SM> after doing character encoding? It seems before, because otherwise SM> buffering will strip out all but the low 8 bits of each character. SM> Using a more explicit type structure would help a lot here.
buffer contains bytes, which are read/written by the getbyte/putbyte operations as well as the all text i/o. this is latin1-only solution, of course. if one need utf-8 encoding, he need to apply CharEncoding transformer
see above - I think this is wrong. Cheers, Simon

Am Dienstag, 21. Februar 2006 13:06 schrieb Simon Marlow:
If we get class synonyms (see Haskell' proposal) this will get easier.
This raises two questions in me: 1. Is this the thing, John Meacham proposed some time ago? 2. What is the URL of the respective Haskell' proposal?
[...]
Best wishes, Wolfgang
participants (4)
-
Bulat Ziganshin
-
Marcin 'Qrczak' Kowalczyk
-
Simon Marlow
-
Wolfgang Jeltsch