[Haskell-cafe] Re[2]: Streams: the extensible I/O library

10 Feb 2006

      Hello Simon,

Wednesday, February 08, 2006, 2:58:30 PM, you wrote:
SM> I would prefer to see more type structure, rather than putting
SM> everything in the Stream class.  You have classes ByteStream, 
SM> BlockStream etc, but these are just renamings of the Stream class. There 
SM> are many compositions that are illegal, but we don't find out until 
SM> runtime; it would make a lot more sense to me to expose this structure 
SM> in the type system.

i initially used normal splitted classes (vGetBuf was in BlockStream)
and so on, but come accross problems with the type classes system and
decided to simplify the design. now i feel himself more confident with
the classes, feel that i know source of my previous problems and
therefore slowly migrate back to the splitted classes design. the
library as published is just on the half of this way. but i know some
limitations. that is the one problem:

data BinHandle = forall h . (Stream IO h) => BinH h

with such definition, i cannot use for BinHandles any operations that
is aside of Stream interface. BinHandles, like in the NewBinary
library, can be constructed from files or memory buffers, and memory
buffers should support "saveToFile" operation. this operation require
the "MemoryStream" interface, what implement by memory buffers, but
not by files. in the old implementation, all operations was in the
Stream interface, so i can implement "saveToFile" and this operation
generated run-time error when used not with memory buffers. now it's
inpossible to use it, i need to add second field of type (Maybe
MemoryStream)

the same problem will appear for all "forall h . Stream h" datatypes -
if they need some operations from additional interfaces, then
additional fields should be introduced, quantified by these interfaces

moreover, splitting the Streams interface will require from the
library users to give more classes in defining context for their
functions, like the:

process :: (Stream IO h, Seekable IO h, Buffered h) => h -> IO ()

that is not so good, especially if adding new interfaces means
slowdown of calls to this function

SM> My view is that the most basic level of stream is a byte stream, 
SM> supporting only two operations: read an array of bytes and write an 
SM> array of bytes, i.e. vGetBuf/vPutBuf.  This makes implementing a stream, 
SM> or transformer, much easier and shorter.

it is exact what is implemented, except for: there are two types of
low-level streams. for memory-resident streams it is inefficient to
work through getbuf/putbuf operations. MemoryStream interface
impelements instead vReceiveBuf operation which just returns address
and size of next data block in memory. accordingly, the buffering
implementation slightly changes - it is the reason why i implemented
FileBuffering.hs (working through GetBuf/PutBuf) and
MemoryBuffering.hs (working through ReceiveBuf/SendBuf)

moreover, there are third type of streams - based on the
getchar/putchar or getbyte/putbyte operations. example of former is
StringBuffer, later - UArray Int Word8 (not implemented, but possible
in future)

SM> Also I'd like to see separate 
SM> input/output streams for even more type safety, and I believe 
SM> simplicity,

it will be great! but it is very uneasy and even seems impossible:

1) this will prevent dividing streams into the
MemoryStream/BlockStream/ByteStream, what i like you consider as more
important. it is impossible to say what InputStream BlockStream
implements only vGetBuf, while OutputStream BlockStream implements
only vPutBuf operation

2) such division will require to implement 2 or 3 (+ReadWrite) times
more Stream types than now. Say, instead of FD we will get InputFD and
OutputFD, instead of CharEncoding transformer - two transformers and
so on. most of the functionality in Input and Ouput variants will be
repeated (because this functionality don't depend on input/output
mode) and in addition to the current large lists of passed calls like
the:

    vIsEOF        (WithEncoding h _) = vIsEOF        h
    vMkIOError    (WithEncoding h _) = vMkIOError    h
    vReady        (WithEncoding h _) = vReady        h
    vIsReadable   (WithEncoding h _) = vIsReadable   h

we will get the same lists in 2 or 3 repetitions!!!

3) i don't think that we can completely throw away the r/w streams,
they can be required for example for database-style access. and if we
need to implement this type of streams, our win in separating
implementations of Input and Output streams will become a loss

moreover, difference between input and output streams are well-known
and errors in this area can be easily spotted by the users. so i think
that such division would be great, but it requires too much work and
will essentially compilcate the library. when the Haskell class system
will be essenially improved, it will have sense.

differences between MemoryStream, ByteStream and so on is not so
obvious (because it's specific to this library), so dividing them
should help users to spot errors earlier

SM> but this is less important than separating byte streams from 
SM> text streams.  I believe certain other operations would benefit from 
SM> being moved into separate classes: eg. vSeek into a Seekable class, 
SM> vSetBuffering into a Buffered class, and so on.

at this moment the following classes exists:

Stream
BlockStream (implements vGetBuf/vPutBuf, used for FD)
MemoryStream (implements vReceiveBuf/vSendBuf, used for MMFile and MemBuf)
ByteStream (implements vGetByte/vPutByte, vGetChar/vPutChar and other
text i/o operations)

Stream class implements all other operations (just now it implements
everything, but i will move the methods). i agree about moving
seek/tell operations into the separate class, but not sure about
buffering - its absence don't bother anyone and using the separate
class will just create problems. moreover, vSetBuffering is not
available for buffered MemoryStream. i prefer to distinguish buffered
streams by implementation of ByteStream interface - this interface
tells that byte- and text-oriented i/o is available, irrespective of
concrete implementation

why ByteStream implements both the byte and text i/o? i think that in
most cases people are still using latin-1 text i/o - i.e. each char is
just 8 bits without any encoding. because that type of text i/o don't
need any complex implememtation, each time when byte i/o is
implemented, text i/o springs automatically. on the other side, utf-8
encoding is rare and therefore separate transformer is used to
implement it - it transforms each char i/o call into several byte i/o
calls.

i definitely against implementing text i/o only through the encoding
transformer because it will slowdown the i/o while in 90% cases
encoding will not be used.

moreover, there are a 3 Stream types, developed by John Goerzen, where
Char is used as minimal unit of stream data. this implementation
allows us to use these streams to carry full unicode char in each Char
or use each Char as the 8-bit-only container.

if you, knowing all these, still recommend to change something, i'm
all ears :)

SM> This will improve 
SM> performance too - your Stream class has dictionaries with 20+ elements.

here you are king - i don't know whether it's better to have one class
with 20 methods or 2 classes with 5 methods each in the function
context?

SM> I see that buffering works on vPutChar/vGetChar, and yet you seem to be 
SM> buffering bytes - which is it?  Are you supposed to buffer before or 
SM> after doing character encoding?  It seems before, because otherwise 
SM> buffering will strip out all but the low 8 bits of each character. 
SM> Using a more explicit type structure would help a lot here.

buffer contains bytes, which are read/written by the getbyte/putbyte
operations as well as the all text i/o. this is latin1-only solution,
of course. if one need utf-8 encoding, he need to apply CharEncoding
transformer

SM> Incedentally, I'm suprised that you can use list-based character 
SM> encoding/decoding and still get good performance, I expected to need to 
SM> do encoding directly between buffers.

i reported only the speed of the buffering transformers. this don't
include speed of char encoding that should be very low at this time.
i want to try monadic operations here, and expect to get reasonable
performance, 20-30 mb/s:

-- | Read UTF-8 encoded char using `action` to get each byte
utf8Decode :: (Monad m) => m Word8 -> m Char
utf8Decode action = do c <- action
                       if c < 0x80 then return (chr c)
                       ....

-- | Write UTF-8 encoded char using `action` to put each byte
utf8Encode :: (Monad m) => Char -> (Word8 -> m ()) -> m ()
utf8Encode c action | ord c < 0x80 = action (ord c)
                    ....

vGetChar h = utf8Decode (vGetByte h)
vPutChar h c = utf8Encode c (vPutByte h)

what you will say about this plan? implementing this is much easier than
using iconv (although you are already implemented iconv usage), and
moreover it don't need presence of iconv. moreover, these monadic
converters is anyway required for the "instance Binary Char" implementation

SM> Still, as I said, I think the general approach is excellent, and is 
SM> definitely heading in the right direction.

SM> Oh, and some of the code is GPL'd, which is a problem for incorporation 
SM> in standard libraries.  This is just something to bear in mind if the 
SM> aim is for this to be a candidate for a/the standard IO library.

i will ask John Goerzen about this

-- 
Best regards,
 Bulat                            mailto:bulatz@HotPOP.com

[Haskell-cafe] Re[2]: Streams: the extensible I/O library

Bulat Ziganshin