
Binary I/O is not specifically a Haskell problem. Other programming systems, for example Scheme, have been struggling with the same issues. The Scheme binary I/O proposal may therefore be of some interest to this group:

    http://srfi.schemers.org/srfi-56/

It is deliberately made to be the easiest to implement and the least controversial. More ambitious proposals exist. The key feature is layered, or stackable, i/o. Enclosed is a justification message that I wrote two years ago for Haskell-Cafe but somehow did not post. An early draft of the ambitious proposal, cast in the context of Scheme, is available here:

    http://pobox.com/~oleg/ftp/Scheme/io.txt

More polished drafts exist, and even a prototype implementation. Unfortunately, once it became clear that the ideas were working out, the motivation fizzled.

The discussion of i18n i/o highlighted the need for general overlay streams. We should be able to place a processing layer onto a handle -- and to peel it off and place another one. The layers can do character encoding, subranging (limiting the stream to the specified number of basic units), base64 and other decoding, signature collecting and verification, etc.

Processing of a response from a web server illustrates a genuine need for such overlayed processing. Suppose we have established a connection to a web server, sent a GET or POST request, and are now reading the response. It starts as follows:

    HTTP/1.1 200 Have it
    Content-type: text/plain; charset=iso-2022-jp
    Content-length: 12345
    Date: Tuesday, August 13, 2002
    <empty-line>

To read the response line and the content headers, our stream must be in an ASCII, Latin-1 or UTF-8 encoding (regardless of the current locale). The body of the message is encoded in iso-2022-jp. This encoding may have nothing to do with the current locale. Furthermore, many character encodings cannot be reliably detected automatically. Therefore, after we read the headers we must forcibly set our stream to use the iso-2022-jp encoding.

ISO-2022-JP is the encoding for Japanese characters [http://www.faqs.org/rfcs/rfc1554.html]. It is a variable-length stateful encoding: the start of the specifically Japanese encoding is indicated by \e$B. After that, the reader should read _two octets_ from the input stream (and pass them to the application as they are).

The server has indicated that it was sending 12345 _octets_ of data. We cannot tell offhand how many _characters_ of data we will read, because of the variable-length encoding. However, we must not even attempt to read the 12346-th octet: HTTP/1.1 connections are, in general, persistent, and we must not read more data than were sent. Otherwise, we deadlock. Therefore, our stream must be able to give us Japanese characters and still count octets. The HTTP stream will not, in general, give an EOF condition at the end of the data.

That is not the only complication. Suppose the web server replied:

    Content-type: text/plain; charset=iso-2022-jp
    Transfer-Encoding: chunked

We should then expect to read a sequence of chunks of the format

    <length> CRLF <body> CRLF

where <length> is a hexadecimal number and <body> is encoded as indicated in the Content-type. Therefore, after we read the header, we should keep our stream in ASCII mode to read the <length> field. After that, we should switch the encoding to ISO-2022-JP. After we have consumed <length> octets, we should switch the stream back to ASCII, verify the trailing CRLF, and read the <length> of the next chunk.
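[As a minimal illustration of the encoding switch described above, here is a sketch using GHC's System.IO (hSetEncoding, latin1, mkTextEncoding). It assumes the platform's iconv actually provides an ISO-2022-JP codec, and it deliberately ignores the octet-counting problem: hGetContents keeps reading until EOF, which is exactly what a persistent connection cannot afford.]

    import System.IO

    -- Read header lines until the blank line that ends the header section.
    readHeaders :: Handle -> IO [String]
    readHeaders h = do
      line <- hGetLine h
      if line == "" || line == "\r"
        then return []                      -- empty line ends the headers
        else (line :) <$> readHeaders h

    readResponse :: Handle -> IO ([String], String)
    readResponse h = do
      hSetEncoding h latin1                 -- headers: Latin-1/ASCII, locale-independent
      headers <- readHeaders h
      jp <- mkTextEncoding "ISO-2022-JP"    -- assumption: supplied by the platform's iconv
      hSetEncoding h jp                     -- forcibly switch the same handle for the body
      body <- hGetContents h                -- caveat: reads until EOF, does NOT stop at
      return (headers, body)                -- the 12345th octet; see the limiting layer below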
The ISO-2022-JP encoding is stateful and wide (a character is formed by several octets). It may well happen that a wide character is split between two chunks: one octet of the character will be in one chunk and the other octets in the following chunk. Therefore, when we switch from the ISO-2022-JP encoding to ASCII and back, we must preserve the state of the encoding.

This is not the end of the story, however. A web server may send us a multi-part reply: a multi-part MIME entity made of several MIME entities, each with its own encoding and transfer modes. None of these encodings has anything to do with the current locale. Therefore, we may need to switch encodings back and forth quite a few times.

Decoding of such complex streams becomes easier if we can overlay different processing layers on a stream. We start with a TCP handle, overlay an ASCII stream and read the headers, then overlay a stream that reads a specified number of units (and returns EOF when it has read that many). On top of the latter we place an ISO-2022-JP decoder. Or we choose a base64 decoder overlaid with a PKCS#7 signed-entity decoder and with a signature verification layer.

OpenSSL is one package that offers i/o overlays and stream composition. Overlaying of parsers, encoders and hash accumulators is very common in that particular domain. I have implemented such a facility in two languages, e.g., to overlay an endian stream on top of a raw stream, then a bit stream and an arithmetic compression stream. In the functional world, the Ensemble system [http://citeseer.nj.nec.com/liu99building.html] supports such stackable i/o.

When using overlaid streams, we should remember that all the layers are synchronized. If we did

    let iso2022_stream = makeiso2022 hFile in body

then the raw hFile is still available in the body. Reading the stream through hFile and iso2022_stream indiscriminately will surely wreak havoc. Clean's unique types are an excellent feature to guard against such mistakes. Exactly the same considerations apply if we were using a TIFF reader or a PNG reader rather than ISO-2022-JP.
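[To make the "overlay a stream that reads a specified number of units and then returns EOF" layer concrete, here is a minimal Haskell sketch. ByteSource and limitBytes are invented names for this illustration, not an existing library.]

    import Data.IORef
    import Data.Word (Word8)

    -- A byte source as a record of functions: Nothing signals end of stream.
    newtype ByteSource = ByteSource { getByte :: IO (Maybe Word8) }

    -- Overlay that passes through at most n octets, then reports EOF,
    -- leaving the underlying source positioned just past those octets.
    limitBytes :: Int -> ByteSource -> IO ByteSource
    limitBytes n src = do
      remaining <- newIORef n
      return $ ByteSource $ do
        k <- readIORef remaining
        if k <= 0
          then return Nothing            -- pretend EOF after n octets
          else do
            mb <- getByte src
            case mb of
              Nothing -> return Nothing
              Just b  -> writeIORef remaining (k - 1) >> return (Just b)

[A decoder layered on top of such a limited source can hand out characters freely without any risk of touching the 12346-th octet, and the underlying source remains correctly positioned for the next chunk.]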

oleg@pobox.com writes:
> The discussion of i18n i/o highlighted the need for general overlay
> streams. We should be able to place a processing layer onto a handle --
> and to peel it off and place another one. The layers can do character
> encoding, subranging (limiting the stream to the specified number of
> basic units), base64 and other decoding, signature collecting and
> verification, etc.
My language Kogut http://kokogut.sourceforge.net/ uses the following types:

BYTE_INPUT - abstract supertype of a stream from which bytes can be read
CHAR_INPUT, BYTE_OUTPUT, CHAR_OUTPUT - analogously

The above types support i/o in blocks only (an array of bytes / chars at a
time). In particular, resizable byte arrays and character arrays are input
and output streams.

BYTE_INPUT_BUFFER - transforms a BYTE_INPUT to another BYTE_INPUT,
   providing buffering, unlimited lookahead and unlimited "unreading"
   (putback)
CHAR_INPUT_BUFFER - analogously; in addition provides functions which read
   a line at a time
BYTE_OUTPUT_BUFFER - transforms a BYTE_OUTPUT to another BYTE_OUTPUT,
   providing buffering and explicit flushing
CHAR_OUTPUT_BUFFER - analogously; in addition provides optional automatic
   flushing after outputting full lines

The above types provide i/o in blocks and in individual characters, and in
lines for character buffers. They should be used as the last component of
a stack.

BYTE_FILTER - defines how a sequence of bytes is transformed to another
   sequence of bytes, by providing a function which transforms a block at
   a time; it consumes some part of the input, produces some part of the
   output, and tells whether it stopped because it wants more input or
   because it wants more room in the output; throws an exception on errors
CHAR_FILTER - analogously, but for characters
ENCODER - analogously, but transforms characters into bytes
DECODER - analogously, but transforms bytes into characters

The above are only auxiliary types which just do the conversion on a
block; they are not streams.

BYTE_INPUT_FILTER - a byte input which uses another byte input and applies
   a byte filter to each block read
CHAR_INPUT_FILTER - a char input which uses another char input and applies
   a char filter to each block read
INPUT_DECODER - a char input which uses a byte input and applies a decoder
   to each block read

The above types support i/o in blocks only.

BYTE_OUTPUT_FILTER, CHAR_OUTPUT_FILTER, OUTPUT_ENCODER - analogously, but
   for output

ENCODING - a supertype which denotes an encoding in an abstract way.
   STRING is one of its subtypes (would be an "instance" in Haskell),
   which currently means an iconv-implemented encoding. There are also
   singleton types for important encodings implemented directly. There is
   a function which yields a new (stateful) encoder from an encoding, and
   another which yields a decoder, but an encoding is what is used as an
   optional argument to the function which opens a file or converts
   between a standalone string and a byte array.
REPLACE_CODING_ERRORS - transforms an encoding into a related encoding
   which substitutes U+FFFD on decoding, and '?' on encoding, instead of
   throwing an exception on error. A similar transformer which e.g.
   produces &#12345; for unencodable characters could be written too (not
   implemented yet).
COPYING_FILTER - a filter which dumps data passed through it to another
   output stream
APPEND_INPUT - concatenates several input streams into one
NULL_OUTPUT - /dev/null

The above types come in BYTE and CHAR flavors.
FLUSHING_OTHER - a byte input which reads data from another byte input,
   but flushes some specified output stream before each input operation;
   it's used at the *bottom* of the stdin stack and flushes the *top* of
   the stdout stack, so alternating input and output on stdin/stdout comes
   in the right order even if partial lines are output and without
   explicit flushing
RAW_FILE - a byte input and output at the same time, a direct interface
   to the OS

Some functions and other values:

TextReader - transforms a byte input into a character input by stacking a
   decoder (for the specified or default encoding), a filter for newlines
   (not implemented yet), and a char input buffer (with the specified or
   default buffer size)
TextWriter - analogously, for output
OpenRawFile, CreateRawFile - open a raw file handle; have various options
   (read, write, create, truncate, exclusive, append, mode)
OpenTextFile - a composition of OpenRawFile and TextReader which splits
   the optional arguments between both, depending on where they apply
CreateTextFile - a composition of CreateRawFile and TextWriter
BinaryReader, BinaryWriter - only do buffering; have a slightly different
   interface than ByteInputBuffer and ByteOutputBuffer
OpenBinaryFile, CreateBinaryFile - analogously
RawStdIn, RawStdOut, RawStdErr - raw files
StdOut - RawStdOut, transformed by TextWriter with automatic flushing
   after lines turned on (it's normally off by default)
StdErr - similar
StdIn - RawStdIn, transformed by FlushingOther on StdOut, then transformed
   by TextReader

At program exit StdOut and StdErr are flushed automatically.

Some of these types would correspond to classes in Haskell, together with
a type with an existential quantifier. Representing streams as records of
functions is not sufficient, because a given type of streams may offer
additional operations not provided by the generic interface. Byte and char
versions would often be parametrized instead of using separate types. I
didn't perform an on-the-fly translation of this description to Haskell
idioms, to avoid errors.
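[As a rough rendering of the remark above about Haskell classes and existential types, here is a sketch of how the block-oriented input types might look. All names are illustrative only, and a real design would use packed arrays rather than lists.]

    {-# LANGUAGE ExistentialQuantification #-}
    import Data.Word (Word8)

    -- Block-oriented byte input, roughly in the spirit of BYTE_INPUT:
    -- a read returns at most the requested number of bytes, [] meaning EOF.
    class ByteInput s where
      readBytes :: s -> Int -> IO [Word8]

    -- Character analogue of CHAR_INPUT.
    class CharInput s where
      readChars :: s -> Int -> IO [Char]

    -- Existential wrappers, so streams of different concrete types can be
    -- stored and passed around uniformly (the "type with an existential
    -- quantifier" mentioned above).
    data SomeByteInput = forall s. ByteInput s => SomeByteInput s
    data SomeCharInput = forall s. CharInput s => SomeCharInput s

    -- An INPUT_DECODER-like layer would then be a CharInput built from a
    -- ByteInput plus a stateful decoder; only the shape is sketched here.
    data InputDecoder = InputDecoder SomeByteInput ([Word8] -> (String, [Word8]))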
> HTTP/1.1 200 Have it
> Content-type: text/plain; charset=iso-2022-jp
> Content-length: 12345
> Date: Tuesday, August 13, 2002
> <empty-line>
> To read the response line and the content headers, our stream must be in
> an ASCII, Latin-1 or UTF-8 encoding (regardless of the current locale).
> The body of the message is encoded in iso-2022-jp.
It's tricky to implement that using my scheme, because decoding is
performed before buffering, so if we read the stream line by line and
reach the end of the headers, a part of the data has already been read and
decoded using the wrong encoding.

The simplest way is probably to apply buffering, use lookahead (an input
buffer supports the interface of collections for lookahead) to locate the
end of the headers, move the headers into a separate array of bytes
leaving the rest in the buffered stream, put a text reader with the
encoding set to Latin1 on the array with the headers, parse the headers,
and then put a text reader with the appropriate encoding on the rest of
the stream. This causes double buffering of the rest of the stream, but
avoiding it is harder and perhaps not worth the effort (it requires
peeking into the array used in the buffers, to concatenate it with the
rest of the stream).

This leaves the problem of stopping the conversion after 12345 bytes. For
that, if the data needs to be processed lazily, I would implement a custom
stream type which reads up to a given number of bytes from an underlying
stream and then signals end of data. It would be put below the decoder.
After it finishes, the original stream is left in the right position and
can be read further. If the data doesn't need to be processed lazily, it's
simpler: one can read 12345 bytes into an array and convert them off-line.

-- 
   __("<         Marcin Kowalczyk
   \__/        qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
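[A minimal sketch of the off-line variant mentioned last, using Data.ByteString. decodeIso2022jp is only a placeholder for whatever ISO-2022-JP codec is actually available, and the Content-Length value is hard-coded here rather than parsed from the headers.]

    import qualified Data.ByteString as B
    import qualified Data.ByteString.Char8 as C

    -- Placeholder only: stands in for a real ISO-2022-JP decoder, which
    -- neither bytestring nor text provides.
    decodeIso2022jp :: B.ByteString -> String
    decodeIso2022jp = C.unpack

    -- Off-line variant: split the already-buffered bytes at the blank line,
    -- treat the header octets as Latin-1/ASCII, and decode exactly the
    -- Content-Length-sized body with the proper codec.
    splitResponse :: B.ByteString -> ([String], String)
    splitResponse raw =
      let (hdrBytes, rest) = B.breakSubstring (C.pack "\r\n\r\n") raw
          headers          = lines (C.unpack hdrBytes)   -- each octet as a Latin-1 char
          bodyLen          = 12345                       -- would come from Content-Length
          body             = decodeIso2022jp (B.take bodyLen (B.drop 4 rest))
      in (headers, body)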