
Binary I/O is not specifically a Haskell problem. Other programming systems, for example Scheme, have been struggling with the same issues. The Scheme binary I/O proposal may therefore be of some interest to this group:

    http://srfi.schemers.org/srfi-56/

It is deliberately made to be the easiest to implement and the least controversial. More ambitious proposals exist. The key feature is layered, or stackable, i/o. Enclosed is a justification message that I wrote two years ago for Haskell-Cafe but somehow did not post. An early draft of the ambitious proposal, cast in the context of Scheme, is available here:

    http://pobox.com/~oleg/ftp/Scheme/io.txt

More polished drafts exist, and even a prototype implementation. Unfortunately, once it became clear that the ideas were working out, the motivation fizzled.

The discussion of i18n i/o highlighted the need for general overlay streams. We should be able to place a processing layer onto a handle -- and to peel it off and place another one. The layers can do character encoding, subranging (limiting the stream to the specified number of basic units), base64 and other decoding, signature collecting and verification, etc.

Processing of a response from a web server illustrates a genuine need for such overlayed processing. Suppose we have established a connection to a web server, sent a GET or POST request, and are now reading the response. It starts as follows:

    HTTP/1.1 200 Have it
    Content-type: text/plain; charset=iso-2022-jp
    Content-length: 12345
    Date: Tuesday, August 13, 2002
    <empty-line>

To read the response line and the content headers, our stream must be in an ASCII, Latin-1 or UTF-8 encoding (regardless of the current locale). The body of the message is encoded in iso-2022-jp. This encoding may have nothing to do with the current locale. Furthermore, many character encodings cannot be reliably detected automatically. Therefore, after we read the headers we must forcibly set our stream to use the iso-2022-jp encoding.

ISO-2022-JP is the encoding for Japanese characters [http://www.faqs.org/rfcs/rfc1554.html]. It is a variable-length stateful encoding: the start of the specifically Japanese encoding is indicated by \e$B. After that, the reader should read _two octets_ from the input stream (and pass them to the application as they are).

The server has indicated that it was sending 12345 _octets_ of data. We cannot tell offhand how many _characters_ of data we will read, because of the variable-length encoding. However, we must not even attempt to read the 12346-th octet: HTTP/1.1 connections are, in general, persistent, and we must not read more data than were sent. Otherwise, we deadlock. Therefore, our stream must be able to give us Japanese characters and still count octets. The HTTP stream will not, in general, give an EOF condition at the end of the data.

That is not the only complication. Suppose the web server replied:

    Content-type: text/plain; charset=iso-2022-jp
    Transfer-Encoding: chunked

We should then expect to read a sequence of chunks of the format

    <length> CRLF <body> CRLF

where <length> is a hexadecimal number and <body> is encoded as indicated in the Content-type. Therefore, after we read the header, we should keep our stream in ASCII mode to read the <length> field. After that, we should switch the encoding to ISO-2022-JP. After we have consumed <length> octets, we should switch the stream back to ASCII, verify the trailing CRLF, and read the <length> of the next chunk.
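[As a minimal illustration of the encoding switch described above, here is a sketch using GHC's System.IO (hSetEncoding, latin1, mkTextEncoding). It assumes the platform's iconv actually provides an ISO-2022-JP codec, and it deliberately ignores the octet-counting problem: hGetContents keeps reading until EOF, which is exactly what a persistent connection cannot afford.]

    import System.IO

    -- Read header lines until the blank line that ends the header section.
    readHeaders :: Handle -> IO [String]
    readHeaders h = do
      line <- hGetLine h
      if line == "" || line == "\r"
        then return []                      -- empty line ends the headers
        else (line :) <$> readHeaders h

    readResponse :: Handle -> IO ([String], String)
    readResponse h = do
      hSetEncoding h latin1                 -- headers: Latin-1/ASCII, locale-independent
      headers <- readHeaders h
      jp <- mkTextEncoding "ISO-2022-JP"    -- assumption: supplied by the platform's iconv
      hSetEncoding h jp                     -- forcibly switch the same handle for the body
      body <- hGetContents h                -- caveat: reads until EOF, does NOT stop at
      return (headers, body)                -- the 12345th octet; see the limiting layer below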
The ISO-2022-JP encoding is stateful and wide (a character is formed by several octets). It may well happen that a wide character is split between two chunks: one octet of the character will be in one chunk and the other octets in the following chunk. Therefore, when we switch from the ISO-2022-JP encoding to ASCII and back, we must preserve the state of the encoding.

This is not the end of the story, however. A web server may send us a multi-part reply: a multi-part MIME entity made of several MIME entities, each with its own encoding and transfer modes. None of these encodings has anything to do with the current locale. Therefore, we may need to switch encodings back and forth quite a few times.

Decoding of such complex streams becomes easier if we can overlay different processing layers on a stream. We start with a TCP handle, overlay an ASCII stream and read the headers, then overlay a stream that reads a specified number of units (and returns EOF when it has read that many). On top of the latter we place an ISO-2022-JP decoder. Or we choose a base64 decoder overlaid with a PKCS#7 signed-entity decoder and with a signature verification layer.

OpenSSL is one package that offers i/o overlays and stream composition. Overlaying of parsers, encoders and hash accumulators is very common in that particular domain. I have implemented such a facility in two languages, e.g., to overlay an endian stream on top of a raw stream, then a bit stream and an arithmetic compression stream. In the functional world, the Ensemble system [http://citeseer.nj.nec.com/liu99building.html] supports such stackable i/o.

When using overlaid streams, we should remember that all the layers are synchronized. If we did

    let iso2022_stream = makeiso2022 hFile in body

then the raw hFile is still available in the body. Reading the stream through hFile and iso2022_stream indiscriminately will surely wreak havoc. Clean's unique types are an excellent feature to guard against such mistakes. Exactly the same considerations apply if we were using a TIFF reader or a PNG reader rather than ISO-2022-JP.
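[To make the "overlay a stream that reads a specified number of units and then returns EOF" layer concrete, here is a minimal Haskell sketch. ByteSource and limitBytes are invented names for this illustration, not an existing library.]

    import Data.IORef
    import Data.Word (Word8)

    -- A byte source as a record of functions: Nothing signals end of stream.
    newtype ByteSource = ByteSource { getByte :: IO (Maybe Word8) }

    -- Overlay that passes through at most n octets, then reports EOF,
    -- leaving the underlying source positioned just past those octets.
    limitBytes :: Int -> ByteSource -> IO ByteSource
    limitBytes n src = do
      remaining <- newIORef n
      return $ ByteSource $ do
        k <- readIORef remaining
        if k <= 0
          then return Nothing            -- pretend EOF after n octets
          else do
            mb <- getByte src
            case mb of
              Nothing -> return Nothing
              Just b  -> writeIORef remaining (k - 1) >> return (Just b)

[A decoder layered on top of such a limited source can hand out characters freely without any risk of touching the 12346-th octet, and the underlying source remains correctly positioned for the next chunk.]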

oleg@pobox.com writes:
> The discussion of i18n i/o highlighted the need for general overlay
> streams. We should be able to place a processing layer onto a handle --
> and to peel it off and place another one. The layers can do character
> encoding, subranging (limiting the stream to the specified number of
> basic units), base64 and other decoding, signature collecting and
> verification, etc.
My language Kogut http://kokogut.sourceforge.net/ uses the following types:

BYTE_INPUT - abstract supertype of a stream from which bytes can be read
CHAR_INPUT, BYTE_OUTPUT, CHAR_OUTPUT - analogously

The above types support i/o in blocks only (an array of bytes / chars at a
time). In particular, resizable byte arrays and character arrays are input
and output streams.

BYTE_INPUT_BUFFER - transforms a BYTE_INPUT to another BYTE_INPUT,
   providing buffering, unlimited lookahead and unlimited "unreading"
   (putback)
CHAR_INPUT_BUFFER - analogously; in addition provides functions which read
   a line at a time
BYTE_OUTPUT_BUFFER - transforms a BYTE_OUTPUT to another BYTE_OUTPUT,
   providing buffering and explicit flushing
CHAR_OUTPUT_BUFFER - analogously; in addition provides optional automatic
   flushing after outputting full lines

The above types provide i/o in blocks and in individual characters, and in
lines for character buffers. They should be used as the last component of
a stack.

BYTE_FILTER - defines how a sequence of bytes is transformed to another
   sequence of bytes, by providing a function which transforms a block at
   a time; it consumes some part of the input, produces some part of the
   output, and tells whether it stopped because it wants more input or
   because it wants more room in the output; throws an exception on errors
CHAR_FILTER - analogously, but for characters
ENCODER - analogously, but transforms characters into bytes
DECODER - analogously, but transforms bytes into characters

The above are only auxiliary types which just do the conversion on a
block; they are not streams.

BYTE_INPUT_FILTER - a byte input which uses another byte input and applies
   a byte filter to each block read
CHAR_INPUT_FILTER - a char input which uses another char input and applies
   a char filter to each block read
INPUT_DECODER - a char input which uses a byte input and applies a decoder
   to each block read

The above types support i/o in blocks only.

BYTE_OUTPUT_FILTER, CHAR_OUTPUT_FILTER, OUTPUT_ENCODER - analogously, but
   for output

ENCODING - a supertype which denotes an encoding in an abstract way.
   STRING is one of its subtypes (would be an "instance" in Haskell),
   which currently means an iconv-implemented encoding. There are also
   singleton types for important encodings implemented directly. There is
   a function which yields a new (stateful) encoder from an encoding, and
   another which yields a decoder, but an encoding is what is used as an
   optional argument to the function which opens a file or converts
   between a standalone string and a byte array.
REPLACE_CODING_ERRORS - transforms an encoding into a related encoding
   which substitutes U+FFFD on decoding, and '?' on encoding, instead of
   throwing an exception on error. A similar transformer which e.g.
   produces &#12345; for unencodable characters could be written too (not
   implemented yet).
COPYING_FILTER - a filter which dumps data passed through it to another
   output stream
APPEND_INPUT - concatenates several input streams into one
NULL_OUTPUT - /dev/null

The above types come in BYTE and CHAR flavors.
FLUSHING_OTHER - a byte input which reads data from another byte input,
   but flushes some specified output stream before each input operation;
   it's used at the *bottom* of the stdin stack and flushes the *top* of
   the stdout stack, so alternating input and output on stdin/stdout comes
   in the right order even if partial lines are output and without
   explicit flushing
RAW_FILE - a byte input and output at the same time, a direct interface
   to the OS

Some functions and other values:

TextReader - transforms a byte input into a character input by stacking a
   decoder (for the specified or default encoding), a filter for newlines
   (not implemented yet), and a char input buffer (with the specified or
   default buffer size)
TextWriter - analogously, for output
OpenRawFile, CreateRawFile - open a raw file handle; have various options
   (read, write, create, truncate, exclusive, append, mode)
OpenTextFile - a composition of OpenRawFile and TextReader which splits
   the optional arguments between both, depending on where they apply
CreateTextFile - a composition of CreateRawFile and TextWriter
BinaryReader, BinaryWriter - only do buffering; have a slightly different
   interface than ByteInputBuffer and ByteOutputBuffer
OpenBinaryFile, CreateBinaryFile - analogously
RawStdIn, RawStdOut, RawStdErr - raw files
StdOut - RawStdOut, transformed by TextWriter with automatic flushing
   after lines turned on (it's normally off by default)
StdErr - similar
StdIn - RawStdIn, transformed by FlushingOther on StdOut, then transformed
   by TextReader

At program exit StdOut and StdErr are flushed automatically.

Some of these types would correspond to classes in Haskell, together with
a type with an existential quantifier. Representing streams as records of
functions is not sufficient, because a given type of streams may offer
additional operations not provided by the generic interface. Byte and char
versions would often be parametrized instead of using separate types. I
didn't perform an on-the-fly translation of this description to Haskell
idioms, to avoid errors.
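[As a rough rendering of the remark above about Haskell classes and existential types, here is a sketch of how the block-oriented input types might look. All names are illustrative only, and a real design would use packed arrays rather than lists.]

    {-# LANGUAGE ExistentialQuantification #-}
    import Data.Word (Word8)

    -- Block-oriented byte input, roughly in the spirit of BYTE_INPUT:
    -- a read returns at most the requested number of bytes, [] meaning EOF.
    class ByteInput s where
      readBytes :: s -> Int -> IO [Word8]

    -- Character analogue of CHAR_INPUT.
    class CharInput s where
      readChars :: s -> Int -> IO [Char]

    -- Existential wrappers, so streams of different concrete types can be
    -- stored and passed around uniformly (the "type with an existential
    -- quantifier" mentioned above).
    data SomeByteInput = forall s. ByteInput s => SomeByteInput s
    data SomeCharInput = forall s. CharInput s => SomeCharInput s

    -- An INPUT_DECODER-like layer would then be a CharInput built from a
    -- ByteInput plus a stateful decoder; only the shape is sketched here.
    data InputDecoder = InputDecoder SomeByteInput ([Word8] -> (String, [Word8]))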
> HTTP/1.1 200 Have it
> Content-type: text/plain; charset=iso-2022-jp
> Content-length: 12345
> Date: Tuesday, August 13, 2002
> <empty-line>
> To read the response line and the content headers, our stream must be in
> an ASCII, Latin-1 or UTF-8 encoding (regardless of the current locale).
> The body of the message is encoded in iso-2022-jp.
It's tricky to implement that using my scheme, because decoding is
performed before buffering, so if we read the stream line by line and
reach the end of the headers, a part of the data has already been read and
decoded using the wrong encoding.

The simplest way is probably to apply buffering, use lookahead (an input
buffer supports the interface of collections for lookahead) to locate the
end of the headers, move the headers into a separate array of bytes
leaving the rest in the buffered stream, put a text reader with the
encoding set to Latin1 on the array with the headers, parse the headers,
and then put a text reader with the appropriate encoding on the rest of
the stream. This causes double buffering of the rest of the stream, but
avoiding it is harder and perhaps not worth the effort (it requires
peeking into the array used in the buffers, to concatenate it with the
rest of the stream).

This leaves the problem of stopping the conversion after 12345 bytes. For
that, if the data needs to be processed lazily, I would implement a custom
stream type which reads up to a given number of bytes from an underlying
stream and then signals end of data. It would be put below the decoder.
After it finishes, the original stream is left in the right position and
can be read further. If the data doesn't need to be processed lazily, it's
simpler: one can read 12345 bytes into an array and convert them off-line.

-- 
   __("<         Marcin Kowalczyk
   \__/        qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
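[A minimal sketch of the off-line variant mentioned last, using Data.ByteString. decodeIso2022jp is only a placeholder for whatever ISO-2022-JP codec is actually available, and the Content-Length value is hard-coded here rather than parsed from the headers.]

    import qualified Data.ByteString as B
    import qualified Data.ByteString.Char8 as C

    -- Placeholder only: stands in for a real ISO-2022-JP decoder, which
    -- neither bytestring nor text provides.
    decodeIso2022jp :: B.ByteString -> String
    decodeIso2022jp = C.unpack

    -- Off-line variant: split the already-buffered bytes at the blank line,
    -- treat the header octets as Latin-1/ASCII, and decode exactly the
    -- Content-Length-sized body with the proper codec.
    splitResponse :: B.ByteString -> ([String], String)
    splitResponse raw =
      let (hdrBytes, rest) = B.breakSubstring (C.pack "\r\n\r\n") raw
          headers          = lines (C.unpack hdrBytes)   -- each octet as a Latin-1 char
          bodyLen          = 12345                       -- would come from Content-Length
          body             = decodeIso2022jp (B.take bodyLen (B.drop 4 rest))
      in (headers, body)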