
oleg@pobox.com writes:
The discussion of i18n i/o highlighted the need for general overlay streams. We should be able to place a processing layer onto a handle -- and to peel it off and place another one. The layers can do character encoding, subranging (limiting the stream to the specified number of basic units), base64 and other decoding, signature collecting and verification, etc.
My language Kogut http://kokogut.sourceforge.net/ uses the following types:

BYTE_INPUT - abstract supertype of a stream from which bytes can be read
CHAR_INPUT, BYTE_OUTPUT, CHAR_OUTPUT - analogously

The above types support i/o in blocks only (an array of bytes / chars at a
time). In particular resizable byte arrays and character arrays are input
and output streams.

BYTE_INPUT_BUFFER - transforms a BYTE_INPUT into another BYTE_INPUT,
   providing buffering, unlimited lookahead and unlimited "unreading"
   (putback)
CHAR_INPUT_BUFFER - analogously; in addition provides functions which read
   a line at a time
BYTE_OUTPUT_BUFFER - transforms a BYTE_OUTPUT into another BYTE_OUTPUT,
   providing buffering and explicit flushing
CHAR_OUTPUT_BUFFER - analogously; in addition provides optional automatic
   flushing after outputting complete lines

The above types provide i/o in blocks and in individual characters, and in
lines for character buffers. They should be used as the last component of a
stack.

BYTE_FILTER - defines how a sequence of bytes is transformed into another
   sequence of bytes, by providing a function which transforms a block at a
   time; it consumes some part of the input, produces some part of the
   output, and tells whether it stopped because it wants more input or
   because it wants more room in the output; throws an exception on errors
CHAR_FILTER - analogously, but for characters
ENCODER - analogously, but transforms characters into bytes
DECODER - analogously, but transforms bytes into characters

The above are only auxiliary types which just perform the conversion on a
block; they are not streams.

BYTE_INPUT_FILTER - a byte input which uses another byte input and applies
   a byte filter to each block read
CHAR_INPUT_FILTER - a char input which uses another char input and applies
   a char filter to each block read
INPUT_DECODER - a char input which uses a byte input and applies a decoder
   to each block read

The above types support i/o in blocks only.
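The BYTE_INPUT_BUFFER idea -- a block-only byte source underneath, with buffering, unlimited lookahead and putback layered on top -- can be sketched in Python. This is a minimal illustration, not Kogut's implementation; the class and method names are invented for the example:

```python
import io

class ByteInputBuffer:
    """Sketch: wraps a block-only byte source (anything with read(n))
    and adds unlimited lookahead (peek) and unreading (unread)."""

    def __init__(self, source, block_size=8192):
        self.source = source
        self.block_size = block_size
        self.buffer = bytearray()   # bytes read ahead but not yet consumed

    def _fill(self, n):
        # Pull blocks from the source until n bytes are buffered or EOF.
        while len(self.buffer) < n:
            block = self.source.read(self.block_size)
            if not block:
                break
            self.buffer += block

    def peek(self, n):
        """Look ahead up to n bytes without consuming them."""
        self._fill(n)
        return bytes(self.buffer[:n])

    def read(self, n):
        self._fill(n)
        data = bytes(self.buffer[:n])
        del self.buffer[:n]
        return data

    def unread(self, data):
        """Push bytes back; the next read sees them first."""
        self.buffer[:0] = data

# usage
buf = ByteInputBuffer(io.BytesIO(b"hello world"))
assert buf.peek(5) == b"hello"
assert buf.read(5) == b"hello"
buf.unread(b"HELLO")
assert buf.read(11) == b"HELLO world"
```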
BYTE_OUTPUT_FILTER, CHAR_OUTPUT_FILTER, OUTPUT_ENCODER - analogously, but
   for output
ENCODING - a supertype which denotes an encoding in an abstract way.
   STRING is one of its subtypes (it would be an "instance" in Haskell)
   which currently means an iconv-implemented encoding. There are also
   singleton types for important encodings implemented directly. There is
   a function which yields a new (stateful) encoder from an encoding, and
   another which yields a decoder, but an encoding is what is used as an
   optional argument to the function which opens a file or converts
   between a standalone string and a byte array.
REPLACE_CODING_ERRORS - transforms an encoding into a related encoding
   which substitutes U+FFFD on decoding, and '?' on encoding, instead of
   throwing an exception on error. A similar transformer which e.g.
   produces &#12345; for unencodable characters could be written too (not
   implemented yet).
COPYING_FILTER - a filter which dumps the data passed through it to
   another output stream
APPEND_INPUT - concatenates several input streams into one
NULL_OUTPUT - /dev/null

The above types come in BYTE and CHAR flavors.

FLUSHING_OTHER - a byte input which reads data from another byte input,
   but flushes some specified output stream before each input operation;
   it's used at the *bottom* of the stdin stack and flushes the *top* of
   the stdout stack, so alternating input and output on stdin/stdout comes
   in the right order even if partial lines are output and without
   explicit flushing
RAW_FILE - a byte input and output at the same time, a direct interface to
   the OS

Some functions and other values:

TextReader - transforms a byte input into a character input by stacking a
   decoder (for the specified or default encoding), a filter for newlines
   (not implemented yet), and a char input buffer (with the specified or
   default buffer size)
TextWriter - analogously, for output
OpenRawFile, CreateRawFile - open a raw file handle, with various options
   (read, write, create, truncate, exclusive, append, mode).
OpenTextFile - a composition of OpenRawFile and TextReader which splits the
   optional arguments between the two, depending on where they apply
CreateTextFile - a composition of CreateRawFile and TextWriter
BinaryReader, BinaryWriter - only do buffering; they have a slightly
   different interface than ByteInputBuffer and ByteOutputBuffer
OpenBinaryFile, CreateBinaryFile - analogously
RawStdIn, RawStdOut, RawStdErr - raw files
StdOut - RawStdOut, transformed by TextWriter with automatic flushing after
   lines turned on (it's normally off by default)
StdErr - similar
StdIn - RawStdIn, transformed by FlushingOther on StdOut, transformed by
   TextReader

At program exit StdOut and StdErr are flushed automatically.

Some of these types would correspond to classes in Haskell, together with a
type with an existential quantifier. Representing streams as records of
functions is not sufficient, because a given type of streams may offer
additional operations not provided by the generic interface. Byte and char
versions would often be parametrized instead of using separate types. I
didn't perform an on-the-fly translation of this description to Haskell
idioms, to avoid errors.
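For comparison, Python's codecs module provides a close analogue of the DECODER and REPLACE_CODING_ERRORS types described above: an incremental decoder is a stateful block-at-a-time converter which may consume only part of its input, and the "replace" error handler substitutes U+FFFD / '?' instead of raising. A small demonstration:

```python
import codecs

# An incremental decoder is a stateful block-at-a-time converter, like
# DECODER above: it may consume only part of its input, keeping any
# trailing incomplete byte sequence until the next block arrives.
decode = codecs.getincrementaldecoder("utf-8")("strict")

data = "\u017c\u00f3\u0142w".encode("utf-8")  # "zolw" with diacritics, 7 bytes
first = decode.decode(data[:3])               # block ends mid-character
rest = decode.decode(data[3:], final=True)    # remaining bytes complete it
assert first + rest == "\u017c\u00f3\u0142w"

# The "replace" error handler plays the role of REPLACE_CODING_ERRORS:
# U+FFFD for undecodable input, '?' for unencodable characters.
lenient = codecs.getincrementaldecoder("utf-8")("replace")
assert lenient.decode(b"\xff", final=True) == "\ufffd"
assert "\u015bnieg".encode("ascii", "replace") == b"?nieg"
```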
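Python's io module also happens to implement essentially the stack described above -- a raw byte stream at the bottom, a buffering layer, and a decoding, line-aware text layer on top -- so a TextReader-style composition can be sketched over an in-memory stream:

```python
import io

# Bottom of the stack: a raw byte source (stands in for RAW_FILE).
raw = io.BytesIO("pierwsza linia\ndruga linia\n".encode("utf-8"))

# Buffering layer, then a decoding text layer on top -- roughly the
# decoder + newline filter + char input buffer that TextReader stacks.
buffered = io.BufferedReader(raw)
text = io.TextIOWrapper(buffered, encoding="utf-8", newline=None)

assert text.readline() == "pierwsza linia\n"
assert text.readline() == "druga linia\n"

# Layers can be peeled off again: detach() removes the text layer and
# returns the underlying buffered byte stream.
assert text.detach() is buffered
```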
HTTP/1.1 200 Have it
Content-type: text/plain; charset=iso-2022-jp
Content-length: 12345
Date: Tuesday, August 13, 2002
<empty-line>
To read the response line and the content headers, our stream must be in an
ASCII, Latin-1 or UTF-8 encoding (regardless of the current locale). The
body of the message is encoded in iso-2022-jp.
It's tricky to implement that using my scheme, because decoding is
performed before buffering, so if we read it line by line and reach the end
of the headers, a part of the data has already been read and decoded using
the wrong encoding.

The simplest way is probably to apply buffering, use lookahead (an input
buffer supports the interface of collections for lookahead) to locate the
end of the headers, move the headers into a separate array of bytes leaving
the rest in the buffered stream, put a text reader with the encoding set to
Latin1 on the array with the headers, parse the headers, and put a text
reader with the appropriate encoding on the rest of the stream. This causes
double buffering of the rest of the stream, but avoiding it is harder and
perhaps not worth the effort (it requires peeking into the array used in
the buffers, to concatenate it with the rest of the stream).

This leaves the problem of stopping the conversion after 12345 bytes. If
the data needs to be processed lazily, I would implement a custom stream
type which reads up to a given number of bytes from an underlying stream
and then signals end of data. It would be put below the decoder. After it
finishes, the original stream is left in the right position and can be read
further. If the data doesn't need to be processed lazily, it's simpler: one
can read 12345 bytes into an array and convert them off-line.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
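The scheme above -- parse the status line and headers byte-wise as Latin-1, then put a length-limited stream below a decoder for the declared charset -- can be sketched with Python's standard io stack. `LimitedReader` is a hypothetical helper written for this example, and the response bytes are made up:

```python
import io

class LimitedReader(io.RawIOBase):
    """A raw stream reading at most `limit` bytes from `source`, then
    signalling end of data -- the custom stream type put below the
    decoder. After it finishes, `source` is left just past the body."""

    def __init__(self, source, limit):
        self.source, self.remaining = source, limit

    def readable(self):
        return True

    def readinto(self, b):
        if self.remaining <= 0:
            return 0
        data = self.source.read(min(len(b), self.remaining))
        self.remaining -= len(data)
        b[:len(data)] = data
        return len(data)

# A made-up response with an iso-2022-jp body followed by unrelated data.
body_text = "\u3053\u3093\u306b\u3061\u306f"
body = body_text.encode("iso-2022-jp")
response = (b"HTTP/1.1 200 Have it\r\n"
            b"Content-type: text/plain; charset=iso-2022-jp\r\n"
            b"Content-length: " + str(len(body)).encode("ascii") + b"\r\n"
            b"\r\n" + body + b"trailing data")

stream = io.BufferedReader(io.BytesIO(response))

# Read the status line and headers as bytes, decoding each line as
# Latin-1; no multibyte decoder has touched the stream yet, so none of
# the body gets consumed with the wrong encoding.
status = stream.readline().decode("latin-1").rstrip("\r\n")
headers = {}
while True:
    line = stream.readline().decode("latin-1").rstrip("\r\n")
    if not line:
        break
    name, _, value = line.partition(":")
    headers[name.lower()] = value.strip()

charset = headers["content-type"].rpartition("charset=")[2]
length = int(headers["content-length"])

# Length-limited stream below the decoder; the extra BufferedReader is
# the double buffering of the rest of the stream mentioned earlier.
body_stream = io.TextIOWrapper(
    io.BufferedReader(LimitedReader(stream, length)), encoding=charset)

assert status == "HTTP/1.1 200 Have it"
assert body_stream.read() == body_text
assert stream.read() == b"trailing data"  # stream left in position
```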