implementation of UTF-8 conversion for text I/O: iconv vs hand-made

Hello all,

This letter describes why I think that using a hand-made (de)coder to support UTF-8 encoded files is better than using iconv. For readers who don't know it: iconv is a widespread C library that performs buffer-to-buffer conversion between text encodings (UTF-8, UTF-16, Latin-1, UCS-2, UCS-4 and more). The hand-made (en)coder implemented by me is just a "converter", i.e. a higher-order function, between the getByte/putByte and getChar/putChar operations, so it can be used in any monad and for any purpose, not only for text I/O.

One can find an example of a library that uses iconv in the "System\IO\Text.hs" module from http://haskell.org/~simonmar/new-io.tar.gz, and an example of a hand-made encoder in the module "Data\CharEncoding.hs", with its usage in "System\Stream\Transformer\CharEncoding.hs", from http://freearc.narod.ru/Streams.tar.gz

I crossposted this letter to Marcin and Simon because you have discussed this question with me, and to Einar because he once asked me about one specific feature in this area.

Why iconv is better:

1) it's lightning fast, adding virtually zero speed overhead
2) it's robust
3) it contains already-implemented and debugged algorithms for all the encodings we can encounter
4) it has highly developed error-processing facilities (i.e. signalling errors in the input data and/or masking them)

Why hand-made conversion is better:

1) I don't know whether iconv will be available on every Hugs and GHC installation.
2) Einar once asked me about changing the encoding on the fly, which is needed for some HTML processing. It is also possible that some program will need to intersperse text I/O with buffer/array/byte/bit I/O. This is the sort of thing that is absolutely impossible with iconv.
3) My library supports Streams that work in ANY monad (not only IO, ST and their derivatives). It's impossible to implement iconv conversion for such stream types.

As you can see, while the last arguments concern very specific situations, those situations can't be handled by iconv at all, so we need to implement hand-made conversions anyway. On the other side, iconv's strong points are not of principal importance: the speed of the hand-made routines will be sufficient, about several MB/s; all the encodings can be implemented and debugged sooner or later; and only the processing of errors in the input data is a weak point of the current design itself.

Moreover, there are implementation issues that make me more enthusiastic about the hand-made solution. It is already implemented and really works. The implementation of CharEncoding for streams is in the module "System\Stream\Transformer\CharEncoding.hs", and it is very trivial. The implementation of the different encoders in "Data\CharEncoding.hs" is slightly more complex, but these routines are also used in "instance Binary String", i.e. to serialize strings. Also, I think the "Data\CharEncoding.hs" module should be part of the standard Haskell library, so the implementation of the CharEncoding stream transformer is almost "free".

On the other side, the implementation of text encoding in the "new I/O" library is about 1000 lines long. While I don't need to copy it all, using iconv will in any case be much more complex than using hand-made routines. This includes the complexity of interacting with iconv itself and the complexity of implementing various I/O operations over a buffer that contains 4-byte characters. I have already implemented 3 buffering transformers, and adding one more buffering scheme is the last thing I want to do. Quite the opposite: I'm now searching for ways to avoid repeated code by joining them all into one. It's very boring to have 3 or 4 similar things and to replicate every change across them all.

At the same time, the library design is open, and it's entirely possible to have two alternative char-encoding transformers.
Everyone can develop additional transformers even without interacting with me; such a transformer should just implement the vGetChar/vPutChar operations via the vGetBuf/vPutBuf ones. I just propose to leave things as they are, and go on to implementing an iconv-based transformer only when we are actually bothered by its restrictions.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
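[Editorial illustration of the "converter" idea from the first message: a higher-order function that turns a monadic getByte action into a getChar action. This is a minimal sketch, not code from the Streams library, and it omits the validity checks a real decoder needs (overlong forms, truncated input, surrogates):]

```haskell
import Data.Bits ((.&.), (.|.), shiftL)
import Data.Char (chr)
import Data.Word (Word8)

-- A UTF-8 decoder as a higher-order "converter": given any monadic
-- getByte action, produce a getChar action in the same monad.
-- Handles 1- to 4-byte sequences; no error detection.
decodeCharUTF8 :: Monad m => m Word8 -> m Char
decodeCharUTF8 getByte = do
    b0 <- fmap fromIntegral getByte
    -- each continuation byte contributes its low 6 bits
    let cont = fmap ((.&. 0x3F) . fromIntegral) getByte
    if b0 < 0x80
      then return (chr b0)                                 -- 1 byte (ASCII)
      else if b0 < 0xE0
        then do                                            -- 2 bytes
          b1 <- cont
          return (chr (((b0 .&. 0x1F) `shiftL` 6) .|. b1))
        else if b0 < 0xF0
          then do                                          -- 3 bytes
            b1 <- cont
            b2 <- cont
            return (chr (((b0 .&. 0x0F) `shiftL` 12) .|. (b1 `shiftL` 6) .|. b2))
          else do                                          -- 4 bytes
            b1 <- cont
            b2 <- cont
            b3 <- cont
            return (chr (((b0 .&. 0x07) `shiftL` 18) .|. (b1 `shiftL` 12)
                         .|. (b2 `shiftL` 6) .|. b3))
```

[Because the only constraint is Monad, the same decoder runs in IO, ST or a pure state monad, which is the property point 3 above relies on.]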

On 20.04 17:38, Bulat Ziganshin wrote:
One can find an example of a library that uses iconv in the "System\IO\Text.hs" module from http://haskell.org/~simonmar/new-io.tar.gz, and an example of a hand-made encoder in the module "Data\CharEncoding.hs", with its usage in "System\Stream\Transformer\CharEncoding.hs", from http://freearc.narod.ru/Streams.tar.gz

Does Data.CharEncoding work with encodings that have state associated with them? One example is ISO-2022-JP. Maybe by using a suitable monad transformer?

2) Einar once asked me about changing the encoding on the fly, which is needed for some HTML processing. It is also possible that some program will need to intersperse text I/O with buffer/array/byte/bit I/O. This is the sort of thing that is absolutely impossible with iconv.

The example goes like this:

1) The HTTP client reads the response from the server using ASCII.
2) When reading the headers is complete, either:
   * decode the body (binary data) and, after decompressing, convert it to text, or
   * decode the body (text in some encoding) straight from the Handle.

Is there a reason this is impossible with iconv if the character conversion is on top of the buffering?

- Einar Karttunen

Hello Einar, Thursday, April 20, 2006, 6:24:14 PM, you wrote:
Does Data.CharEncoding work with encodings that have state associated with them? One example is ISO-2022-JP.
No. So the list of things that are impossible in principle with the current design of Data.CharEncoding is: error processing/masking, and handling of stateful encodings.
Maybe by using a suitable monad transformer?
How do you imagine that? We have the following classes:

  class ByteStream m h where
    vGetByte :: h -> m Word8
    vPutByte :: h -> Word8 -> m ()

  class TextStream m h where
    vGetChar :: h -> m Char
    vPutChar :: h -> Char -> m ()

and the char-encoding transformer should implement the latter via the former:

  instance ByteStream m h => TextStream m (CharEncoding h) where ...

It seems that we should just improve the types of the (vGetByte -> vGetChar) and (vPutByte -> vPutChar) converters so that they accept the old state and the error-processing mode, and return an error code and a new state. Something like this:

  type PutByte m h = h -> Word8 -> m ()
  type EncodeConverter m h state =
      PutByte m h -> ErrMode -> h -> state -> m (Either Char ErrCode, state)

where `state` saves the current processing state, ErrMode is the error-processing mode and ErrCode is the error code. Of course, this would make the implementation even slower :(
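[To see why the state parameter matters, consider a toy stateful encoding (purely illustrative, not part of any library): the byte 0x0E toggles a shift mode, loosely mimicking the escape-driven mode switches of ISO-2022-JP. The decoder must accept the previous state and return the next one, just as in the EncodeConverter type sketched above:]

```haskell
import Data.Char (chr, toUpper)
import Data.Word (Word8)

-- Toy stateful encoding: byte 0x0E toggles "shift" mode; in shift mode
-- letters decode uppercased. A miniature stand-in for mode-switching
-- encodings such as ISO-2022-JP.
data ShiftState = Plain | Shifted deriving (Eq, Show)

-- The decoder takes the previous state and returns the next one,
-- so the caller can thread it through successive calls.
decodeShifty :: Monad m => m Word8 -> ShiftState -> m (Char, ShiftState)
decodeShifty getByte st = do
    b <- getByte
    case (b, st) of
      (0x0E, Plain)   -> decodeShifty getByte Shifted   -- enter shifted mode
      (0x0E, Shifted) -> decodeShifty getByte Plain     -- leave shifted mode
      (_,    Plain)   -> return (chr (fromIntegral b), st)
      (_,    Shifted) -> return (toUpper (chr (fromIntegral b)), st)
```

[Threading the state explicitly is exactly what a plain `m Word8 -> m Char` converter cannot do, which is why the current Data.CharEncoding design cannot handle such encodings.]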
2) Einar once asked me about changing the encoding on the fly, which is needed for some HTML processing. It is also possible that some program will need to intersperse text I/O with buffer/array/byte/bit I/O. This is the sort of thing that is absolutely impossible with iconv.

The example goes like this: 1) The HTTP client reads the response from the server using ASCII. 2) When reading the headers is complete, either: * decode the body (binary data) and, after decompressing, convert it to text, or * decode the body (text in some encoding) straight from the Handle.

Is there a reason this is impossible with iconv if the character conversion is on top of the buffering?
Let them answer :) I just want to mention to Simon that some apps want to use binary and text I/O on the same stream. If you think that HTTP has a bad design, you know where to complain ;) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin
This letter describes why I think that using a hand-made (de)coder to support UTF-8 encoded files is better than using iconv.
A Haskell recoder is fine, and probably a good idea for important encodings, provided that wrapping a block recoder implemented in C is not ruled out. The two approaches should coexist.
2) Einar once asked me about changing the encoding on the fly, which is needed for some HTML processing.
HTML can be parsed by treating it as ISO-8859-1 first, looking only for the headers that specify the encoding, and then converting the whole stream to the right encoding.
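[That two-pass approach can be sketched as follows (a deliberately naive charset sniffer, purely illustrative; real HTML parsing is more involved). Since ISO-8859-1 maps every byte to a character, the first pass can never fail:]

```haskell
import Data.Char (chr, isAlphaNum, toLower)
import Data.List (isPrefixOf, tails)
import Data.Word (Word8)

-- Pass 1: treat the raw bytes as ISO-8859-1 (a byte-to-char bijection)
-- and look for a declaration such as <meta ... charset=utf-8>.
-- Pass 2 (not shown) would re-decode the same bytes with that encoding.
sniffCharset :: [Word8] -> Maybe String
sniffCharset bytes =
    case [t | t <- tails latin1, "charset=" `isPrefixOf` t] of
      []      -> Nothing
      (t : _) -> Just (takeWhile isToken (drop 8 t))   -- 8 = length "charset="
  where
    latin1    = map (toLower . chr . fromIntegral) bytes
    isToken c = isAlphaNum c || c == '-' || c == '_'
```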
It is also possible that some program will need to intersperse text I/O with buffer/array/byte/bit I/O. This is the sort of thing that is absolutely impossible with iconv.
Of course it's possible. HTTP specifies that the headers end with an empty line, so the boundary can be found without decoding the text at all. Then the part before the boundary is treated as ASCII text and converted to strings, and the rest is binary. Alternatively, the text can be read by decoding one character at a time, and after the boundary is found, the rest is read from the underlying binary stream. Even iconv can be used one character at a time (it will only be inefficient), but here ASCII can be implemented by hand. Emitting HTTP is analogous.
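[Reading the headers character by character on top of a binary byte source, as described above, might look like this sketch (assumed names; getHeaderLine and getHeaders are not an existing API):]

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- Read one CRLF- or LF-terminated header line, decoding bytes as
-- ASCII one character at a time; the byte source itself stays binary.
getHeaderLine :: Monad m => m Word8 -> m String
getHeaderLine getByte = go []
  where
    go acc = do
      b <- getByte
      if b == 10                                       -- LF ends the line
        then return (reverse (dropWhile (== '\r') acc))
        else go (chr (fromIntegral b) : acc)

-- Read headers until the empty line that separates them from the body;
-- everything after that boundary can be read as raw binary.
getHeaders :: Monad m => m Word8 -> m [String]
getHeaders getByte = do
  line <- getHeaderLine getByte
  if null line
    then return []
    else fmap (line :) (getHeaders getByte)
```

[After getHeaders returns, the same getByte action keeps serving the raw body bytes, so text and binary I/O interleave on one stream.]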
3) My library supports Streams that work in ANY monad (not only IO, ST and their derivatives). It's impossible to implement iconv conversion for such stream types.
Which is good. It's impossible to implement a stateful encoding in a monad which doesn't carry state.
Moreover, there are implementation issues that make me more enthusiastic about the hand-made solution. It is already implemented and really works.
Your implementation doesn't detect unencodable or malformed input. And I've already implemented both an iconv wrapper and some hand-written encodings (though not for Haskell). They work too :-)
Using iconv will in any case be much more complex than using hand-made routines.
iconv is done once, and tens of encodings become available at once. Each hand-made encoding would have to be implemented separately.

-- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

FWIW, there's a fairly complete pure-Haskell UTF-8 converter implementation in the HXML toolbox, which I "stole" and adapted for a version of HaXml; e.g.: http://www.ninebynine.org/Software/HaskellUtils/HaXml-1.12/src/Text/XML/HaXm... (Please ignore me if I miss your point.) #g
-- Graham Klyne For email: http://www.ninebynine.org/#Contact