implementation of UTF-8 conversion for text I/O: iconv vs hand-made

Hello all,

This letter describes why I think that using a hand-made (de)coder to support UTF-8 encoded files is better than using iconv. For readers who don't know it: iconv is a widespread C library that performs buffer-to-buffer conversion between text encodings (UTF-8, UTF-16, Latin-1, UCS-2, UCS-4 and more). The hand-made (en)coder implemented by me is just a "converter", i.e. a higher-order function, between the getByte/putByte and getChar/putChar operations, so it can be used in any monad and for any purpose, not only for text I/O.

One can find an example of a library that uses iconv in the "System\IO\Text.hs" module from http://haskell.org/~simonmar/new-io.tar.gz, and an example of a hand-made encoder in the module "Data\CharEncoding.hs", with its usage in "System\Stream\Transformer\CharEncoding.hs", from http://freearc.narod.ru/Streams.tar.gz

I crossposted this letter to Marcin and Simon because you have discussed this question with me, and to Einar because he once asked me about one specific feature in this area.

Why iconv is better:

1) it's lightning fast, adding virtually zero speed overhead
2) it's robust
3) it contains already-implemented and debugged algorithms for all the encodings we can encounter
4) it has highly developed error-processing facilities (i.e. signalling errors in the input data and/or masking them)

Why hand-made conversion is better:

1) I don't know whether iconv will be available on every Hugs and GHC installation.
2) Einar once asked me about changing the encoding on the fly, which is needed for some HTML processing. It is also possible that some program will need to intersperse text I/O with buffer/array/byte/bit I/O. This is the sort of thing that is absolutely impossible with iconv.
3) My library supports Streams that work in ANY monad (not only IO, ST and their derivatives). It's impossible to implement iconv conversion for such stream types.

As you can see, while the last arguments concern very specific situations, those situations can't be handled by iconv at all, so we need to implement hand-made conversions anyway. On the other side, iconv's strong points are not of principal importance: the speed of the hand-made routines will be sufficient, about several MB/s; all the encodings can be implemented and debugged sooner or later; and only the processing of errors in the input data is a weak point of the current design itself.

Moreover, there are implementation issues that make me more enthusiastic about the hand-made solution. It is already implemented and really works. The implementation of CharEncoding for streams is in the module "System\Stream\Transformer\CharEncoding.hs", and it is very trivial. The implementation of the different encoders in "Data\CharEncoding.hs" is slightly more complex, but these routines are also used in "instance Binary String", i.e. to serialize strings. Also, I think the "Data\CharEncoding.hs" module should be part of the standard Haskell library, so the implementation of the CharEncoding stream transformer is almost "free".

On the other side, the implementation of text encoding in the "new I/O" library is about 1000 lines long. While I don't need to copy it all, using iconv will in any case be much more complex than using hand-made routines. This includes the complexity of interacting with iconv itself and the complexity of implementing various I/O operations over a buffer that contains 4-byte characters. I have already implemented 3 buffering transformers, and adding one more buffering scheme is the last thing I want to do. Quite the opposite: I'm now searching for ways to avoid repeated code by joining them all into one. It's very boring to have 3 or 4 similar things and to replicate every change across them all.

At the same time, the library design is open, and it's entirely possible to have two alternative char-encoding transformers.
Everyone can develop additional transformers even without interacting with me; such a transformer should just implement the vGetChar/vPutChar operations via the vGetBuf/vPutBuf ones. I just propose to leave things as they are, and go on to implementing an iconv-based transformer only when we are actually bothered by its restrictions.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
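[Editorial illustration of the "converter" idea from the first message: a higher-order function that turns a monadic getByte action into a getChar action. This is a minimal sketch, not code from the Streams library, and it omits the validity checks a real decoder needs (overlong forms, truncated input, surrogates):]

```haskell
import Data.Bits ((.&.), (.|.), shiftL)
import Data.Char (chr)
import Data.Word (Word8)

-- A UTF-8 decoder as a higher-order "converter": given any monadic
-- getByte action, produce a getChar action in the same monad.
-- Handles 1- to 4-byte sequences; no error detection.
decodeCharUTF8 :: Monad m => m Word8 -> m Char
decodeCharUTF8 getByte = do
    b0 <- fmap fromIntegral getByte
    -- each continuation byte contributes its low 6 bits
    let cont = fmap ((.&. 0x3F) . fromIntegral) getByte
    if b0 < 0x80
      then return (chr b0)                                 -- 1 byte (ASCII)
      else if b0 < 0xE0
        then do                                            -- 2 bytes
          b1 <- cont
          return (chr (((b0 .&. 0x1F) `shiftL` 6) .|. b1))
        else if b0 < 0xF0
          then do                                          -- 3 bytes
            b1 <- cont
            b2 <- cont
            return (chr (((b0 .&. 0x0F) `shiftL` 12) .|. (b1 `shiftL` 6) .|. b2))
          else do                                          -- 4 bytes
            b1 <- cont
            b2 <- cont
            b3 <- cont
            return (chr (((b0 .&. 0x07) `shiftL` 18) .|. (b1 `shiftL` 12)
                         .|. (b2 `shiftL` 6) .|. b3))
```

[Because the only constraint is Monad, the same decoder runs in IO, ST or a pure state monad, which is the property point 3 above relies on.]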

On 20.04 17:38, Bulat Ziganshin wrote:
One can find an example of a library that uses iconv in the "System\IO\Text.hs" module from http://haskell.org/~simonmar/new-io.tar.gz, and an example of a hand-made encoder in the module "Data\CharEncoding.hs", with its usage in "System\Stream\Transformer\CharEncoding.hs", from http://freearc.narod.ru/Streams.tar.gz

Does Data.CharEncoding work with encodings that have state associated with them? One example is ISO-2022-JP. Maybe by using a suitable monad transformer?

2) Einar once asked me about changing the encoding on the fly, which is needed for some HTML processing. It is also possible that some program will need to intersperse text I/O with buffer/array/byte/bit I/O. This is the sort of thing that is absolutely impossible with iconv.

The example goes like this:

1) The HTTP client reads the response from the server using ASCII.
2) When reading the headers is complete, either:
   * decode the body (binary data) and, after decompressing, convert it to text, or
   * decode the body (text in some encoding) straight from the Handle.

Is there a reason this is impossible with iconv if the character conversion is on top of the buffering?

- Einar Karttunen

Hello Einar, Thursday, April 20, 2006, 6:24:14 PM, you wrote:
Does Data.CharEncoding work with encodings that have state associated with them? One example is ISO-2022-JP.
No. So the list of things that are impossible in principle with the current design of Data.CharEncoding is: error processing/masking, and handling of stateful encodings.
Maybe by using a suitable monad transformer?
How do you imagine that? We have the following classes:

  class ByteStream m h where
    vGetByte :: h -> m Word8
    vPutByte :: h -> Word8 -> m ()

  class TextStream m h where
    vGetChar :: h -> m Char
    vPutChar :: h -> Char -> m ()

and the char-encoding transformer should implement the latter via the former:

  instance ByteStream m h => TextStream m (CharEncoding h) where ...

It seems that we should just improve the types of the (vGetByte -> vGetChar) and (vPutByte -> vPutChar) converters so that they accept the old state and the error-processing mode, and return an error code and a new state. Something like this:

  type PutByte m h = h -> Word8 -> m ()
  type EncodeConverter m h state =
      PutByte m h -> ErrMode -> h -> state -> m (Either Char ErrCode, state)

where `state` saves the current processing state, ErrMode is the error-processing mode and ErrCode is the error code. Of course, this would make the implementation even slower :(
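[To see why the state parameter matters, consider a toy stateful encoding (purely illustrative, not part of any library): the byte 0x0E toggles a shift mode, loosely mimicking the escape-driven mode switches of ISO-2022-JP. The decoder must accept the previous state and return the next one, just as in the EncodeConverter type sketched above:]

```haskell
import Data.Char (chr, toUpper)
import Data.Word (Word8)

-- Toy stateful encoding: byte 0x0E toggles "shift" mode; in shift mode
-- letters decode uppercased. A miniature stand-in for mode-switching
-- encodings such as ISO-2022-JP.
data ShiftState = Plain | Shifted deriving (Eq, Show)

-- The decoder takes the previous state and returns the next one,
-- so the caller can thread it through successive calls.
decodeShifty :: Monad m => m Word8 -> ShiftState -> m (Char, ShiftState)
decodeShifty getByte st = do
    b <- getByte
    case (b, st) of
      (0x0E, Plain)   -> decodeShifty getByte Shifted   -- enter shifted mode
      (0x0E, Shifted) -> decodeShifty getByte Plain     -- leave shifted mode
      (_,    Plain)   -> return (chr (fromIntegral b), st)
      (_,    Shifted) -> return (toUpper (chr (fromIntegral b)), st)
```

[Threading the state explicitly is exactly what a plain `m Word8 -> m Char` converter cannot do, which is why the current Data.CharEncoding design cannot handle such encodings.]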
2) Einar once asked me about changing the encoding on the fly, which is needed for some HTML processing. It is also possible that some program will need to intersperse text I/O with buffer/array/byte/bit I/O. This is the sort of thing that is absolutely impossible with iconv.

The example goes like this: 1) The HTTP client reads the response from the server using ASCII. 2) When reading the headers is complete, either: * decode the body (binary data) and, after decompressing, convert it to text, or * decode the body (text in some encoding) straight from the Handle.

Is there a reason this is impossible with iconv if the character conversion is on top of the buffering?
Let them answer :) I just want to mention to Simon that some apps want to use binary and text I/O on the same stream. If you think that HTTP has a bad design, you know where to complain ;) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin
This letter describes why I think that using a hand-made (de)coder to support UTF-8 encoded files is better than using iconv.
A Haskell recoder is fine, and probably a good idea for important encodings, provided that wrapping a block recoder implemented in C is not ruled out. The two approaches should coexist.
2) Einar once asked me about changing the encoding on the fly, which is needed for some HTML processing.
HTML can be parsed by treating it as ISO-8859-1 first, looking only for the headers that specify the encoding, and then converting the whole stream to the right encoding.
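[That two-pass approach can be sketched as follows (a deliberately naive charset sniffer, purely illustrative; real HTML parsing is more involved). Since ISO-8859-1 maps every byte to a character, the first pass can never fail:]

```haskell
import Data.Char (chr, isAlphaNum, toLower)
import Data.List (isPrefixOf, tails)
import Data.Word (Word8)

-- Pass 1: treat the raw bytes as ISO-8859-1 (a byte-to-char bijection)
-- and look for a declaration such as <meta ... charset=utf-8>.
-- Pass 2 (not shown) would re-decode the same bytes with that encoding.
sniffCharset :: [Word8] -> Maybe String
sniffCharset bytes =
    case [t | t <- tails latin1, "charset=" `isPrefixOf` t] of
      []      -> Nothing
      (t : _) -> Just (takeWhile isToken (drop 8 t))   -- 8 = length "charset="
  where
    latin1    = map (toLower . chr . fromIntegral) bytes
    isToken c = isAlphaNum c || c == '-' || c == '_'
```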
It is also possible that some program will need to intersperse text I/O with buffer/array/byte/bit I/O. This is the sort of thing that is absolutely impossible with iconv.
Of course it's possible. HTTP specifies that the headers end with an empty line, so the boundary can be found without decoding the text at all. Then the part before the boundary is treated as ASCII text and converted to strings, and the rest is binary. Alternatively, the text can be read by decoding one character at a time, and after the boundary is found, the rest is read from the underlying binary stream. Even iconv can be used one character at a time (it will only be inefficient), but here ASCII can be implemented by hand. Emitting HTTP is analogous.
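[Reading the headers character by character on top of a binary byte source, as described above, might look like this sketch (assumed names; getHeaderLine and getHeaders are not an existing API):]

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- Read one CRLF- or LF-terminated header line, decoding bytes as
-- ASCII one character at a time; the byte source itself stays binary.
getHeaderLine :: Monad m => m Word8 -> m String
getHeaderLine getByte = go []
  where
    go acc = do
      b <- getByte
      if b == 10                                       -- LF ends the line
        then return (reverse (dropWhile (== '\r') acc))
        else go (chr (fromIntegral b) : acc)

-- Read headers until the empty line that separates them from the body;
-- everything after that boundary can be read as raw binary.
getHeaders :: Monad m => m Word8 -> m [String]
getHeaders getByte = do
  line <- getHeaderLine getByte
  if null line
    then return []
    else fmap (line :) (getHeaders getByte)
```

[After getHeaders returns, the same getByte action keeps serving the raw body bytes, so text and binary I/O interleave on one stream.]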
3) My library supports Streams that work in ANY monad (not only IO, ST and their derivatives). It's impossible to implement iconv conversion for such stream types.
Which is good. It's impossible to implement a stateful encoding in a monad which doesn't carry state.
Moreover, there are implementation issues that make me more enthusiastic about the hand-made solution. It is already implemented and really works.
Your implementation doesn't detect unencodable or malformed input. And I've already implemented both an iconv wrapper and some hand-written encodings (though not for Haskell). They work too :-)
Using iconv will in any case be much more complex than using hand-made routines.
iconv is done once, and tens of encodings become available at once. Each hand-made encoding would have to be implemented separately.

-- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

FWIW, there's a fairly complete pure-Haskell UTF-8 converter implementation in the HXML toolbox, which I "stole" and adapted for a version of HaXml; e.g.: http://www.ninebynine.org/Software/HaskellUtils/HaXml-1.12/src/Text/XML/HaXm... (Please ignore me if I miss your point.) #g
-- Graham Klyne For email: http://www.ninebynine.org/#Contact