
FWIW, there's a fairly complete pure-Haskell UTF-8 converter implementation in the HXML toolbox, which I "stole" and adapted for a version of HaXml; e.g.: http://www.ninebynine.org/Software/HaskellUtils/HaXml-1.12/src/Text/XML/HaXm...

(Please ignore me if I miss your point.)

#g
--

Bulat Ziganshin wrote:
Hello all
this letter describes why i think that using a hand-made (de)coder to support UTF-8 encoded files is better than using iconv. for other readers: iconv is a widespread C library that performs buffer-to-buffer conversion between text encodings (utf-8, utf-16, latin-1, ucs-2, ucs-4 and more). the hand-made (en)coder i implemented is just a "converter", i.e. a higher-order function that turns getByte/putByte operations into getChar/putChar operations. so it can be used in any monad and for any purpose, not only for text I/O
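To make the "converter as a higher-order function" idea concrete, here is a minimal sketch. It is not the code from "Data\CharEncoding.hs"; the names putCharUtf8, getCharUtf8, Feed, nextByte, encodeString and decodeN are made up for illustration, and malformed input is not validated (a code point above U+10FFFF is simply replaced by U+FFFD).

```haskell
import Control.Monad (replicateM)
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Turn any byte writer into a character writer (UTF-8 encoding).
putCharUtf8 :: Monad m => (Word8 -> m ()) -> Char -> m ()
putCharUtf8 putByte c
  | n < 0x80    = putByte (fromIntegral n)
  | n < 0x800   = putByte (0xC0 .|. top 6)  >> cont 0
  | n < 0x10000 = putByte (0xE0 .|. top 12) >> cont 6 >> cont 0
  | otherwise   = putByte (0xF0 .|. top 18) >> cont 12 >> cont 6 >> cont 0
  where
    n      = ord c
    -- payload bits that go into the lead byte
    top s  = fromIntegral (n `shiftR` s)
    -- one continuation byte carrying six payload bits
    cont s = putByte (0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F))

-- | Turn any byte reader into a character reader (UTF-8 decoding).
-- No validation: a stray continuation byte is treated as a lead byte.
getCharUtf8 :: Monad m => m Word8 -> m Char
getCharUtf8 getByte = getByte >>= start . fromIntegral
  where
    start n
      | n < 0x80  = return (chr n)
      | n < 0xE0  = go (1 :: Int) (n .&. 0x1F)
      | n < 0xF0  = go 2 (n .&. 0x0F)
      | otherwise = go 3 (n .&. 0x07)
    go 0 acc = return (if acc > 0x10FFFF then '\xFFFD' else chr acc)
    go k acc = getByte >>= \b ->
      go (k - 1) ((acc `shiftL` 6) .|. (fromIntegral b .&. 0x3F))

-- A tiny pure byte-source monad, only to show that IO is not required.
newtype Feed a = Feed { runFeed :: [Word8] -> (a, [Word8]) }

instance Functor Feed where
  fmap f (Feed g) = Feed $ \s -> let (a, s') = g s in (f a, s')

instance Applicative Feed where
  pure a = Feed $ \s -> (a, s)
  Feed f <*> Feed g = Feed $ \s ->
    let (h, s')  = f s
        (a, s'') = g s'
    in (h a, s'')

instance Monad Feed where
  Feed g >>= k = Feed $ \s -> let (a, s') = g s in runFeed (k a) s'

-- | Next byte of the input; end of input is not modelled and yields 0.
nextByte :: Feed Word8
nextByte = Feed $ \s -> case s of
  (b:bs) -> (b, bs)
  []     -> (0, [])

-- | Encode in the pair ("writer") monad from base: bytes pile up on the left.
encodeString :: String -> [Word8]
encodeString s = fst (mapM_ (putCharUtf8 (\b -> ([b], ()))) s)

-- | Decode exactly n characters from a byte list using the Feed monad.
decodeN :: Int -> [Word8] -> String
decodeN n = fst . runFeed (replicateM n (getCharUtf8 nextByte))
```

The same two functions work unchanged in IO (pass a byte reader or writer built on a handle) or in ST, because all they require of m is a Monad instance.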
an example of a library that uses iconv can be found in the "System\IO\Text.hs" module from http://haskell.org/~simonmar/new-io.tar.gz, and an example of a hand-made encoder in the module "Data\CharEncoding.hs", with its usage in "System\Stream\Transformer\CharEncoding.hs", from http://freearc.narod.ru/Streams.tar.gz
i crossposted this letter to Marcin and Simon because you have discussed this question with me, and to Einar because he once asked me about one specific feature in this area.
why iconv is better:
1) it's lightning fast, adding virtually no speed overhead
2) it's robust
3) it contains already implemented and debugged algorithms for all the encodings we are likely to encounter
4) it has highly developed error processing facilities (i mean signalling errors in input data and/or masking them)
why hand-made conversion is better:
1) i don't know whether iconv will be available on every Hugs and GHC installation?
2) Einar once asked me about changing the encoding on the fly, which is needed for some HTML processing. it is also possible that some program will need to intersperse text I/O with buffer/array/byte/bit I/O. these are the sort of things that are absolutely impossible with iconv
3) my library supports Streams that work in ANY monad (not only IO, ST and their derivatives). it's impossible to implement iconv conversion for such stream types
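Point 2 can be sketched as follows. Because a per-character converter consumes exactly the bytes of one character and nothing more, the caller can switch encodings or read raw bytes between any two characters. This is only an illustration under assumed names (Encoding, getCharIn, byteSource and demo are hypothetical, not the Streams API), with Latin-1 and big-endian UCS-2 chosen because they need no validation.

```haskell
import Control.Monad (replicateM)
import Data.Char (chr)
import Data.IORef (newIORef, readIORef, writeIORef)
import Data.Word (Word8)

-- Hypothetical encoding tag; a real library would offer many more.
data Encoding = Latin1 | UCS2BE

-- | Read one character from a byte reader in the given encoding.
getCharIn :: Monad m => Encoding -> m Word8 -> m Char
getCharIn Latin1 getByte = fmap (chr . fromIntegral) getByte
getCharIn UCS2BE getByte = do
  hi <- getByte
  lo <- getByte
  return (chr (fromIntegral hi * 256 + fromIntegral lo))

-- | A byte reader over an in-memory list, standing in for a file handle.
byteSource :: [Word8] -> IO (IO Word8)
byteSource bytes = do
  ref <- newIORef bytes
  return $ do
    bs <- readIORef ref
    case bs of
      (b:rest) -> writeIORef ref rest >> return b
      []       -> ioError (userError "end of input")

-- | Two Latin-1 characters, one raw byte, then two UCS-2 characters,
-- all read from the same source with no intermediate buffer to flush.
demo :: IO (String, Word8, String)
demo = do
  getByte <- byteSource [0x68, 0x69, 0x2A, 0x20, 0xAC, 0x00, 0x21]
  header  <- replicateM 2 (getCharIn Latin1 getByte)
  marker  <- getByte
  body    <- replicateM 2 (getCharIn UCS2BE getByte)
  return (header, marker, body)
```

A buffer-to-buffer converter would typically have decoded ahead past the marker byte, which is the difficulty being described for iconv.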
as you can see, while the last two arguments concern very specific situations, these situations absolutely can't be handled by iconv, so we need to implement hand-made conversions anyway. on the other side, iconv's strong points are not of fundamental importance: the speed of hand-made routines will be sufficient, around several MB/s; all the encodings we need can be implemented and debugged sooner or later; only the processing of errors in input data is a weak point of the current design itself
moreover, there are implementation issues that make me more enthusiastic about the hand-made solution. it is simply already implemented and really works. the implementation of CharEncoding for streams is in the module "System\Stream\Transformer\CharEncoding.hs", which is very trivial. the implementation of the different encoders in "Data\CharEncoding.hs" is slightly more complex, but these routines are also used in "instance Binary String", i.e. to serialize strings. also, i think that the "Data\CharEncoding.hs" module should be part of the standard Haskell library, so the implementation of the CharEncoding stream transformer is almost "free"
on the other side, the implementation of text encoding in the "new I/O" library is about 1000 lines long. while i don't need to copy them all, using iconv will anyway be much more complex than using hand-made routines. this includes the complexity of interacting with iconv itself and the complexity of implementing various I/O operations over a buffer that contains 4-byte characters. i have already implemented 3 buffering transformers, and adding one more buffering scheme is the last thing i want to do. on the contrary, i'm now searching for ways to avoid code repetition by joining them all into one. it's very boring to have 3 or 4 similar things and replicate every change in all of them
at the same time, the library design is open and it's entirely possible to have two alternative char encoding transformers. anyone can develop additional transformers even without interacting with me; in this case, the transformer should just implement the vGetChar/vPutChar operations via the vGetBuf/vPutBuf ones. i just propose to leave things as they are, and move on to implementing an iconv-based transformer only when we are actually bothered by the hand-made coder's restrictions
--
Graham Klyne
For email: http://www.ninebynine.org/#Contact