Re: UTF-8 encode/decode libraries.

Duncan Coutts wrote:
On Mon, 2004-04-26 at 18:49, David Brown wrote:
[...]
toUTF :: String -> String
Hmmm, "String -> [Word8]" would be nicer...
fromUTF :: String -> String
... and here: "[Word8] -> String" or "[Word8] -> Maybe String".

Furthermore, UTF-8 is not restricted to a maximum of 3 bytes per character, here an excerpt from "man utf8" on my SuSE Linux:

   * UTF-8 encoded UCS characters may be up to six bytes long, however the Unicode standard specifies no characters above 0x10ffff, so Unicode characters can only be up to four bytes long in UTF-8.

IIRC we discussed encoders/decoders quite some time ago on the libraries mailing list, but nothing really happened, which is a pity. We should strive for something more general than UTF-8 <-> UCS/Unicode, there are quite a few more widely used encodings, e.g. GSM 03.38, etc. Any takers?

Cheers,
   S.
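
For illustration only, here is a minimal sketch of what the suggested signatures could look like, covering the full one-to-four-byte range mentioned in the man page excerpt. It is not the code under discussion in this thread: the names toUTF8, fromUTF8 and encodeChar are hypothetical, and the decoder's choice to reject overlong forms, surrogates and truncated input by returning Nothing is an assumption about what "Maybe String" should mean.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Encode a single code point as 1 to 4 bytes (hypothetical helper).
-- Note: surrogate code points are not filtered out here; the decoder
-- below rejects them, so they do not round-trip.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Payload bits of the lead byte.
    hi :: Int -> Word8
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six payload bits.
    cont :: Int -> Word8
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

toUTF8 :: String -> [Word8]
toUTF8 = concatMap encodeChar

-- Decode, returning Nothing on any malformed sequence.
fromUTF8 :: [Word8] -> Maybe String
fromUTF8 [] = Just []
fromUTF8 (b:bs)
  | b < 0x80           = (chr (fromIntegral b) :) <$> fromUTF8 bs
  | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F) 0x80
  | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F) 0x800
  | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07) 0x10000
  | otherwise          = Nothing
  where
    -- k continuation bytes, lead-byte payload, smallest legal value.
    multi :: Int -> Word8 -> Int -> Maybe String
    multi k lead lo
      | length cs == k && all isCont cs && valid = (chr n :) <$> fromUTF8 rest
      | otherwise                                = Nothing
      where
        (cs, rest) = splitAt k bs
        isCont x   = x .&. 0xC0 == 0x80
        n          = foldl step (fromIntegral lead) cs
        step acc x = (acc `shiftL` 6) .|. fromIntegral (x .&. 0x3F)
        -- Reject overlong forms, out-of-range values and surrogates.
        valid      = n >= lo && n <= 0x10FFFF && (n < 0xD800 || n > 0xDFFF)
```

With these definitions, fromUTF8 (toUTF8 s) gives back Just s for any string without surrogate code points, while a truncated or overlong byte sequence yields Nothing instead of a garbled string.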

On Mon, Apr 26, 2004 at 08:33:38PM +0200, Sven Panne wrote:
Duncan Coutts wrote:
On Mon, 2004-04-26 at 18:49, David Brown wrote:
[...]
toUTF :: String -> String
Hmmm, "String -> [Word8]" would be nicer...
fromUTF :: String -> String
... and here: "[Word8] -> String" or "[Word8] -> Maybe String".
Except that I would then have to come up with my own IO routines to read and write UTF data. With both sides as string, it is easy to just filter input and output of files.

Dave
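
As a rough sketch of the trade-off described here: if each Char of a String is used to hold one byte (values 0 to 255), a byte-level codec can still be driven through the ordinary String-based IO functions. The helper names bytesToString, stringToBytes and runFilter are hypothetical, and the sketch assumes GHC's System.IO, where hSetBinaryMode makes a handle read and write one Char per byte.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)
import System.IO (hSetBinaryMode, stdin, stdout)

-- View a byte sequence as a String with one Char per byte.
bytesToString :: [Word8] -> String
bytesToString = map (chr . fromIntegral)

-- Go back to bytes; Nothing if some Char does not fit in a byte.
stringToBytes :: String -> Maybe [Word8]
stringToBytes = traverse toByte
  where
    toByte c
      | ord c < 256 = Just (fromIntegral (ord c))
      | otherwise   = Nothing

-- Run a String -> String filter over stdin/stdout in binary mode,
-- so that no newline or encoding translation is applied.
runFilter :: (String -> String) -> IO ()
runFilter f = do
  hSetBinaryMode stdin True
  hSetBinaryMode stdout True
  interact f
```

The cost of this convenience is that the types no longer say which Strings hold text and which hold raw bytes, which is the concern behind the "[Word8]" suggestion.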