UTF-8 encode/decode libraries.

I am writing some utilities to deal with UTF-8 encoded text files (not source files). Currently I just read the UTF-8 in directly, and things work reasonably well: my parse tokens are ASCII, so they are easy to parse. However, the character type seems perfectly happy with larger values for each character.

Is anyone aware of any Haskell libraries for doing UTF-8 decoding and encoding? If not, I'll write something simple.

Thanks,
Dave Brown

On Mon, 2004-04-26 at 18:49, David Brown wrote:
Is anyone aware of any Haskell libraries for doing UTF-8 decoding and encoding? If not, I'll write something simple.
The gtk2hs library uses the following functions internally. Credit to Axel Simon I believe unless he swiped them from somewhere too.

-- Convert Unicode characters to UTF-8.
--
toUTF :: String -> String
toUTF []     = []
toUTF (x:xs) | ord x<=0x007F = x:toUTF xs
             | ord x<=0x07FF = chr (0xC0 .|. ((ord x `shift` (-6)) .&. 0x1F)):
                               chr (0x80 .|. (ord x .&. 0x3F)):
                               toUTF xs
             | otherwise     = chr (0xE0 .|. ((ord x `shift` (-12)) .&. 0x0F)):
                               chr (0x80 .|. ((ord x `shift` (-6)) .&. 0x3F)):
                               chr (0x80 .|. (ord x .&. 0x3F)):
                               toUTF xs

-- Convert UTF-8 to Unicode.
--
fromUTF :: String -> String
fromUTF [] = []
fromUTF (all@(x:xs)) | ord x<=0x7F = x:fromUTF xs
                     | ord x<=0xBF = err
                     | ord x<=0xDF = twoBytes all
                     | ord x<=0xEF = threeBytes all
                     | otherwise   = err
  where
    twoBytes (x1:x2:xs) = chr (((ord x1 .&. 0x1F) `shift` 6) .|.
                               (ord x2 .&. 0x3F)):fromUTF xs
    twoBytes _ = error "fromUTF: illegal two byte sequence"

    threeBytes (x1:x2:x3:xs) = chr (((ord x1 .&. 0x0F) `shift` 12) .|.
                                    ((ord x2 .&. 0x3F) `shift` 6) .|.
                                    (ord x3 .&. 0x3F)):fromUTF xs
    threeBytes _ = error "fromUTF: illegal three byte sequence"

    err = error "fromUTF: illegal UTF-8 character"

Duncan

Duncan Coutts wrote:
On Mon, 2004-04-26 at 18:49, David Brown wrote:
[...]
toUTF :: String -> String
Hmmm, "String -> [Word8]" would be nicer...
fromUTF :: String -> String
... and here: "[Word8] -> String" or "[Word8] -> Maybe String". Furthermore, UTF-8 is not restricted to a maximum of 3 bytes per character. Here is an excerpt from "man utf8" on my SuSE Linux:

   * UTF-8 encoded UCS characters may be up to six bytes long, however the Unicode standard specifies no characters above 0x10ffff, so Unicode characters can only be up to four bytes long in UTF-8.

IIRC we discussed encoders/decoders quite some time ago on the libraries mailing list, but nothing really happened, which is a pity. We should strive for something more general than UTF-8 <-> UCS/Unicode; there are quite a few more widely used encodings, e.g. GSM 03.38, etc. Any takers?

Cheers,
   S.
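
For illustration, the two suggestions in this message (byte-typed signatures and support for four-byte sequences) could be combined roughly as follows. This is only a sketch, not code from gtk2hs or any other library: the names encodeUtf8 and decodeUtf8 are made up here, and the decoder is assumed to reject overlong forms, surrogates, and anything above 0x10FFFF by returning Nothing.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Encode code points as UTF-8 bytes, one to four bytes per character.
-- Assumption: surrogate code points are not filtered on the encoding side.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap (map fromIntegral . encodeChar . ord)
  where
    encodeChar :: Int -> [Int]
    encodeChar c
      | c <= 0x7F   = [c]
      | c <= 0x7FF  = [0xC0 .|. (c `shiftR` 6), cont c]
      | c <= 0xFFFF = [0xE0 .|. (c `shiftR` 12), cont (c `shiftR` 6), cont c]
      | otherwise   = [ 0xF0 .|. (c `shiftR` 18), cont (c `shiftR` 12)
                      , cont (c `shiftR` 6), cont c ]
    -- Continuation byte: 10xxxxxx carrying the low six bits.
    cont x = 0x80 .|. (x .&. 0x3F)

-- Decode UTF-8 bytes; Nothing signals malformed input instead of calling error.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just []
decodeUtf8 (b:bs)
  | b <= 0x7F              = fmap (chr (fromIntegral b) :) (decodeUtf8 bs)
  | b >= 0xC2 && b <= 0xDF = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b >= 0xE0 && b <= 0xEF = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b >= 0xF0 && b <= 0xF4 = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise              = Nothing  -- stray continuation or invalid lead byte
  where
    isCont x = x .&. 0xC0 == 0x80

    -- n continuation bytes, payload bits of the lead byte, and the smallest
    -- code point that legitimately needs this sequence length.
    multi :: Int -> Int -> Int -> Maybe String
    multi n lead minCode
      | length conts == n
      , all isCont conts
      , code >= minCode                  -- reject overlong encodings
      , code <= 0x10FFFF                 -- stay inside the Unicode range
      , code < 0xD800 || code > 0xDFFF   -- reject surrogates
      = fmap (chr code :) (decodeUtf8 rest)
      | otherwise = Nothing
      where
        (conts, rest) = splitAt n bs
        code = foldl (\acc x -> (acc `shiftL` 6) .|. (fromIntegral x .&. 0x3F))
                     lead conts
```

With these definitions, decodeUtf8 (encodeUtf8 s) should give back Just s for any string without surrogate code points, while truncated input such as [0xE2, 0x82] yields Nothing.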

On Mon, Apr 26, 2004 at 08:33:38PM +0200, Sven Panne wrote:
Duncan Coutts wrote:
On Mon, 2004-04-26 at 18:49, David Brown wrote:
[...]
toUTF :: String -> String
Hmmm, "String -> [Word8]" would be nicer...
fromUTF :: String -> String
... and here: "[Word8] -> String" or "[Word8] -> Maybe String".
Except that I would then have to come up with my own IO routines to read and write UTF data. With both sides as string, it is easy to just filter input and output of files.

Dave
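
A rough sketch of the "both sides as String" approach described here, where each Char holds exactly one byte: the helper names below are hypothetical, and the files are opened in binary mode on the assumption that a current GHC would otherwise apply the locale's text encoding on its own. The two conversion functions show how a [Word8]-based codec could still be slotted in behind a String interface.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)
import System.IO (IOMode (..), hGetContents, hPutStr, withBinaryFile)

-- View raw bytes as a String with one Char (0..255) per byte.
bytesToString :: [Word8] -> String
bytesToString = map (chr . fromIntegral)

-- Inverse view; assumes every Char is in the range 0..255.
stringToBytes :: String -> [Word8]
stringToBytes = map (fromIntegral . ord)

-- Read a file as a byte-per-Char String, without any text decoding.
readByteString :: FilePath -> IO String
readByteString path =
  withBinaryFile path ReadMode $ \h -> do
    s <- hGetContents h
    -- Force the whole file before the handle is closed.
    length s `seq` return s

-- Write a byte-per-Char String back out unchanged.
writeByteString :: FilePath -> String -> IO ()
writeByteString path s = withBinaryFile path WriteMode (\h -> hPutStr h s)
```

A String -> String decoder such as fromUTF can then be applied directly to the result of readByteString, and the output of toUTF passed to writeByteString.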

On 20040426T104946-0700, David Brown wrote:
Is anyone aware of any Haskell libraries for doing UTF-8 decoding and encoding? If not, I'll write something simple.
I wrote a simple Unicode library for my MSc project a couple of years ago. It might not compile with recent GHC, but you can have a look at http://savannah.nongnu.org/cgi-bin/viewcvs/ebba/ebba-h/ebba-unicode/

--
Antti-Juhani Kaijanaho, FM (MSc), http://www.mit.jyu.fi/antkaij/
ohjelmistotekniikan assistentti * assistant in software engineering
Jyväskylän yliopisto * University of Jyväskylä
Tietotekniikan laitos * Dept. of Mathematical Inf. Tech.
participants (4)
- Antti-Juhani Kaijanaho
- David Brown
- Duncan Coutts
- Sven Panne