UTF-8 encode/decode libraries. - Glasgow-haskell-users - Haskell.org


UTF-8 encode/decode libraries.


David Brown

27 Apr 2004

1:49 a.m.

I am writing some utilities to deal with UTF-8 encoded text files (not source). Currently, I'm just reading in the UTF-8 directly, and things work reasonably well: my parse tokens are ASCII, so they are easy to parse. However, the character type seems perfectly happy with larger values for each character.

Is anyone aware of any Haskell libraries for doing UTF-8 decoding and encoding? If not, I'll write something simple.

Thanks,
Dave Brown


Duncan Coutts

27 Apr

2:05 a.m.

On Mon, 2004-04-26 at 18:49, David Brown wrote:

Is anyone aware of any Haskell libraries for doing UTF-8 decoding and encoding? If not, I'll write something simple.

The gtk2hs library uses the following functions internally. Credit to Axel Simon I believe unless he swiped them from somewhere too.

-- Convert Unicode characters to UTF-8.
--
toUTF :: String -> String
toUTF [] = []
toUTF (x:xs) | ord x<=0x007F = x:toUTF xs
             | ord x<=0x07FF = chr (0xC0 .|. ((ord x `shift` (-6)) .&. 0x1F)):
                               chr (0x80 .|. (ord x .&. 0x3F)):
                               toUTF xs
             | otherwise     = chr (0xE0 .|. ((ord x `shift` (-12)) .&. 0x0F)):
                               chr (0x80 .|. ((ord x `shift` (-6)) .&. 0x3F)):
                               chr (0x80 .|. (ord x .&. 0x3F)):
                               toUTF xs

-- Convert UTF-8 to Unicode.
--
fromUTF :: String -> String
fromUTF [] = []
fromUTF (all@(x:xs)) | ord x<=0x7F = x:fromUTF xs
                     | ord x<=0xBF = err
                     | ord x<=0xDF = twoBytes all
                     | ord x<=0xEF = threeBytes all
                     | otherwise   = err
  where
    twoBytes (x1:x2:xs) = chr (((ord x1 .&. 0x1F) `shift` 6) .|.
                               (ord x2 .&. 0x3F)):fromUTF xs
    twoBytes _ = error "fromUTF: illegal two byte sequence"

    threeBytes (x1:x2:x3:xs) = chr (((ord x1 .&. 0x0F) `shift` 12) .|.
                                    ((ord x2 .&. 0x3F) `shift` 6) .|.
                                    (ord x3 .&. 0x3F)):fromUTF xs
    threeBytes _ = error "fromUTF: illegal three byte sequence"

    err = error "fromUTF: illegal UTF-8 character"

Duncan


Sven Panne

2:33 a.m.

Duncan Coutts wrote:

On Mon, 2004-04-26 at 18:49, David Brown wrote: [...] toUTF :: String -> String

Hmmm, "String -> [Word8]" would be nicer...

fromUTF :: String -> String

... and here: "[Word8] -> String" or "[Word8] -> Maybe String". Furthermore, UTF-8 is not restricted to a maximum of 3 bytes per character. Here is an excerpt from "man utf8" on my SuSE Linux:

* UTF-8 encoded UCS characters may be up to six bytes long, however the Unicode standard specifies no characters above 0x10ffff, so Unicode characters can only be up to four bytes long in UTF-8.

IIRC we discussed encoders/decoders quite some time ago on the libraries mailing list, but nothing really happened, which is a pity. We should strive for something more general than UTF-8 <-> UCS/Unicode; there are quite a few more widely used encodings, e.g. GSM 03.38. Any takers?

Cheers,
S.
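For illustration, the two suggestions in this message (byte-typed signatures and sequences of up to four bytes) could be combined roughly as follows. This is only a sketch under stated assumptions, not code from gtk2hs or from anyone in the thread: the names encodeUtf8 and decodeUtf8 are hypothetical, the decoder reports malformed input as Nothing instead of calling error, and the encoder does not reject surrogate code points.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Encode a String as UTF-8, one to four bytes per character.
-- Hypothetical name. Surrogate code points are NOT rejected here.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap (map fromIntegral . encodeChar . ord)
  where
    encodeChar :: Int -> [Int]
    encodeChar c
      | c <= 0x7F   = [c]
      | c <= 0x7FF  = [0xC0 .|. (c `shiftR` 6), cont c]
      | c <= 0xFFFF = [0xE0 .|. (c `shiftR` 12), cont (c `shiftR` 6), cont c]
      | otherwise   = [ 0xF0 .|. (c `shiftR` 18), cont (c `shiftR` 12)
                      , cont (c `shiftR` 6), cont c ]
    -- Continuation byte: 10xxxxxx carrying the low six bits.
    cont :: Int -> Int
    cont x = 0x80 .|. (x .&. 0x3F)

-- | Decode UTF-8 bytes; Nothing signals malformed input.
-- Hypothetical name. Rejects stray continuation bytes, truncated
-- sequences, overlong forms, surrogates and values above 0x10FFFF.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just []
decodeUtf8 (b:bs)
  | b < 0x80  = (chr (fromIntegral b) :) <$> decodeUtf8 bs
  | b < 0xC0  = Nothing   -- continuation byte in lead position
  | b < 0xE0  = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b < 0xF0  = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b < 0xF8  = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise = Nothing   -- five and six byte forms are not Unicode
  where
    -- n continuation bytes, payload bits of the lead byte, and the
    -- smallest code point that legitimately needs this length.
    multi :: Int -> Int -> Int -> Maybe String
    multi n lead minCode =
      let (conts, rest) = splitAt n bs
          step acc x = (acc `shiftL` 6) .|. (fromIntegral x .&. 0x3F)
          code = foldl step lead conts
          ok = length conts == n
               && all (\x -> x .&. 0xC0 == 0x80) conts
               && code >= minCode
               && code <= 0x10FFFF
               && not (code >= 0xD800 && code <= 0xDFFF)
      in if ok then (chr code :) <$> decodeUtf8 rest else Nothing
```

Returning Maybe String means the whole input is validated before any result is available, so this version gives up the laziness of the String -> String functions quoted earlier; whether that trade-off is acceptable depends on the file sizes involved.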


David Brown

3:33 a.m.

On Mon, Apr 26, 2004 at 08:33:38PM +0200, Sven Panne wrote:

Duncan Coutts wrote:

On Mon, 2004-04-26 at 18:49, David Brown wrote: [...] toUTF :: String -> String

Hmmm, "String -> [Word8]" would be nicer...

fromUTF :: String -> String

... and here: "[Word8] -> String" or "[Word8] -> Maybe String".

Except that I would then have to come up with my own IO routines to read and write UTF-8 data. With both sides as String, it is easy to just filter the input and output of files.

Dave
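The String-on-both-sides approach described here might look like the sketch below, which runs any String -> String function over a file opened in binary mode so that each Char stands for exactly one byte. The name filterBytes is hypothetical, and the sketch assumes a GHC whose System.IO provides withBinaryFile (newer than the compilers current when this thread was written).

```haskell
import System.IO

-- | Apply a String-to-String filter to a file, one Char per byte.
-- Hypothetical helper. Assumptions: src and dst are different files,
-- and the filter only produces Chars below 256, since a binary-mode
-- handle cannot represent larger values faithfully.
filterBytes :: (String -> String) -> FilePath -> FilePath -> IO ()
filterBytes f src dst =
  withBinaryFile src ReadMode $ \hIn ->
    withBinaryFile dst WriteMode $ \hOut -> do
      -- hGetContents is lazy; hPutStr consumes the whole string
      -- before returning, so hIn is still open while it is read.
      contents <- hGetContents hIn
      hPutStr hOut (f contents)
```

A byte-level decoder such as the fromUTF function quoted earlier in the thread could be passed as the filter only if its output stays within that one-byte range; otherwise the decoded String would have to be consumed in memory or written through a handle with a suitable text encoding.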


Antti-Juhani Kaijanaho

3 May

6:28 p.m.

On 20040426T104946-0700, David Brown wrote:

Is anyone aware of any Haskell libraries for doing UTF-8 decoding and encoding? If not, I'll write something simple.

I wrote a simple Unicode library for my MSc project a couple of years ago. It might not compile with recent GHC, but you can have a look at http://savannah.nongnu.org/cgi-bin/viewcvs/ebba/ebba-h/ebba-unicode/

--
Antti-Juhani Kaijanaho, FM (MSc), http://www.mit.jyu.fi/antkaij/
ohjelmistotekniikan assistentti * assistant in software engineering
Jyväskylän yliopisto * University of Jyväskylä
Tietotekniikan laitos * Dept. of Mathematical Inf. Tech.
