How does GHC read UNICODE.

L.Guo

20 May 2008

2:32 a.m.

Hi Haskellers:

I am Chinese.

Mostly, I need to read and write Unicode characters.

Currently, I can only use the ByteString module in GHC 6, 2007. But I feel it is not an easy method.

Does GHC support it now? Or is there any other way to do this?

Regards
--------------
L.Guo
2008-05-20


Don Stewart

20 May

2:43 a.m.

leaveye.guo:

...

Hi Haskellers:

I am Chinese.

Mostly, I need to read and write Unicode characters.

Currently, I can only use the ByteString module in GHC 6, 2007. But I feel it is not an easy method.

Does GHC support it now? Or is there any other way to do this?

Regards

Hello!

Chars in Haskell are 32-bit-wide values, so they can hold any Unicode code point. The main issue then is doing IO with Chars. You can use either bytestrings, which will ignore any encoding, or the utf8-string package for Strings, which will properly encode and decode UTF-8 values to Char.

http://hackage.haskell.org/cgi-bin/hackage-scripts/package/utf8-string

-- Don
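The utf8-string route described above could be sketched roughly as follows. This assumes the utf8-string and bytestring packages are installed; `encode` and `decode` come from Codec.Binary.UTF8.String, while the helper names `writeFileUtf8` and `readFileUtf8` are invented for this illustration and are not part of any library.

```haskell
import qualified Data.ByteString as B
import Codec.Binary.UTF8.String (encode, decode)

-- | Hypothetical helper: write a String to a file as UTF-8 bytes.
-- 'encode' turns each Char into one to four Word8 values.
writeFileUtf8 :: FilePath -> String -> IO ()
writeFileUtf8 path = B.writeFile path . B.pack . encode

-- | Hypothetical helper: read a UTF-8 file back into a String.
-- 'decode' reassembles the bytes into Unicode code points.
readFileUtf8 :: FilePath -> IO String
readFileUtf8 path = fmap (decode . B.unpack) (B.readFile path)
```

The ByteString layer only moves raw bytes to and from disk; all knowledge of the encoding lives in `encode` and `decode`, which is the division of labour the message describes.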

Ketil Malde

7:30 a.m.

Don Stewart writes:

...

You can use either bytestrings, which will ignore any encoding,

Uh, I am hesitant to voice my protest here, but I think this bears some elaboration:

Bytestrings are exactly that, strings of bytes. There are basically two interfaces, one (Data.ByteString[.Lazy]), which operates on raw bytes (and gives you Word8s), and another (Data.ByteString[.Lazy].Char8), which treats the contents as Chars. The latter will only deal with Unicode code points 0..255 (or ISO_8859-1) -- and truncate higher code point values to fit this range.

Basically, bytestrings are the wrong tool for the job if you need more than 8 bits per character.

I think the predecessors of bytestring (FPS?) had support for other fixed-size encodings, that is, two-byte and four-byte characters.

Perhaps writing a Data.Word16String bytestrings-alike using UCS-2 would be an option?

-k
--
If I haven't seen further, it is by standing in the footprints of giants
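The truncation described above can be observed with a small sketch, assuming only the bytestring package. The function names `char8Bytes` and `char8RoundTrip` are made up for this example.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
import Data.Char (ord)
import Data.Word (Word8)

-- | Bytes actually stored when a String goes through the Char8 interface.
-- Each Char is cut down to its low 8 bits.
char8Bytes :: String -> [Word8]
char8Bytes = B.unpack . C.pack

-- | Code points that come back out after a Char8 round trip.
-- Values 0..255 survive; anything higher is silently changed.
char8RoundTrip :: String -> [Int]
char8RoundTrip = map ord . C.unpack . C.pack
```

For example, `char8Bytes "\x4E2D"` gives `[45]`: only the low byte 0x2D of U+4E2D survives, so the Chinese character comes back as an ASCII hyphen.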

Duncan Coutts

11:03 a.m.

On Tue, 2008-05-20 at 09:30 +0200, Ketil Malde wrote:

...

Don Stewart writes:

...
You can use either bytestrings, which will ignore any encoding,

Uh, I am hesitant to voice my protest here, but I think this bears some elaboration:

Bytestrings are exactly that, strings of bytes.

Yes, we tried to make it explicit.

...

Basically, bytestrings are the wrong tool for the job if you need more than 8 bits per character.

Right. It's not intended for text, except for those 8-bit mixed binary ASCII network protocols, file formats etc.

...

I think the predecessors of bytestring (FPS?) had support for other fixed-size encodings, that is, two-byte and four-byte characters.

I'm not sure about that, but there is the old Data.PackedString which uses UTF-32. There is no fixed size two-byte Unicode encoding (there is only UTF-16 which is variable width.)
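To illustrate why UTF-16 is variable width, here is a rough sketch of the standard surrogate-pair calculation using only the base library. The name `utf16Units` is hypothetical, and the sketch assumes its argument is not itself a surrogate code point.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- | UTF-16 code units for a single code point.
-- Code points below U+10000 take one 16-bit unit; the rest take two.
utf16Units :: Char -> [Word16]
utf16Units c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))  -- high surrogate
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]  -- low surrogate
  where
    n = ord c
    m = n - 0x10000
```

A character such as U+4E2D fits in one unit, while `utf16Units '\x1D11E'` yields the two units 0xD834 and 0xDD1E, so a fixed two-byte array cannot index characters directly.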

...

Perhaps writing a Data.Word16String bytestrings-alike using UCS-2 would be an option?

I'm supervising a master's student who is working on a proper Unicode ADT with a similar API and underlying implementation to that of ByteString. Hopefully people will be able to start using that for an internal representation of text instead of ByteString.

Duncan

Olivier Boudry

1:11 p.m.

On Mon, May 19, 2008 at 10:32 PM, L.Guo wrote:

...

Does GHC support it now? Or is there any other way to do this?

Hi,

The following blog article may help:

http://blog.kfish.org/2007/10/survey-haskell-unicode-support.html

It's a comparison of the different libraries for dealing with Unicode in Haskell (utf8 in source, iconv, utf8-string, encoding).

Best regards,

Olivier.
