Re: Unicode support

Ashley Yakeley

9 Oct 2001

10:27 a.m.

At 2001-10-09 02:58, Kent Karlsson wrote:

...

In summary:

code position (=code point): a value between 0000 and 10FFFF.

Would this be a reasonable basis for Haskell's 'Char' type?

At some point perhaps there should be a 'Unicode' standard library for Haskell. For instance:

    encodeUTF8 :: String -> [Word8];
    decodeUTF8 :: [Word8] -> Maybe String;
    encodeUTF16 :: String -> [Word16];
    decodeUTF16 :: [Word16] -> Maybe String;

    data GeneralCategory = Letter_Uppercase | Letter_Lowercase | ...
    getGeneralCategory :: Char -> Maybe GeneralCategory;

    ...sorting & searching...
    ...canonicalisation...

etc. Lots of work for someone.

--
Ashley Yakeley, Seattle WA
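To make the proposed `encodeUTF8` and `decodeUTF8` signatures concrete, here is a minimal sketch of how they might be written over lists. It is not an implementation from this thread. It assumes that `Char` covers 0000 to 10FFFF, that the encoder stays total (it does not reject surrogate code points), and that the decoder returns `Nothing` for truncated, overlong, surrogate, or out-of-range input. The helper names `encodeChar`, `cont`, and `multi` are illustrative only.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Encode each code point as 1 to 4 bytes. Surrogate code points are not
-- rejected here, so the function stays total as in the proposed signature.
encodeUTF8 :: String -> [Word8]
encodeUTF8 = concatMap (map fromIntegral . encodeChar . ord)
  where
    encodeChar :: Int -> [Int]
    encodeChar c
      | c < 0x80    = [c]
      | c < 0x800   = [0xC0 .|. (c `shiftR` 6), cont c]
      | c < 0x10000 = [0xE0 .|. (c `shiftR` 12), cont (c `shiftR` 6), cont c]
      | otherwise   = [ 0xF0 .|. (c `shiftR` 18), cont (c `shiftR` 12)
                      , cont (c `shiftR` 6), cont c ]
    -- Continuation byte: 10xxxxxx carrying the low six bits.
    cont :: Int -> Int
    cont x = 0x80 .|. (x .&. 0x3F)

-- Decode a complete byte sequence, or return Nothing if any part is malformed.
decodeUTF8 :: [Word8] -> Maybe String
decodeUTF8 [] = Just []
decodeUTF8 (b : bs)
  | b < 0x80               = (chr (fromIntegral b) :) <$> decodeUTF8 bs
  | b >= 0xC2 && b <= 0xDF = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b >= 0xE0 && b <= 0xEF = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b >= 0xF0 && b <= 0xF4 = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise              = Nothing
  where
    -- n continuation bytes, the payload bits of the lead byte, and the
    -- smallest value that legitimately needs this sequence length.
    multi :: Int -> Int -> Int -> Maybe String
    multi n lead minValue =
      let (tailBytes, rest) = splitAt n bs
          value = foldl (\acc t -> (acc `shiftL` 6) .|. (fromIntegral t .&. 0x3F))
                        lead tailBytes
          valid = length tailBytes == n
               && all (\t -> t .&. 0xC0 == 0x80) tailBytes
               && value >= minValue                          -- no overlong forms
               && value <= 0x10FFFF                          -- within code space
               && not (value >= 0xD800 && value <= 0xDFFF)   -- no surrogates
      in if valid then (chr value :) <$> decodeUTF8 rest else Nothing
```

Under these assumptions, `decodeUTF8 (encodeUTF8 s)` gives back `Just s` for any string without surrogate code points. A production library would more likely work on packed byte arrays than on `[Word8]`.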


Kent Karlsson

9 Oct 2001

10:37 a.m.


----- Original Message -----
From: "Ashley Yakeley"
To: "Kent Karlsson"; "Haskell List"; "Libraries for Haskell List"
Sent: Tuesday, October 09, 2001 12:27 PM
Subject: Re: Unicode support

...

At 2001-10-09 02:58, Kent Karlsson wrote:

...
In summary:

code position (=code point): a value between 0000 and 10FFFF.

Would this be a reasonable basis for Haskell's 'Char' type?

Yes. It's essentially UTF-32, but without the fixation on 32 bits (21 bits suffice). UTF-32 (a.k.a. UCS-4 in 10646, which has yet to be limited to 10FFFF instead of 31(!) bits) is the datatype used in some implementations of C for wchar_t. As I said in another e-mail, if one does not have high efficiency concerns, UTF-32 is a rather straightforward way of representing characters.
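The "21 bits suffice" remark can be checked with a little arithmetic. The sketch below is only an illustration: `maxCodePoint`, `bitsNeeded`, `toChar`, and `charRangeMatches` are hypothetical names, and the last one assumes a present-day GHC, where `Char` does range over 0000 to 10FFFF.

```haskell
import Data.Char (chr, ord)

-- Largest code position in the range discussed above.
maxCodePoint :: Int
maxCodePoint = 0x10FFFF

-- Number of bits needed to represent a non-negative value.
bitsNeeded :: Int -> Int
bitsNeeded n = length (takeWhile (> 0) (iterate (`div` 2) n))

-- Safe constructor: Nothing for values outside 0000..10FFFF.
toChar :: Int -> Maybe Char
toChar n
  | n >= 0 && n <= maxCodePoint = Just (chr n)
  | otherwise                   = Nothing

-- True when the compiler's Char type has exactly this upper bound.
charRangeMatches :: Bool
charRangeMatches = ord (maxBound :: Char) == maxCodePoint
```

Here `bitsNeeded maxCodePoint` evaluates to 21, since 2^20 <= 10FFFF (hex) < 2^21.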

...

At some point perhaps there should be a 'Unicode' standard library for Haskell. For instance:

    encodeUTF8 :: String -> [Word8];
    decodeUTF8 :: [Word8] -> Maybe String;
    encodeUTF16 :: String -> [Word16];
    decodeUTF16 :: [Word16] -> Maybe String;

    data GeneralCategory = Letter_Uppercase | Letter_Lowercase | ...
    getGeneralCategory :: Char -> Maybe GeneralCategory;

There is not really any "Maybe" there. As-yet-unallocated code positions have general category Cn (as do non-characters):

    Cs  Other, Surrogate
    Co  Other, Private Use
    Cn  Other, Not Assigned (yet)
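For comparison, the `Data.Char` module in present-day GHC `base` has a total function of this shape, `generalCategory :: Char -> GeneralCategory`, with no `Maybe`. The sketch below only queries it. The constructor names are those of `base`, not the `Letter_Uppercase` style proposed in this thread, and `examples` and `isAllocated` are hypothetical helpers.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- One sample character per category mentioned above, plus two letters.
examples :: [(Char, GeneralCategory)]
examples =
  [ (c, generalCategory c)
  | c <- ['A', 'a', '\xD800', '\xE000', '\xFFFF']
  ]

-- Cn covers both unallocated positions and non-characters such as U+FFFF.
isAllocated :: Char -> Bool
isAllocated c = generalCategory c /= NotAssigned
```

With this API the surrogate U+D800 reports `Surrogate` (Cs), the private-use position U+E000 reports `PrivateUse` (Co), and the non-character U+FFFF reports `NotAssigned` (Cn), so every `Char` has a category.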

...

...sorting & searching...

...canonicalisation...

etc. Lots of work for someone.

Yes. And it is lots of work (which is why I'm not volunteering to make a quick fix: there is no quick fix).

Kind regards
/kent k

John Meacham

9:59 p.m.


On Tue, Oct 09, 2001 at 12:37:27PM +0200, Kent Karlsson wrote:

...

...
At 2001-10-09 02:58, Kent Karlsson wrote:

...
In summary:

code position (=code point): a value between 0000 and 10FFFF.

Would this be a reasonable basis for Haskell's 'Char' type?

Yes. It's essentially UTF-32, but without the fixation on 32 bits (21 bits suffice). UTF-32 (a.k.a. UCS-4 in 10646, which has yet to be limited to 10FFFF instead of 31(!) bits) is the datatype used in some implementations of C for wchar_t. As I said in another e-mail, if one does not have high efficiency concerns, UTF-32 is a rather straightforward way of representing characters.

I think that space-efficiency concerns are probably moot anyway, since Chars would most likely be represented by possibly evaluated thunks, which I can't imagine being smaller than a pointer in general. So for Haskell the simplification of UTF-32 is most likely worth it. If space efficiency is a concern, then I imagine people would want to use mutable arrays of bytes or words (perhaps mmap'ed from a file) rather than Haskell lists of Chars.

...

...
At some point perhaps there should be a 'Unicode' standard library for Haskell. For instance:

    encodeUTF8 :: String -> [Word8];
    decodeUTF8 :: [Word8] -> Maybe String;
    encodeUTF16 :: String -> [Word16];
    decodeUTF16 :: [Word16] -> Maybe String;

    data GeneralCategory = Letter_Uppercase | Letter_Lowercase | ...
    getGeneralCategory :: Char -> Maybe GeneralCategory;

There is not really any "Maybe" there. As-yet-unallocated code positions have general category Cn (as do non-characters):

    Cs  Other, Surrogate
    Co  Other, Private Use
    Cn  Other, Not Assigned (yet)

...
...sorting & searching...

...canonicalisation...

etc. Lots of work for someone.

Yes. And it is lots of work (which is why I'm not volunteering to make a quick fix: there is no quick fix).

I think a canonical way to get at iconv's functionality ('man 3 iconv' for info) in one of the standard libraries would be great. Perhaps I will have a go at it. Even if the underlying platform does not have iconv, some basic conversions (utf8, utf16, latin1, [Char]) could easily be provided with the same API and minimal implementation effort.

John

--
---------------------------------------------------------------------------
John Meacham - California Institute of Technology, Alum. - john@repetae.net
---------------------------------------------------------------------------
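As a rough illustration of the "same API, minimal implementation" idea, here is a pure Haskell sketch of a small conversion interface that needs no iconv. It is not a binding to iconv and not an API from this thread. The `Encoding` type and the `encodeWith` and `encodeChar` names are hypothetical. Only two targets are shown, Latin-1 and big-endian UTF-16, and conversion fails with `Nothing` when a character cannot be represented.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical set of target encodings; a real binding would more likely
-- pass encoding names through to iconv.
data Encoding = Latin1 | UTF16BE
  deriving (Eq, Show)

-- Convert a String to bytes, or Nothing if some character is not representable.
encodeWith :: Encoding -> String -> Maybe [Word8]
encodeWith enc = fmap concat . mapM (encodeChar enc . ord)

encodeChar :: Encoding -> Int -> Maybe [Word8]
encodeChar Latin1 c
  | c <= 0xFF = Just [fromIntegral c]     -- Latin-1 is the first 256 positions
  | otherwise = Nothing
encodeChar UTF16BE c
  | c >= 0xD800 && c <= 0xDFFF = Nothing  -- lone surrogates are not encodable
  | c < 0x10000 = Just (unit c)           -- one 16-bit code unit
  | otherwise =
      let v  = c - 0x10000
          hi = 0xD800 + (v `shiftR` 10)   -- high (leading) surrogate
          lo = 0xDC00 + (v .&. 0x3FF)     -- low (trailing) surrogate
      in Just (unit hi ++ unit lo)
  where
    -- Split a 16-bit code unit into two bytes, most significant first.
    unit u = [fromIntegral (u `shiftR` 8), fromIntegral (u .&. 0xFF)]
```

For example, `encodeWith Latin1 "\x20AC"` is `Nothing` because the euro sign lies outside Latin-1, while `encodeWith UTF16BE "A"` is `Just [0x00, 0x41]`. Decoding and further encodings could be added as extra cases behind the same two functions.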
