Re: Unicode support

9 Oct 2001


      ----- Original Message -----
From: "Ashley Yakeley" 
To: "Kent Karlsson" ; "Haskell List" ; "Libraries for Haskell List"

Sent: Tuesday, October 09, 2001 12:27 PM
Subject: Re: Unicode support
...
At 2001-10-09 02:58, Kent Karlsson wrote:
...
In summary:
code position (=code point): a value between 0000 and 10FFFF.
Would this be a reasonable basis for Haskell's 'Char' type?
Yes.  It's essentially UTF-32, but without the fixation to 32-bit
(21 bits suffice). UTF-32 (a.k.a. UCS-4 in 10646, yet to be limited
to 10FFFF instead of 31(!) bits) is the datatype used in some
implementations of C for wchar_t.  As I said in another e-mail,
if one does not have high efficiency concerns, UTF-32 is a rather
straightforward way of representing characters.
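
As a rough illustration of the code position range described above, here is a minimal sketch assuming a present-day GHC, where `Char` covers 0000 to 10FFFF. The names `maxCodePoint` and `toCodePoint` are hypothetical helpers, not part of any library discussed in this thread.

```haskell
import Data.Char (chr)

-- Largest Unicode code position (hex 10FFFF); it fits in 21 bits.
maxCodePoint :: Int
maxCodePoint = 0x10FFFF

-- Hypothetical helper: turn an integer into a Char only if it lies
-- in the code position range 0000..10FFFF.
toCodePoint :: Int -> Maybe Char
toCodePoint n
  | n >= 0 && n <= maxCodePoint = Just (chr n)
  | otherwise                   = Nothing
```

For instance, `toCodePoint 0x41` gives `Just 'A'`, while `toCodePoint 0x110000` gives `Nothing`.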
...
At some point
perhaps there should be a 'Unicode' standard library for Haskell. For
instance:
encodeUTF8 :: String -> [Word8];
decodeUTF8 :: [Word8] -> Maybe String;
encodeUTF16 :: String -> [Word16];
decodeUTF16 :: [Word16] -> Maybe String;
data GeneralCategory = Letter_Uppercase | Letter_Lowercase | ...
getGeneralCategory :: Char -> Maybe GeneralCategory;
There is not really any need for "Maybe" there.  As-yet-unallocated
code positions have general category Cn (as do non-characters):
      Cs Other, Surrogate
      Co Other, Private Use
      Cn Other, Not Assigned (yet)
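
For comparison, the `Data.Char` module in GHC's present-day base library exposes a total `generalCategory :: Char -> GeneralCategory`, with no `Maybe` in the result. The sketch below maps the three categories listed above to their two-letter abbreviations; `otherCategoryCode` is a hypothetical helper name introduced only for illustration.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Hypothetical helper: give the two-letter abbreviation for the three
-- "Other" categories listed above, and Nothing for every other category.
-- Note that generalCategory itself is total: every Char has a category.
otherCategoryCode :: Char -> Maybe String
otherCategoryCode c = case generalCategory c of
  Surrogate   -> Just "Cs"
  PrivateUse  -> Just "Co"
  NotAssigned -> Just "Cn"
  _           -> Nothing
```

A non-character such as U+FFFF is reported as `NotAssigned`, matching the remark that non-characters also have category Cn.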
...
...sorting & searching...
...canonicalisation...
etc. Lots of work for someone.
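
To make the proposed `encodeUTF8` and `decodeUTF8` signatures concrete, here is a minimal sketch, not a reference implementation. It assumes a present-day GHC with 21-bit `Char`. The decoder rejects truncated, overlong, surrogate and out-of-range sequences; the encoder does not reject surrogate code positions, which a strict one would.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Encode each code position as 1 to 4 bytes.
-- Assumption: surrogates (D800..DFFF) are passed through unchecked.
encodeUTF8 :: String -> [Word8]
encodeUTF8 = concatMap (encodeChar . ord)
  where
    encodeChar c
      | c < 0x80    = [fromIntegral c]
      | c < 0x800   = [lead 0xC0 6, cont 0]
      | c < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
      | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
      where
        -- Lead byte: length tag plus the top bits of the code position.
        lead tag s = fromIntegral (tag .|. (c `shiftR` s))
        -- Continuation byte: 10xxxxxx carrying six payload bits.
        cont s     = fromIntegral (0x80 .|. ((c `shiftR` s) .&. 0x3F))

-- Decode a byte sequence, returning Nothing on any malformed input.
decodeUTF8 :: [Word8] -> Maybe String
decodeUTF8 [] = Just []
decodeUTF8 (b : bs)
  | b < 0x80  = (chr (fromIntegral b) :) <$> decodeUTF8 bs
  | b < 0xC0  = Nothing                        -- stray continuation byte
  | b < 0xE0  = multi 1 0x80    (b .&. 0x1F)
  | b < 0xF0  = multi 2 0x800   (b .&. 0x0F)
  | b < 0xF8  = multi 3 0x10000 (b .&. 0x07)
  | otherwise = Nothing                        -- F8..FF never start a sequence
  where
    -- n continuation bytes expected; minValue rejects overlong forms.
    multi n minValue leadBits =
      let (cs, rest) = splitAt n bs
          isCont x   = x .&. 0xC0 == 0x80
          value      = foldl (\acc x -> (acc `shiftL` 6) .|. fromIntegral (x .&. 0x3F))
                             (fromIntegral leadBits) cs :: Int
      in if length cs == n && all isCont cs
            && value >= minValue && value <= 0x10FFFF
            && not (value >= 0xD800 && value <= 0xDFFF)
           then (chr value :) <$> decodeUTF8 rest
           else Nothing
```

The `Maybe` in the decoder's result is what reports malformed input, for example the overlong sequence `[0xC0, 0x80]`.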
Yes.  And it is lots of work (which is why I'm not volunteering
to make a quick fix: there is no quick fix).

        Kind regards
        /kent k