New subject: Unicode support

9 Oct 2001


      At 2001-10-09 03:37, Kent Karlsson wrote:
...
...
...
>> code position (=code point): a value between 0000 and 10FFFF.
>> Would this be a reasonable basis for Haskell's 'Char' type?
>
> Yes.  It's essentially UTF-32, but without the fixation to 32-bit
> (21 bits suffice). UTF-32 (a.k.a. UCS-4 in 10646, yet to be limited
> to 10FFFF instead of 31(!) bits) is the datatype used in some
> implementations of C for wchar_t.  As I said in another e-mail,
> if one does not have high efficiency concerns, UTF-32 is a rather
> straightforward way of representing characters.

Would it be worthwhile restricting Char to the 0-10FFFF range, just as a 
Word8 is restricted to 0-FF even though, in GHC at least, it is stored in 
32 bits?
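
To make the question concrete, here is a minimal sketch of what a range-restricted conversion could look like. The names maxCodePoint and toCharChecked are hypothetical and only for illustration; this is not how any particular compiler implements Char.

```haskell
import Data.Char (chr)

-- Upper limit of the 0..10FFFF range discussed above.
maxCodePoint :: Int
maxCodePoint = 0x10FFFF

-- Hypothetical checked conversion: values outside 0..10FFFF give
-- Nothing, in the same spirit as Word8 only admitting 0..FF.
toCharChecked :: Int -> Maybe Char
toCharChecked n
  | n >= 0 && n <= maxCodePoint = Just (chr n)
  | otherwise                   = Nothing
```

With this, toCharChecked 0x41 yields Just 'A', while toCharChecked 0x110000 yields Nothing regardless of how wide the underlying storage is.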

...
...
...
>> data GeneralCategory = Letter_Uppercase | Letter_Lowercase | ...
>> getGeneralCategory :: Char -> Maybe GeneralCategory;
>
> There is not really any "Maybe" just there.  Yet unallocated code
> positions have general category Cn (so do non-characters):
>
>      Cs Other, Surrogate
>      Co Other, Private Use
>      Cn Other, Not Assigned (yet)

OK. It occurred to me to put 'unassigned' as Nothing, since it might 
change -- so in a sense getGeneralCategory doesn't know what the GC is. I 
assume that once a codepoint has a non-Cn GC, it cannot be changed. But 
confusingly, some of the GCs are 'normative', whereas others are merely 
'informative' -- perhaps these last are subject to revision.
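
The two designs can be put side by side in a rough sketch. It assumes the generalCategory function and GeneralCategory type from Data.Char in today's base library, which postdate this message and spell the constructors differently from Letter_Uppercase above. The name assignedCategory is hypothetical.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- The total design: generalCategory always returns a category, and
-- Cn (NotAssigned) is an ordinary result like Cs or Co.
--
-- The Maybe design, recovered as a thin wrapper: treat Cn as
-- "no category known yet" and report it as Nothing.
assignedCategory :: Char -> Maybe GeneralCategory
assignedCategory c = case generalCategory c of
  NotAssigned -> Nothing
  gc          -> Just gc
```

Surrogates and private-use code points still come back as Just Surrogate and Just PrivateUse here; only Cn, which also covers non-characters such as U+FFFF, maps to Nothing.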

-- 
Ashley Yakeley, Seattle WA
