
At 2001-10-09 03:37, Kent Karlsson wrote:
code position (=code point): a value between 0000 and 10FFFF.
Would this be a reasonable basis for Haskell's 'Char' type?
Yes. It's essentially UTF-32, but without the fixation on 32 bits (21 bits suffice). UTF-32 (a.k.a. UCS-4 in 10646, yet to be limited to 10FFFF instead of 31(!) bits) is the datatype used in some implementations of C for wchar_t. As I said in another e-mail, if one does not have high efficiency concerns, UTF-32 is a rather straightforward way of representing characters.
Would it be worthwhile restricting Char to the 0-10FFFF range, just as a Word8 is restricted to 0-FF even though in GHC at least it's stored 32-bit? ...
data GeneralCategory = Letter_Uppercase | Letter_Lowercase | ...
getGeneralCategory :: Char -> Maybe GeneralCategory;
There is not really any "Maybe" just there. Code positions that are not yet allocated have general category Cn (so do non-characters):
Cs  Other, Surrogate
Co  Other, Private Use
Cn  Other, Not Assigned (yet)
OK. It occurred to me to put 'unassigned' as Nothing, since it might change -- so in a sense getGeneralCategory doesn't know what the GC is. I assume that once a code point has a non-Cn GC, it cannot be changed. But confusingly, some of the GCs are 'normative', whereas others are merely 'informative' -- perhaps these last are subject to revision.
-- Ashley Yakeley, Seattle WA
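For comparison, here is a minimal sketch of the two designs discussed above, written against the Data.Char module that ships with present-day GHC's base library (which postdates this thread). There, generalCategory is total and reports Cs, Co and Cn as ordinary constructors, with no Maybe. The wrapper assignedCategory is a hypothetical name, introduced only to illustrate the Maybe-returning variant; it is not a library function.

```haskell
import Data.Char (GeneralCategory (..), chr, generalCategory)

-- Hypothetical wrapper (not part of any library): the Maybe-returning
-- variant, where an unassigned code point (Cn) is reported as Nothing.
assignedCategory :: Char -> Maybe GeneralCategory
assignedCategory c = case generalCategory c of
  NotAssigned -> Nothing
  gc          -> Just gc

-- A few sample code points paired with their general category:
-- an uppercase letter, a lowercase letter, a surrogate (Cs),
-- a private-use code point (Co) and the non-character U+FFFF (Cn).
samples :: [(Char, GeneralCategory)]
samples =
  [ (c, generalCategory c)
  | c <- ['A', 'a', chr 0xD800, chr 0xE000, chr 0xFFFF]
  ]
```

With the total function, a caller that cares about unassigned code points simply matches on NotAssigned; the wrapper adds nothing beyond that, which is roughly Kent's point.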

On Tue, 9 Oct 2001, Ashley Yakeley wrote:
Would it be worthwhile restricting Char to the 0-10FFFF range, just as a Word8 is restricted to 0-FF even though in GHC at least it's stored 32-bit?
It is thus restricted in GHC. I think it's a good compromise between 32-bit-Unicode and 16-bit-Unicode camps :-)
-- Marcin 'Qrczak' Kowalczyk
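The restriction can be observed directly, assuming a present-day GHC: maxBound for Char is code point 10FFFF, and chr raises an error outside 0-10FFFF. The helper safeChr below is a hypothetical name for a total conversion, added only as an illustration.

```haskell
import Data.Char (chr, ord)

-- Largest code point a Char can hold; in GHC this is 0x10FFFF.
maxCodePoint :: Int
maxCodePoint = ord (maxBound :: Char)

-- Hypothetical helper (not a library function): a total version of chr
-- that returns Nothing for values outside the Char range instead of
-- raising an error.
safeChr :: Int -> Maybe Char
safeChr n
  | n >= 0 && n <= maxCodePoint = Just (chr n)
  | otherwise                   = Nothing
```

Note that this range still includes the surrogate code points D800-DFFF; the restriction is on the numeric range only, not on which code points are characters.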