
At 2002-08-20 04:34, Sven Moritz Hallberg wrote:
> I see. I find it pretty inconvenient to read the incremental changes in the different Unicode revisions.
Me too. I have the big blue book (Unicode Standard 3.0), but I have to look at the updates for 3.1 and 3.2.
> I've not been able to find the exact place where they clarify the situation with surrogate pairs. I suppose what they were is now only a facet of UTF-16, is that correct?
I believe so.
> Anyway, as you put it, I take it that there should never be a character composed of two Chars.
That's not quite correct. Every code point is exactly one Char, but some characters may be composed of more than one code point. For instance, 'á' might be represented as #00E1 [LATIN SMALL LETTER A WITH ACUTE], or as #0061 [LATIN SMALL LETTER A] + #0301 [COMBINING ACUTE ACCENT].
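
To make that concrete, here is a small sketch of my own (not from the original message; the names "precomposed" and "decomposed" are made up) showing the two spellings as lists of Chars in GHC:

  import Data.Char (chr)

  precomposed, decomposed :: String
  precomposed = [chr 0x00E1]             -- LATIN SMALL LETTER A WITH ACUTE
  decomposed  = [chr 0x0061, chr 0x0301] -- LATIN SMALL LETTER A, COMBINING ACUTE ACCENT

  main :: IO ()
  main = do
    putStrLn precomposed                 -- typically renders as á on a Unicode-aware terminal
    putStrLn decomposed                  -- also renders as á: one character, two Chars
    print (precomposed == decomposed)    -- False: different code point sequences

The two strings denote the same abstract character but compare unequal, because they are different sequences of code points.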
The wording in the report about 16 bits will go, and the Int representation of Char uses Unicode scalar values.
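
As a rough illustration of what "the Int representation of Char uses Unicode scalar values" means in practice (my sketch, assuming GHC's Data.Char; not part of the original message):

  import Data.Char (chr, ord)

  main :: IO ()
  main = do
    print (ord 'a')       -- 97: the Unicode scalar value of 'a'
    print (chr 0x00E1)    -- '\225': LATIN SMALL LETTER A WITH ACUTE
    print (ord maxBound)  -- 1114111 == 0x10FFFF, the largest Char in GHC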
Currently GHC restricts Chars to [0,0x10FFFF], for instance:

  Prelude> toEnum 0x0061 :: Char
  'a'
  Prelude> toEnum 0x10FFFF :: Char
  '\1114111'
  Prelude> toEnum 0x110000 :: Char
  *** Exception: Prelude.chr: bad argument
  Prelude>

I think this is correct behaviour.

--
Ashley Yakeley, Seattle WA