
On Wed, 2002-08-21 at 02:07, Ashley Yakeley wrote:
Anyway, as you put it, I take it that there should never be a character composed of two Chars.
That's not quite correct. Every code point is exactly one Char, but some characters may be composed of more than one code point. For instance, 'á' might be represented as
U+00E1 [LATIN SMALL LETTER A WITH ACUTE]
or
U+0061 [LATIN SMALL LETTER A] + U+0301 [COMBINING ACUTE ACCENT]
Oh yes, my wording was inaccurate. I agree with what you say in your later message: these would be two different strings, and separate external functions should be used to compose or decompose characters.
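To make the "two different strings" point concrete, here is a minimal sketch. The names precomposed, decomposed, sameString and lengths are made up for illustration, and it assumes an implementation that accepts Chars above 255 (GHC does, as shown further down):

```haskell
-- Two representations of the same user-perceived character, a-acute.
-- All names here are illustrative, not taken from any library.
precomposed :: String
precomposed = "\x00E1"        -- LATIN SMALL LETTER A WITH ACUTE

decomposed :: String
decomposed = "\x0061\x0301"   -- LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

-- String equality compares the Chars (code points) one by one;
-- no normalization is applied, so the two strings differ.
sameString :: Bool
sameString = precomposed == decomposed

-- One Char per code point: one for the precomposed form,
-- two for the decomposed form.
lengths :: (Int, Int)
lengths = (length precomposed, length decomposed)
```

Loaded into an interpreter, sameString evaluates to False and lengths to (1,2). Treating the two as equal would be the job of the separate compose/decompose functions mentioned above, which this sketch deliberately does not attempt.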
The wording in the report about 16 bits will go, and the Int representation of Char uses Unicode scalar values.
Currently GHC restricts Chars to [0,0x10FFFF], for instance:
Oh, right, I hadn't even tried that. I had just noticed that Hugs rejects anything above \255.
Prelude> toEnum 0x0061 :: Char
'a'
Prelude> toEnum 0x10FFFF :: Char
'\1114111'
Prelude> toEnum 0x110000 :: Char
*** Exception: Prelude.chr: bad argument
Prelude>
I think this is correct behaviour.
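For code that would rather not hit that exception, a bounds check against the Char type's own limits is enough. A minimal sketch follows; safeChr is a hypothetical helper, not a Prelude or library function, and the upper limit it ends up using is whatever maxBound :: Char is in the implementation at hand ('\1114111' in the GHC session above):

```haskell
import Data.Char (chr, ord)

-- Hypothetical helper: convert an Int to a Char only when it lies
-- within the range of the Char type, returning Nothing otherwise
-- instead of raising an exception.
safeChr :: Int -> Maybe Char
safeChr n
  | n >= ord minBound && n <= ord maxBound = Just (chr n)
  | otherwise                              = Nothing
```

Under the GHC behaviour shown above, safeChr 0x10FFFF gives Just '\1114111', while safeChr 0x110000 and safeChr (-1) give Nothing. Note that this only checks the range; it says nothing about whether a value inside the range is a code point one actually wants to allow.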
I agree. This reminds me that we have to spend some time thinking about what guarantees the report should make with respect to valid values a Char can have (think surrogates, noncharacters...).

Regards,
Sven Moritz