Re: [Haskell-i18n] Surrogate pairs?

At 2002-08-20 04:34, Sven Moritz Hallberg wrote:
I see. I find it pretty inconvenient to read the incremental changes in the different Unicode revisions.
Me too. I have the big blue book (Unicode Standard 3.0), but I have to look at the updates for 3.1 and 3.2.
I've not been able to find the exact place where they clarify the situation with surrogate pairs. I suppose what they were is now only a facet of UTF-16, is that correct?
I believe so.
Anyway, as you put it, I take it that there should never be a character composed of two Chars.
That's not quite correct. Every code point is exactly one Char, but some characters may be composed of more than one code point. For instance, 'á' might be represented as
#00E1 [LATIN SMALL LETTER A WITH ACUTE]
or
#0061 [LATIN SMALL LETTER A] + #0301 [COMBINING ACUTE ACCENT]
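To make the difference concrete, here is a small GHCi session in the style of the transcript below (the variable names are illustrative only). Without any normalization step, the two spellings are distinct Strings, with different lengths, and compare unequal:

Prelude> let precomposed = "\x00E1"
Prelude> let decomposed = "\x0061\x0301"
Prelude> (length precomposed, length decomposed)
(1,2)
Prelude> precomposed == decomposed
False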
The wording in the report about 16 bits will go, and the Int representation of Char uses Unicode scalar values.
Currently GHC restricts Chars to [0,0x10FFFF], for instance:

Prelude> toEnum 0x0061 :: Char
'a'
Prelude> toEnum 0x10FFFF :: Char
'\1114111'
Prelude> toEnum 0x110000 :: Char
*** Exception: Prelude.chr: bad argument
Prelude>

I think this is correct behaviour.

--
Ashley Yakeley, Seattle WA
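A program that wants to reject out-of-range values without triggering that exception can check the bound itself. A minimal sketch, assuming only Data.Char from the standard libraries (safeChr is a hypothetical name, not a standard function):

import Data.Char (chr)

-- Convert an Int to a Char only when it lies in the valid
-- code point range [0, 0x10FFFF]; otherwise return Nothing.
safeChr :: Int -> Maybe Char
safeChr n
  | n >= 0 && n <= 0x10FFFF = Just (chr n)
  | otherwise               = Nothing

For example, safeChr 0x0061 gives Just 'a', while safeChr 0x110000 gives Nothing.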

Ashley Yakeley wrote:
That's not quite correct. Every code point is exactly one Char, but some characters may be composed of more than one code point. For instance, 'á' might be represented as
#00E1 [LATIN SMALL LETTER A WITH ACUTE]
or
#0061 [LATIN SMALL LETTER A] + #0301 [COMBINING ACUTE ACCENT]
I guess they must be treated the same, too? That is, the length of the strings should be the same, they should compare equal, etc etc. Or is it an alternative to just ignore the issue, and simply think of the latter as two characters?

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants

On Wed, 2002-08-21 at 02:07, Ashley Yakeley wrote:
Anyway, as you put it, I take it that there should never be a character composed of two Chars.
That's not quite correct. Every code point is exactly one Char, but some characters may be composed of more than one code point. For instance, 'á' might be represented as
#00E1 [LATIN SMALL LETTER A WITH ACUTE]
or
#0061 [LATIN SMALL LETTER A] + #0301 [COMBINING ACUTE ACCENT]
Oh yes, my wording was inaccurate. I agree with what you say in your later message: these would be two different strings; separate external functions should be used to compose/decompose characters.
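Such external functions might look something like the following sketch (composeChar and decomposeChar are hypothetical names; a real implementation would consult the Unicode Character Database, whereas this toy table covers only the example above):

-- Canonical composition: combine a base character with a
-- combining mark, if a precomposed form exists.
composeChar :: Char -> Char -> Maybe Char
composeChar '\x0061' '\x0301' = Just '\x00E1'  -- a + combining acute = á
composeChar _ _               = Nothing

-- Canonical decomposition: split a precomposed character into
-- its base character and combining marks.
decomposeChar :: Char -> [Char]
decomposeChar '\x00E1' = ['\x0061', '\x0301']  -- á = a + combining acute
decomposeChar c        = [c]                   -- all other characters unchanged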
The wording in the report about 16 bits will go, and the Int representation of Char uses Unicode scalar values.
Currently GHC restricts Chars to [0,0x10FFFF], for instance:
Oh, right, I hadn't even tried that. I had just noticed that Hugs rejects anything above \255.
Prelude> toEnum 0x0061 :: Char
'a'
Prelude> toEnum 0x10FFFF :: Char
'\1114111'
Prelude> toEnum 0x110000 :: Char
*** Exception: Prelude.chr: bad argument
Prelude>
I think this is correct behaviour.
I agree. This reminds me that we have to spend some time thinking about what guarantees the report should make with respect to valid values a Char can have (think surrogates, noncharacters...).

Regards,
Sven Moritz
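The ranges involved are easy to state in code. A sketch of the relevant predicates, following the Unicode 3.x definitions (isSurrogate and isNoncharacter are hypothetical names, not standard functions):

import Data.Bits ((.&.))
import Data.Char (ord)

-- Surrogate code points (U+D800..U+DFFF) are reserved for
-- UTF-16 and are not Unicode scalar values.
isSurrogate :: Char -> Bool
isSurrogate c = ord c >= 0xD800 && ord c <= 0xDFFF

-- Noncharacters: U+FDD0..U+FDEF, plus the last two code points
-- of each of the 17 planes (U+xxFFFE and U+xxFFFF).
isNoncharacter :: Char -> Bool
isNoncharacter c = (n >= 0xFDD0 && n <= 0xFDEF) || (n .&. 0xFFFF) >= 0xFFFE
  where n = ord c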