Re: [Haskell-i18n] Surrogate pairs?

21 Aug 2002


      On Wed, 2002-08-21 at 02:07, Ashley Yakeley wrote:
...
...
Anyway, as you put it, I take it that there should never be a character
composed of two Chars.
That's not quite correct. Every code point is exactly one Char, but some 
characters may be composed of more than one code point. For instance, 'á' 
might be represented as
#00E1 [LATIN SMALL LETTER A WITH ACUTE]
or
#0061 [LATIN SMALL LETTER A] + #0301 [COMBINING ACUTE ACCENT]
Oh yes, my wording was inaccurate. I agree with what you say in your
later message: these would be two different strings, and separate
external functions should be used to compose/decompose characters.
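To make the point concrete, here is a minimal sketch of the two representations as plain Haskell strings. The names precomposed, decomposed, sameCodePoints and lengths are made up for illustration; the only assumption is that String equality compares Chars (code points) one by one, with no normalization.

```haskell
-- The two representations of 'á' from the example above.
-- All names here are hypothetical, chosen only for illustration.
precomposed :: String
precomposed = "\x00E1"   -- LATIN SMALL LETTER A WITH ACUTE

decomposed :: String
decomposed = "a\x0301"   -- LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

-- Plain (==) on String compares Char by Char, so the two differ.
sameCodePoints :: Bool
sameCodePoints = precomposed == decomposed

-- One Char versus two Chars.
lengths :: (Int, Int)
lengths = (length precomposed, length decomposed)
```

Any notion of the two being "the same character" would have to come from a separate normalization function applied before comparing.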
...
...
The wording in the report about 16 bits will go,
and the Int representation of Char uses Unicode scalar values.
Currently GHC restricts Chars to [0,0x10FFFF], for instance:
Oh, right, I hadn't even tried that. I had just noticed that Hugs
rejects anything above \255.
...
  Prelude> toEnum 0x0061 :: Char
  'a'
  Prelude> toEnum 0x10FFFF :: Char
  '\1114111'
  Prelude> toEnum 0x110000 :: Char
  *** Exception: Prelude.chr: bad argument
  Prelude>
I think this is correct behaviour.
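For code that would rather not hit the exception, a total wrapper is easy to sketch. safeChr is a hypothetical helper, not something in the Prelude, and it assumes the range behaviour shown in the GHC session above, i.e. that maxBound :: Char is 0x10FFFF.

```haskell
import Data.Char (chr, ord)

-- Hypothetical helper: like chr, but returns Nothing instead of
-- raising an exception when the Int is outside the Char range.
safeChr :: Int -> Maybe Char
safeChr n
  | n >= 0 && n <= ord maxBound = Just (chr n)  -- maxBound :: Char
  | otherwise                   = Nothing
```

Using maxBound rather than a literal keeps the check in step with whatever range the implementation actually gives Char.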
I agree. This reminds me that we have to spend some time thinking about
what guarantees the report should make with respect to valid values a
Char can have (think surrogates, noncharacters...).
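As a starting point for that discussion, the two classes could be picked out with predicates along these lines. This is only a sketch: isSurrogate and isNoncharacter are hypothetical names, and the ranges are the ones the Unicode standard gives (surrogates D800..DFFF; noncharacters FDD0..FDEF plus the last two code points of every plane). Whether the report should exclude either class from Char is exactly the open question.

```haskell
import Data.Char (ord)

-- Hypothetical predicate: code points reserved for UTF-16 surrogates.
isSurrogate :: Char -> Bool
isSurrogate c = n >= 0xD800 && n <= 0xDFFF
  where n = ord c

-- Hypothetical predicate: Unicode noncharacters.
isNoncharacter :: Char -> Bool
isNoncharacter c =
     (n >= 0xFDD0 && n <= 0xFDEF)      -- contiguous block in the BMP
  || (n `mod` 0x10000) >= 0xFFFE       -- xxFFFE and xxFFFF in each plane
  where n = ord c
```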


Regards,
Sven Moritz

Sven Moritz Hallberg