
At 2002-08-20 04:34, Sven Moritz Hallberg wrote:
> I see. I find it pretty inconvenient to read the incremental changes in the different Unicode revisions.
Me too. I have the big blue book (Unicode Standard 3.0), but I have to look at the updates for 3.1 and 3.2.
> I've not been able to find the exact place where they clarify the situation with surrogate pairs. I suppose what they were is now only a facet of UTF-16, is that correct?
I believe so.
> Anyway, as you put it, I take it that there should never be a character composed of two Chars.
That's not quite correct. Every code point is exactly one Char, but some characters may be composed of more than one code point. For instance, 'á' might be represented as #00E1 [LATIN SMALL LETTER A WITH ACUTE], or as #0061 [LATIN SMALL LETTER A] + #0301 [COMBINING ACUTE ACCENT].
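
To make that concrete, here is a small sketch of my own (not from the original message; the names "precomposed" and "decomposed" are made up) showing the two spellings as lists of Chars in GHC:

  import Data.Char (chr)

  precomposed, decomposed :: String
  precomposed = [chr 0x00E1]             -- LATIN SMALL LETTER A WITH ACUTE
  decomposed  = [chr 0x0061, chr 0x0301] -- LATIN SMALL LETTER A, COMBINING ACUTE ACCENT

  main :: IO ()
  main = do
    putStrLn precomposed                 -- typically renders as á on a Unicode-aware terminal
    putStrLn decomposed                  -- also renders as á: one character, two Chars
    print (precomposed == decomposed)    -- False: different code point sequences

The two strings denote the same abstract character but compare unequal, because they are different sequences of code points.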
The wording in the report about 16 bits will go, and the Int representation of Char uses Unicode scalar values.
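
As a rough illustration of what "the Int representation of Char uses Unicode scalar values" means in practice (my sketch, assuming GHC's Data.Char; not part of the original message):

  import Data.Char (chr, ord)

  main :: IO ()
  main = do
    print (ord 'a')       -- 97: the Unicode scalar value of 'a'
    print (chr 0x00E1)    -- '\225': LATIN SMALL LETTER A WITH ACUTE
    print (ord maxBound)  -- 1114111 == 0x10FFFF, the largest Char in GHC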
Currently GHC restricts Chars to [0,0x10FFFF], for instance:

  Prelude> toEnum 0x0061 :: Char
  'a'
  Prelude> toEnum 0x10FFFF :: Char
  '\1114111'
  Prelude> toEnum 0x110000 :: Char
  *** Exception: Prelude.chr: bad argument
  Prelude>

I think this is correct behaviour.

--
Ashley Yakeley, Seattle WA