
At 2002-08-19 17:06, Sven Moritz Hallberg wrote:
> I just implemented a UTF-8 coder and decoder in Haskell. While reading the Unicode standard, I realized what someone had pointed out earlier with respect to code values versus code points: Unicode, while "usually" using 16-bit words, supports "surrogate pairs" to handle all 31 bits of UCS-4.
>
> The report says Char is a 16-bit Unicode value.
Right, sec. 6.1.2. But this should change. A Char should allow (and only allow) values in the range [0, 0x110000 - 1]. These are _Unicode scalar values_ as defined in the standard, sec. 3.7, D28. Unicode scalar values are also known as "code positions" or "code points". The current version of GHC does precisely this. I don't think UCS-4 is used anymore; all character assignments are to code points, 0 to 0x10FFFF.
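[As a concrete reading of that range, here is a minimal Haskell sketch. The helper name isCodePoint is made up for illustration; it is not from the Report or from GHC.]

    -- The proposed bound on Char: values in [0, 0x110000 - 1],
    -- i.e. the code points 0x0 through 0x10FFFF.
    -- 'isCodePoint' is an illustrative name, not a library function.
    isCodePoint :: Int -> Bool
    isCodePoint n = n >= 0 && n <= 0x10FFFF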
> What's the stance on surrogate pairs? How are we going to support those? My code currently just errors "unsupported" when encountering a surrogate.
I think we should be working to the latest version of Unicode, 3.2.0:
http://www.unicode.org/unicode/reports/tr28/

If your UTF-8 decoder comes across a sequence apparently representing a code point in the range [0xD800, 0xDFFF], it should consider it "ill-formed". This is new in 3.2.

If Chars are code points rather than 16-bit code values, then when your UTF-8 decoder comes across a sequence representing a code point in the range [0x10000, 0x10FFFF], it should represent it as a single Char, not as a surrogate pair of Chars. Surrogate pairs are for UTF-16; AFAIK they're not supposed to exist as code points.

-- 
Ashley Yakeley, Seattle WA
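[To make those decoding rules concrete, here is a minimal Haskell sketch of a decoder along these lines. It is an illustration under the assumptions above, not the poster's actual code: the name decodeUtf8 and the error messages are made up, and overlong encodings are not rejected.]

    import Data.Bits ((.&.), (.|.), shiftL)
    import Data.Char (chr)
    import Data.Word (Word8)

    -- Illustrative sketch: decode a UTF-8 byte stream into Chars holding
    -- Unicode code points.  Sequences that would denote a surrogate
    -- (0xD800-0xDFFF) or a value above 0x10FFFF are reported as ill-formed,
    -- and four-byte sequences become a single Char, never a surrogate pair.
    -- Overlong encodings are not checked here.
    decodeUtf8 :: [Word8] -> Either String [Char]
    decodeUtf8 [] = Right []
    decodeUtf8 (b:bs)
      | b < 0x80  = (chr (fromIntegral b) :) <$> decodeUtf8 bs
      | b < 0xC0  = Left "ill-formed: unexpected continuation byte"
      | b < 0xE0  = continue 1 (fromIntegral b .&. 0x1F) bs
      | b < 0xF0  = continue 2 (fromIntegral b .&. 0x0F) bs
      | b < 0xF5  = continue 3 (fromIntegral b .&. 0x07) bs
      | otherwise = Left "ill-formed: invalid lead byte"
      where
        continue :: Int -> Int -> [Word8] -> Either String [Char]
        continue 0 acc rest
          | 0xD800 <= acc && acc <= 0xDFFF = Left "ill-formed: surrogate code point"
          | acc > 0x10FFFF                 = Left "ill-formed: beyond U+10FFFF"
          | otherwise                      = (chr acc :) <$> decodeUtf8 rest
        continue n acc (c:rest)
          | c .&. 0xC0 == 0x80 =
              continue (n - 1) (shiftL acc 6 .|. fromIntegral (c .&. 0x3F)) rest
        continue _ _ _ = Left "ill-formed: truncated or bad continuation"

[For example, decodeUtf8 [0xF0, 0x9D, 0x84, 0x9E] yields Right "\119070", a single Char for U+1D11E, while decodeUtf8 [0xED, 0xA0, 0x80], the three-byte sequence that would denote the surrogate 0xD800, is rejected as ill-formed.]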