
At 2002-08-19 17:06, Sven Moritz Hallberg wrote:
> I just implemented a UTF-8 coder and decoder in Haskell. While reading the Unicode standard, I realized what someone had pointed out earlier with respect to code values versus code points: Unicode, while "usually" using 16-bit words, supports "surrogate pairs" to handle all 31 bits of UCS-4.
>
> The report says Char is a 16-bit Unicode value.
Right, sec. 6.1.2. But this should change. A Char should allow (and only allow) values in the range [0, 0x110000 - 1]. These are _Unicode scalar values_ as defined in the standard, sec. 3.7, D28. Unicode scalar values are also known as "code positions" or "code points". The current version of GHC does precisely this. I don't think UCS-4 is used anymore; all character assignments are to code points, 0 to 0x10FFFF.
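[As a concrete reading of that range, here is a minimal Haskell sketch. The helper name isCodePoint is made up for illustration; it is not from the Report or from GHC.]

    -- The proposed bound on Char: values in [0, 0x110000 - 1],
    -- i.e. the code points 0x0 through 0x10FFFF.
    -- 'isCodePoint' is an illustrative name, not a library function.
    isCodePoint :: Int -> Bool
    isCodePoint n = n >= 0 && n <= 0x10FFFF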
> What's the stance on surrogate pairs? How are we going to support those? My code currently just errors "unsupported" when encountering a surrogate.
I think we should be working to the latest version of Unicode, 3.2.0:
http://www.unicode.org/unicode/reports/tr28/

If your UTF-8 decoder comes across a sequence apparently representing a code point in the range [0xD800, 0xDFFF], it should consider it "ill-formed". This is new in 3.2.

If Chars are code points rather than 16-bit code values, then when your UTF-8 decoder comes across a sequence representing a code point in the range [0x10000, 0x10FFFF], it should represent it as a single Char, not as a surrogate pair of Chars. Surrogate pairs are for UTF-16; AFAIK they're not supposed to exist as code points.

-- 
Ashley Yakeley, Seattle WA
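[To make those decoding rules concrete, here is a minimal Haskell sketch of a decoder along these lines. It is an illustration under the assumptions above, not the poster's actual code: the name decodeUtf8 and the error messages are made up, and overlong encodings are not rejected.]

    import Data.Bits ((.&.), (.|.), shiftL)
    import Data.Char (chr)
    import Data.Word (Word8)

    -- Illustrative sketch: decode a UTF-8 byte stream into Chars holding
    -- Unicode code points.  Sequences that would denote a surrogate
    -- (0xD800-0xDFFF) or a value above 0x10FFFF are reported as ill-formed,
    -- and four-byte sequences become a single Char, never a surrogate pair.
    -- Overlong encodings are not checked here.
    decodeUtf8 :: [Word8] -> Either String [Char]
    decodeUtf8 [] = Right []
    decodeUtf8 (b:bs)
      | b < 0x80  = (chr (fromIntegral b) :) <$> decodeUtf8 bs
      | b < 0xC0  = Left "ill-formed: unexpected continuation byte"
      | b < 0xE0  = continue 1 (fromIntegral b .&. 0x1F) bs
      | b < 0xF0  = continue 2 (fromIntegral b .&. 0x0F) bs
      | b < 0xF5  = continue 3 (fromIntegral b .&. 0x07) bs
      | otherwise = Left "ill-formed: invalid lead byte"
      where
        continue :: Int -> Int -> [Word8] -> Either String [Char]
        continue 0 acc rest
          | 0xD800 <= acc && acc <= 0xDFFF = Left "ill-formed: surrogate code point"
          | acc > 0x10FFFF                 = Left "ill-formed: beyond U+10FFFF"
          | otherwise                      = (chr acc :) <$> decodeUtf8 rest
        continue n acc (c:rest)
          | c .&. 0xC0 == 0x80 =
              continue (n - 1) (shiftL acc 6 .|. fromIntegral (c .&. 0x3F)) rest
        continue _ _ _ = Left "ill-formed: truncated or bad continuation"

[For example, decodeUtf8 [0xF0, 0x9D, 0x84, 0x9E] yields Right "\119070", a single Char for U+1D11E, while decodeUtf8 [0xED, 0xA0, 0x80], the three-byte sequence that would denote the surrogate 0xD800, is rejected as ill-formed.]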