
Colin Paul Adams
Char is not an encoding, right?
Ivan> No, but in GHC at least it corresponds to a Unicode codepoint.
I don't think this is right, or it shouldn't be right, anyway. Surely it stands for a character. Unicode code points include values that are not characters, such as the surrogate code points used by UTF-16 to map non-BMP code points to pairs of 16-bit code units.
Prelude> (toEnum 0xD800) :: Char
'\55296'
I don't think you ought to be able to see a surrogate code point as a Char.
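For illustration, here is a minimal sketch of how one could detect surrogates and reject them when building a Char. It assumes GHC's behaviour as shown in the session above, where a Char may hold any code point. The names isSurrogate and scalarValue are hypothetical helpers made up for this example; they are not part of base.

```haskell
import Data.Char (chr)

-- | True for code points in the UTF-16 surrogate range U+D800..U+DFFF.
-- (Hypothetical helper, not a function from Data.Char.)
isSurrogate :: Char -> Bool
isSurrogate c = c >= '\xD800' && c <= '\xDFFF'

-- | Hypothetical smart constructor: accept only code points in
-- 0..0x10FFFF that are not surrogates, and reject everything else.
scalarValue :: Int -> Maybe Char
scalarValue n
  | n < 0 || n > 0x10FFFF = Nothing   -- outside the Unicode codespace
  | isSurrogate c         = Nothing   -- a surrogate, never a character
  | otherwise             = Just c
  where
    c = chr n   -- only evaluated once the range check has passed
```

With these definitions, isSurrogate (toEnum 0xD800) is True, scalarValue 0xD800 is Nothing, and scalarValue 0x41 is Just 'A'.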
This is a bit confusing. From the Unicode glossary:

- Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding. (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. (4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]

- Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) (2) A value, or position, for a character, in any coded character set.
From Wikipedia on UTF-16:
  Unicode and ISO/IEC 10646 do not, and will never, assign characters to any of the code points in the U+D800–U+DFFF range, so an individual code unit from a surrogate pair does not ever represent a character.

So: A Char holds a code point, that is, a value from 0 to 0x10FFFF. Some of these values do not correspond to Unicode characters.

As far as I can tell, a surrogate pair in UTF-16 can be read both as two surrogate code points, each stored in a two-byte code unit, and as a single code point encoded in four bytes. Implementations seem to differ about what the length of a string containing surrogate pairs is: some count code points, others count 16-bit code units.

-k
--
If I haven't seen further, it is by standing in the footprints of giants
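To make the point about differing lengths concrete, here is a rough sketch of the standard UTF-16 encoding arithmetic, with the two ways of counting side by side. The function names utf16Units, codePointLength and utf16Length are invented for this example and do not come from any library. The sketch assumes, as discussed above, that a Haskell String is a list of code points.

```haskell
import Data.Char (ord)
import Data.Word (Word16)

-- | Encode one code point as UTF-16 code units: a single unit for
-- code points below 0x10000, a high/low surrogate pair above that.
-- A lone surrogate Char is passed through unchanged as one unit.
utf16Units :: Char -> [Word16]
utf16Units c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + m `div` 0x400)   -- high surrogate
                  , fromIntegral (0xDC00 + m `mod` 0x400) ] -- low surrogate
  where
    n = ord c
    m = n - 0x10000   -- 20-bit offset into the supplementary planes

-- | Length counted in code points, which is what 'length' gives
-- for a String.
codePointLength :: String -> Int
codePointLength = length

-- | Length counted in 16-bit code units, which is what an
-- implementation storing strings as UTF-16 might report.
utf16Length :: String -> Int
utf16Length = length . concatMap utf16Units
```

For the two-element string ['a', '\x1D11E'], codePointLength gives 2 while utf16Length gives 3, because U+1D11E is encoded as the surrogate pair 0xD834 0xDD1E.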