Fri, 5 Oct 2001 02:29:51 -0700 (PDT), Krasimir Angelov writes:
Why is Char 32 bits? Unicode characters are 16 bits.
No, Unicode characters have 21 bits (range U+0000..10FFFF). They used to fit in 16 bits a long time ago, and they are sometimes encoded as UTF-16 (each character occupies one or two 16-bit words).

-- 
Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.ids.net.pl/
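The 21-bit figure and the "one or two 16-bit words" remark can be checked with a small Haskell sketch. The helper names `bitsNeeded` and `utf16Words` are made up for this illustration, and the sketch assumes a compiler such as GHC, where `maxBound :: Char` is U+10FFFF.

```haskell
import Data.Char (ord)

-- Number of bits needed to write a non-negative Int in binary.
bitsNeeded :: Int -> Int
bitsNeeded 0 = 1
bitsNeeded n = length (takeWhile (> 0) (iterate (`div` 2) n))

-- Number of 16-bit words UTF-16 uses for a character:
-- one inside the BMP (up to U+FFFF), two above it.
utf16Words :: Char -> Int
utf16Words c
  | ord c <= 0xFFFF = 1
  | otherwise       = 2
```

Under that assumption, `bitsNeeded (ord maxBound)` evaluates to 21, `utf16Words 'a'` to 1, and `utf16Words '\x1D11E'` to 2.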
"Marcin 'Qrczak' Kowalczyk" writes:
Fri, 5 Oct 2001 02:29:51 -0700 (PDT), Krasimir Angelov writes:
Why is Char 32 bits? Unicode characters are 16 bits.
No, Unicode characters have 21 bits (range U+0000..10FFFF).
We've been through all this, of course, but here's a quote:
"Unicode" originally implied that the encoding was UCS-2 and it initially didn't make any provisions for characters outside the BMP (U+0000 to U+FFFF). When it became clear that more than 64k characters would be needed for certain special applications (historic alphabets and ideographs, mathematical and musical typesetting, etc.), Unicode was turned into a sort of 21-bit character set with possible code points in the range U-00000000 to U-0010FFFF. The 2×1024 surrogate characters (U+D800 to U+DFFF) were introduced into the BMP to allow 1024×1024 non-BMP characters to be represented as a sequence of two 16-bit surrogate characters. This way UTF-16 was born, which represents the extended "21-bit" Unicode in a way backwards compatible with UCS-2. The term UTF-32 was introduced in Unicode to mean a 4-byte encoding of the extended "21-bit" Unicode. UTF-32 is the exact same thing as UCS-4, except that by definition UTF-32 is never used to represent characters above U-0010FFFF, while UCS-4 can cover all 2^31 code positions up to U-7FFFFFFF.
from a/the Unicode FAQ at http://www.cl.cam.ac.uk/~mgk25/unicode.html

Does Haskell's support of "Unicode" mean UTF-32, or full UCS-4? Recent messages seem to indicate the former, but I don't see any reason against the latter.

-kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants
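For reference, the surrogate mechanism described in the quoted FAQ passage can be sketched in Haskell roughly as follows. This is only an illustration of the arithmetic, not code from any library: `encodeUtf16` and `decodePair` are hypothetical names, and the sketch assumes its inputs are valid (no lone surrogates, and `decodePair` is only given a real high/low pair).

```haskell
import Data.Char (chr, ord)
import Data.Word (Word16)

-- Encode one character as a list of UTF-16 code units.
-- Assumes the input is not itself a surrogate code point (U+D800..U+DFFF).
encodeUtf16 :: Char -> [Word16]
encodeUtf16 c
  | n < 0x10000 = [fromIntegral n]            -- BMP: a single 16-bit word
  | otherwise   = [hi, lo]                    -- beyond the BMP: surrogate pair
  where
    n  = ord c
    m  = n - 0x10000                          -- 20-bit offset above the BMP
    hi = fromIntegral (0xD800 + m `div` 0x400) -- high surrogate: top 10 bits
    lo = fromIntegral (0xDC00 + m `mod` 0x400) -- low surrogate: bottom 10 bits

-- Decode a (high, low) surrogate pair back to a character.
-- Assumes hi is in 0xD800..0xDBFF and lo is in 0xDC00..0xDFFF.
decodePair :: Word16 -> Word16 -> Char
decodePair hi lo =
  chr (0x10000 + (fromIntegral hi - 0xD800) * 0x400 + (fromIntegral lo - 0xDC00))
```

Each surrogate carries 10 bits, so a pair covers 1024×1024 code points, which is exactly the range U+10000 to U+10FFFF. For example, `encodeUtf16 '\x1D11E'` gives `[0xD834, 0xDD1E]`, and `decodePair 0xDBFF 0xDFFF` gives U+10FFFF.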
participants (2): Ketil Malde, Marcin 'Qrczak' Kowalczyk