
----- Original Message -----
From: "Dylan Thurston"
On Fri, Oct 05, 2001 at 11:23:50PM +1000, Andrew J Bromage wrote:
G'day all.
On Fri, Oct 05, 2001 at 02:29:51AM -0700, Krasimir Angelov wrote:
Why is Char 32 bits? Unicode characters are 16 bits.
It's not quite as simple as that. There is a set of one million (more correctly, 1M) Unicode characters which are only accessible using surrogate pairs (i.e. two UTF-16 codes). There are currently none of these codes assigned, and when they are, they'll be extremely rare. So rare, in fact, that having strings take up twice the space that they currently do simply isn't worth the cost.
This is no longer true, as of Unicode 3.1. Almost half of all characters currently assigned are outside of the BMP (i.e., require surrogate pairs in the UTF-16 encoding), including many Chinese characters. In current usage, these characters probably occur mainly in names, and are rare, but obviously important for the people involved.
In plane 2 (one of the surrogate planes) there are about 41000 Hàn characters, in addition to the roughly 27000 Hàn characters in the BMP. And more are expected to be encoded. However, IIRC, only about 6000-7000 of them are in modern use. There are also the mathematical alphanumerical characters in plane 1; I don't really want to push for them (since I think they are a major design mistake), but some people like them. And there are the more likable (IMHO) musical characters in plane 1 ("western", though that attribute was removed, and Byzantine!). (You cannot set a musical score in Unicode plain text; it just encodes the characters that you can use IN a musical score.) ...
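To make the surrogate-pair point concrete, here is a minimal sketch of how a code point outside the BMP turns into two UTF-16 codes. The name encodeUtf16 is made up for illustration and is not part of any Haskell library discussed in this thread; the sketch assumes the input is a valid character and does not reject lone surrogates.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- Hypothetical helper: encode one character as UTF-16 code units.
-- Characters in the BMP (up to U+FFFF) take one unit; everything
-- above takes a high surrogate followed by a low surrogate.
-- Assumption: the input is not itself a surrogate code point.
encodeUtf16 :: Char -> [Word16]
encodeUtf16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [hi, lo]
  where
    n  = ord c
    n' = n - 0x10000                              -- 20-bit offset above the BMP
    hi = fromIntegral (0xD800 + (n' `shiftR` 10)) -- top 10 bits
    lo = fromIntegral (0xDC00 + (n' .&. 0x3FF))   -- bottom 10 bits
```

With this, U+20000 (the first Hàn character in plane 2) becomes the pair D840 DC00, and U+1D11E (MUSICAL SYMBOL G CLEF in plane 1) becomes D834 DD1E, whereas a 32-bit Char holds either one directly.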
isAscii, isLatin1 - OK Yes, but why do (or, rather, did) you want them, isLatin1 in particular? Then what about "isCP1252" (THE most common encoding today), "isShiftJis", etc., for several hundred encodings? (I'm not proposing to remove isAscii, but isLatin1 is dubious.)
isControl - I don't know about this. Why do (did) you want it? There are several "kinds" of "control" characters in Unicode: the traditional C0 and (less used) C1 ones, format control characters (NO, they do NOT control FORMATTING, though they do control FORMAT, like cursive connections), ...
isPrint - Dubious. Is a non-spacing accent a printable character? A combining character is most definitely "printable". (There is a difference between non-spacing and combining: many combining characters are non-spacing, but not all of them are.)
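The non-spacing versus combining distinction can be sketched in terms of general categories. This assumes a present-day Data.Char that exports generalCategory and the GeneralCategory type (the library under discussion here has no such function); the two predicate names are made up for illustration.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Hypothetical predicate: marks that take up no width of their own (Mn).
isNonSpacingMark :: Char -> Bool
isNonSpacingMark c = generalCategory c == NonSpacingMark

-- Hypothetical predicate: any combining mark, spacing or not (Mn, Mc, Me).
isCombiningMark :: Char -> Bool
isCombiningMark c =
  generalCategory c `elem` [NonSpacingMark, SpacingCombiningMark, EnclosingMark]
```

For example, U+0301 COMBINING ACUTE ACCENT satisfies both predicates, while U+0903 DEVANAGARI SIGN VISARGA is combining but not non-spacing.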
isSpace - OK, by the comment in the report: "The isSpace function recognizes only white characters in the Latin-1 range". Sigh. There are several others, most importantly: LINE SEPARATOR, PARAGRAPH SEPARATOR, and IDEOGRAPHIC SPACE. And the NEL in the C1 range.
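As a rough sketch of what a wider test might look like, the following wraps the report's isSpace and adds the four characters named above. The name isSpaceWide is hypothetical, and the list is only the characters mentioned in this message, not a complete list of Unicode white space.

```haskell
import Data.Char (isSpace)

-- Hypothetical wider white-space test: whatever isSpace accepts,
-- plus the characters named in the message above.
-- Assumption: this list is illustrative, not exhaustive.
isSpaceWide :: Char -> Bool
isSpaceWide c = isSpace c || c `elem` extra
  where
    extra =
      [ '\x0085'  -- NEL (NEXT LINE), in the C1 range
      , '\x2028'  -- LINE SEPARATOR
      , '\x2029'  -- PARAGRAPH SEPARATOR
      , '\x3000'  -- IDEOGRAPHIC SPACE
      ]
```

A function like `words` built on such a predicate would then split on IDEOGRAPHIC SPACE as well as on the Latin-1 blanks.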
isUpper, isLower - Maybe OK. This is property interrogation. There are many other properties of interest.
toUpper, toLower - Not OK. There are cases where upper casing a character yields two characters. See my other e-mail.
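The problem with a Char -> Char signature can be seen in a small sketch where upper-casing returns a String instead. The names upperChar and upperString are made up for illustration, and only two of the special cases are listed; a real implementation would take the full set from the Unicode data files, and would also need to handle locale-dependent cases.

```haskell
import Data.Char (toUpper)

-- Hypothetical helper: upper-case one character to a String, so that
-- a single input character may expand to several output characters.
-- Assumption: only two expanding cases are shown here.
upperChar :: Char -> String
upperChar '\xDF'   = "SS"         -- LATIN SMALL LETTER SHARP S
upperChar '\xFB01' = "FI"         -- LATIN SMALL LIGATURE FI
upperChar c        = [toUpper c]  -- everything else: one-to-one mapping

-- Upper-case a whole string using the expanding per-character mapping.
upperString :: String -> String
upperString = concatMap upperChar
```

Here `upperString "gro\xDF"` gives "GROSS", a five-character result from a four-character input, which `map toUpper` cannot produce.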
etc. Any program using this library is bound to get confused on Unicode strings. Even before Unicode, there is much functionality missing; for instance, I don't see any way to compare strings using a localized order.
Is anyone working on honest support for Unicode, in the form of a real Unicode library with an interface at the correct level?
Well, IBM's ICU, for one, ... But they only do it for C/C++/Java, not for Haskell...

Kind regards
/kent k