
On Fri, Oct 05, 2001 at 11:23:50PM +1000, Andrew J Bromage wrote:
G'day all.
On Fri, Oct 05, 2001 at 02:29:51AM -0700, Krasimir Angelov wrote:
Why Char is 32 bit. UniCode characters is 16 bit.
It's not quite as simple as that. There is a set of one million (more correctly, 1M) Unicode characters which are only accessible using surrogate pairs (i.e. two UTF-16 codes). There are currently none of these codes assigned, and when they are, they'll be extremely rare. So rare, in fact, that having strings take up twice the space they currently do simply isn't worth the cost.
This is no longer true, as of Unicode 3.1. Almost half of all characters currently assigned are outside of the BMP (i.e., require surrogate pairs in the UTF-16 encoding), including many Chinese characters. In current usage, these characters probably occur mainly in names, and are rare, but obviously important for the people involved.
However, you still need to be able to handle them. I don't know what the "official" Haskell reasoning is (it may have more to do with word size than Unicode semantics), but it makes sense to me to store single characters in UTF-32 but strings in a more compressed format (UTF-8 or UTF-16).
Haskell already stores strings as lists of characters, so I see no advantage to anything other than UTF-32, since they'll take up a full machine word in any case. (Right?) There's even plenty of room for tags if any implementations want to use it.
See also: http://www.unicode.org/unicode/faq/utf_bom.html
It just goes to show that strings are not merely arrays of characters like some languages would have you believe.
Right. In Unicode, the concept of a "character" is not really so useful; most functions that traditionally operate on characters (e.g., uppercase or display-width) fundamentally need to operate on strings. (This is due to properties of particular languages, not any design flaw of Unicode.)

Err, this raises some questions as to just what the "Char" module from the standard library is supposed to do. Most of the functions are just not well-defined:

isAscii, isLatin1 - OK
isControl - I don't know about this.
isPrint - Dubious. Is a non-spacing accent a printable character?
isSpace - OK, by the comment in the report: "The isSpace function recognizes only white characters in the Latin-1 range".
isUpper, isLower - Maybe OK.
toUpper, toLower - Not OK. There are cases where upper casing a character yields two characters.
etc.

Any program using this library is bound to get confused on Unicode strings. Even before Unicode, there is much functionality missing; for instance, I don't see any way to compare strings using a localized order.

Is anyone working on honest support for Unicode, in the form of a real Unicode library with an interface at the correct level?

Best,
Dylan Thurston
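[Editorial note: the toUpper problem above can be illustrated with a short sketch. upperString is a hypothetical name, and only the mapping for U+00DF (sharp s) is hard-coded here; a real library would consult the full Unicode SpecialCasing data.]

```haskell
import qualified Data.Char as C

-- A string-level upper-casing function. The result can be longer than
-- the input, which the Char -> Char type of the standard toUpper
-- cannot express. Only the sharp-s case is handled, for illustration.
upperString :: String -> String
upperString = concatMap upper1
  where
    upper1 '\x00DF' = "SS"          -- LATIN SMALL LETTER SHARP S
    upper1 c        = [C.toUpper c]
```

For example, upperString applied to the German word for "street" (containing sharp s) yields "STRASSE", one character longer than the input.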

There are other issues involved too. For instance, normalisation. Also the question of reading files - will the runtime auto-detect the encoding of the file, and translate to Unicode? How will applications find out what the original encoding was? How will applications be able to set the encoding for output files (for instance, an XSLT processor, which I am planning to write in Haskell, must honour a request to write the output file in either utf-8 or utf-16)?

I suspect a Unicode library will be needed.

--
Colin Paul Adams
Preston Lancashire

Dylan Thurston
Right. In Unicode, the concept of a "character" is not really so useful;
After reading a bit about it, I'm certainly confused. Unicode/ISO-10646 contains a lot of things that aren't really one character, e.g. ligatures.
most functions that traditionally operate on characters (e.g., uppercase or display-width) fundamentally need to operate on strings. (This is due to properties of particular languages, not any design flaw of Unicode.)
I think an argument could be put forward that Unicode is trying to be more than just a character set. At least at first glance, it seems to try to be both a character set and a glyph map, and incorporate things like transliteration between character sets (or subsets, now that Unicode contains them all), directionality of script, and so on.
toUpper, toLower - Not OK. There are cases where upper casing a character yields two characters.
I thought title case was supposed to handle this. I'm probably confused, though.
etc. Any program using this library is bound to get confused on Unicode strings. Even before Unicode, there is much functionality missing; for instance, I don't see any way to compare strings using a localized order.
And you can't really use list functions like "length" on strings, since one item can be two characters (Lj, ij, fi) and several items can compose one character (combining characters). And "map (==)" can't compare two Strings either, e.g. in the presence of combining characters. How are other systems handling this?

It may be that Unicode isn't flawed, but it's certainly extremely complex. I guess I'll have to delve a bit deeper into it.

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants
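[Editorial note: the combining-character problem described above can be sketched as follows. decompose and canonEq are hypothetical names, and decompose knows only one mapping (U+00E9 to e + combining acute), purely for illustration; a real implementation would use the full Unicode decomposition tables and proper normalisation (NFD/NFC).]

```haskell
-- "e with acute" can be written as the single code point U+00E9 or as
-- 'e' followed by U+0301 COMBINING ACUTE ACCENT; plain (==) on Strings
-- sees these two spellings as different.
decompose :: String -> String
decompose = concatMap d
  where
    d '\x00E9' = "e\x0301"   -- single code point -> base + combining mark
    d c        = [c]

-- Canonical equivalence: compare decompositions, not the raw text.
canonEq :: String -> String -> Bool
canonEq a b = decompose a == decompose b
```

Here canonEq reports the two spellings equal even though (==) on the raw strings does not.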

----- Original Message -----
From: "Ketil Malde"
Dylan Thurston writes:
Right. In Unicode, the concept of a "character" is not really so useful;
After reading a bit about it, I'm certainly confused. Unicode/ISO-10646 contains a lot of things that aren't really one character, e.g. ligatures.
The ligatures that are included are there for compatibility with older character encodings. Normally, with modern technology, ligatures are (to be) formed automatically through the font. OpenType (OT; MS and Adobe) and AAT (Apple) have support for this. There are often requests to add more ligatures to 10646/Unicode, but they are rejected, since 10646/Unicode encodes characters, not glyphs. (With two well-known exceptions: for compatibility, and certain dingbats.)
most functions that traditionally operate on characters (e.g., uppercase or display-width) fundamentally need to operate on strings. (This is due to properties of particular languages, not any design flaw of Unicode.)
I think an argument could be put forward that Unicode is trying to be more than just a character set. At least at first glance, it seems to
Yes, but:
try to be both a character set and a glyph map, and incorporate things
not that. See above.
like transliteration between character sets (or subsets, now that Unicode contains them all), directionality of script, and so on.
Unicode (but not 10646) does handle bidirectionality (see UAX 9: http://www.unicode.org/unicode/reports/tr9/), but not transliteration. (Transliteration is handled in IBM's ICU, though: http://www-124.ibm.com/developerworks/oss/icu4j/index.html)
toUpper, toLower - Not OK. There are cases where upper casing a character yields two characters.
I thought title case was supposed to handle this. I'm probably confused, though.
The titlecase characters in Unicode are (essentially) only there for compatibility reasons (originally for transliterating between certain subsets of Cyrillic and Latin scripts in a 1-1 way). You're not supposed to really use them...

The cases where toUpper of a single character gives two characters are for some (classical) Greek, where a built-in subscript iota turns into a capital iota, and other cases where there is no corresponding uppercase letter. It is also the case that case mapping is context sensitive: e.g. mapping capital sigma to small sigma (mostly) or ς (small final sigma, at end of word), or the capital I to ı (small dotless i) if Turkish, or insert/delete combining dot above for i and j in Lithuanian. See UTR 21 and http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt.
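[Editorial note: the context-sensitive sigma rule mentioned above can be sketched in a few lines. lowerGreek is a hypothetical name, and "end of word" is simplified here to "not followed by a letter"; the real rule in UTR 21 is more involved.]

```haskell
import Data.Char (isLetter, toLower)

-- Context-sensitive lower-casing of Greek capital sigma (U+03A3): it
-- becomes final sigma U+03C2 at the end of a word and U+03C3 otherwise,
-- so the mapping cannot be expressed as a Char -> Char function.
lowerGreek :: String -> String
lowerGreek []     = []
lowerGreek (c:cs)
  | c == '\x03A3' = sigma : lowerGreek cs
  | otherwise     = toLower c : lowerGreek cs
  where
    sigma = case cs of
      (d:_) | isLetter d -> '\x03C3'  -- mid-word: small sigma
      _                  -> '\x03C2'  -- word-final: final sigma
```

The same capital sigma thus lower-cases differently depending on its position in the string.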
etc. Any program using this library is bound to get confused on Unicode strings. Even before Unicode, there is much functionality missing; for instance, I don't see any way to compare strings using a localized order.
And you can't really use list functions like "length" on strings, since one item can be two characters (Lj, ij, fi) and several items can compose one character (combining characters).
Depends on what you mean by "length" and "character"... You seem to be after what is sometimes referred to as a "grapheme", and counting those. There is a proposal for a definition of "language independent grapheme" (with lexical syntax), but I don't think it is stable yet.
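[Editorial note: a very rough grapheme count, in the sense discussed above, can be approximated from the Unicode general categories. graphemeLength is a hypothetical name; real grapheme segmentation has many more rules and this only handles the combining-mark case.]

```haskell
import Data.Char (generalCategory, GeneralCategory(..))

-- Approximate grapheme count: a new grapheme starts at every code
-- point that is not a combining mark, so combining marks attach to
-- the preceding base character instead of being counted separately.
graphemeLength :: String -> Int
graphemeLength = length . filter (not . isCombining)
  where
    isCombining c = generalCategory c `elem`
      [NonSpacingMark, SpacingCombiningMark, EnclosingMark]
```

For "e" followed by a combining acute accent, the list length is 2 but the grapheme count is 1.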
And "map (==)" can't compare two Strings either, e.g. in the presence of combining characters. How are other systems handling this?
I guess it is not very systematic. Java and XML make the comparisons directly by equality of the 'raw' characters *when* comparing identifiers/similar, though for XML there is a proposal for "early normalisation", essentially to NFC (normal form C). I would have preferred comparing the normal forms of the identifiers instead. For searches, the recommendation (though I doubt it is followed much in practice yet) is to use a collation-key based comparison. (Note that collation keys are usually language dependent. More about collation in UTS 10, http://www.unicode.org/unicode/reports/tr10/, and ISO/IEC 14651.)

What does NOT make sense is to expose (to a user) the raw ordering (<) of Unicode strings, though it may be useful internally. Orders exposed to people (or other systems, for that matter) that aren't concerned with the inner workings of a program should always be collation based. (But that holds for any character encoding, it's just more apparent for Unicode.)
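[Editorial note: the collation-key idea above can be sketched as follows. collationKey and sortCollated are hypothetical names, and the key function is only a stand-in (it lower-cases, giving a case-insensitive order); a real key would be built per UTS 10 / ISO/IEC 14651 and be language dependent.]

```haskell
import Data.Char (toLower)
import Data.List (sortBy)
import Data.Ord  (comparing)

-- Collation-based ordering: compare strings via a sort key, never by
-- raw code-point order, which is what (<) on String gives you.
collationKey :: String -> String
collationKey = map toLower

sortCollated :: [String] -> [String]
sortCollated = sortBy (comparing collationKey)
```

Raw code-point order would sort "Banana" before "apple" (uppercase letters come first in ASCII); the keyed sort puts "apple" first.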
It may be that Unicode isn't flawed, but it's certainly extremely complex. I guess I'll have to delve a bit deeper into it.
It's complex, but that is because the scripts of the world are complex (and add to that politics, as well as compatibility and implementation issues).

Kind regards
/kent k

----- Original Message -----
From: "Dylan Thurston"
On Fri, Oct 05, 2001 at 11:23:50PM +1000, Andrew J Bromage wrote:
G'day all.
On Fri, Oct 05, 2001 at 02:29:51AM -0700, Krasimir Angelov wrote:
Why Char is 32 bit. UniCode characters is 16 bit.
It's not quite as simple as that. There is a set of one million (more correctly, 1M) Unicode characters which are only accessible using surrogate pairs (i.e. two UTF-16 codes). There are currently none of these codes assigned, and when they are, they'll be extremely rare. So rare, in fact, that having strings take up twice the space they currently do simply isn't worth the cost.
This is no longer true, as of Unicode 3.1. Almost half of all characters currently assigned are outside of the BMP (i.e., require surrogate pairs in the UTF-16 encoding), including many Chinese characters. In current usage, these characters probably occur mainly in names, and are rare, but obviously important for the people involved.
In plane 2 (one of the supplementary planes) there are about 41000 Hàn characters, in addition to the about 27000 Hàn characters in the BMP. And more are expected to be encoded. However, IIRC, only about 6000-7000 of them are in modern use.

I don't really want to push for them (since I think they are a major design mistake), but some people like them: the mathematical alphanumerical characters in plane 1. There are also the more likable (IMHO) musical characters in plane 1 ("western", though that attribute was removed, and Byzantine!). (You cannot set a musical score in Unicode plain text, it just encodes the characters that you can use IN a musical score.) ...
isAscii, isLatin1 - OK

Yes, but why do (or, rather, did) you want them; isLatin1 in particular? Then what about "isCP1252" (THE most common encoding today), "isShiftJis", etc., for several hundred encodings? (I'm not proposing to remove isAscii, but isLatin1 is dubious.)
isControl - I don't know about this.

Why do (did) you want it? There are several "kinds" of "control" characters in Unicode: the traditional C0 and (less used) C1 ones, format control characters (NO, they do NOT control FORMATTING, though they do control FORMAT, like cursive connections), ...
isPrint - Dubious. Is a non-spacing accent a printable character?

A combining character is most definitely "printable". (There is a difference between non-spacing and combining; though many combining characters are non-spacing, not all of them are.)
isSpace - OK, by the comment in the report: "The isSpace function recognizes only white characters in the Latin-1 range".

Sigh. There are several others, most importantly: LINE SEPARATOR, PARAGRAPH SEPARATOR, and IDEOGRAPHIC SPACE. And the NEL in the C1 range.
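[Editorial note: a Unicode-aware whitespace test along the lines Kent describes might look like this. isUnicodeSpace is a hypothetical name, and the list of extra code points is illustrative, not exhaustive.]

```haskell
import Data.Char (isSpace)

-- Extend the Latin-1-only isSpace of the Haskell 98 report with the
-- separators Kent lists: NEL (C1), LINE SEPARATOR, PARAGRAPH
-- SEPARATOR, and IDEOGRAPHIC SPACE.
isUnicodeSpace :: Char -> Bool
isUnicodeSpace c =
     isSpace c
  || c `elem` "\x0085\x2028\x2029\x3000"  -- NEL, LS, PS, ideographic space
```

A full implementation would instead consult the Unicode White_Space property from the character database.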
isUpper, isLower - Maybe OK.

This is property interrogation. There are many other properties of interest.
toUpper, toLower - Not OK. There are cases where upper casing a character yields two characters.

See my other e-mail.
etc. Any program using this library is bound to get confused on Unicode strings. Even before Unicode, there is much functionality missing; for instance, I don't see any way to compare strings using a localized order.
Is anyone working on honest support for Unicode, in the form of a real Unicode library with an interface at the correct level?
Well, IBM's ICU, for one, ... But they only do it for C/C++/Java, not for Haskell...

Kind regards
/kent k
participants (4)
- Colin Paul Adams
- Dylan Thurston
- Kent Karlsson
- Ketil Malde