
On October 20, 2010 15:45:44 Axel Simon wrote:
AFAIK there are scripts with so many combinations that Unicode does not have a single codepoint for each character. In Arabic you can have one of 5 vowel signs on each of the 28 letters, but Unicode does not provide 5*28 codepoints for the combinations. That is probably the reason for having these combining characters.
Mac OS tries to split characters into as many codepoints as possible (the decomposed form), whereas Windows tries to merge them into as few codepoints as possible (the composed form). I don't think there is a good semantics for replace without knowing which (normal) form you're working on. Normally, search/replace and sorting on Unicode are specialized algorithms that cannot be reduced to simple substitutions or permutations.
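
As a concrete illustration of the two conventions described above, here is a minimal Python sketch using only the standard unicodedata module (NFC is the merged/composed form roughly matching the Windows convention, NFD the split/decomposed form roughly matching Mac OS file names); the Arabic example shows a combination for which no precomposed codepoint exists:

    import unicodedata

    composed = "\u00E9"        # e-acute as one precomposed codepoint
    decomposed = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT

    # NFC merges base + mark into a precomposed codepoint where one exists.
    print(unicodedata.normalize("NFC", decomposed) == composed)   # True

    # NFD splits a precomposed codepoint back into base + combining mark.
    print(unicodedata.normalize("NFD", composed) == decomposed)   # True

    # Arabic BEH + FATHA has no precomposed codepoint, so even NFC
    # leaves it as two codepoints.
    arabic = "\u0628\u064E"
    print(len(unicodedata.normalize("NFC", arabic)))               # 2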
Thanks to everyone for the examples.

Given that not all combined characters can be reduced to a single codepoint (from your first paragraph), it would seem that the Mac OS normalization has a conceptual advantage over the Windows one. Specifically, it is appealing that the normalized string is in some sense less complex, in that it contains only elementary codepoints (ones that cannot be decomposed further) and combining codepoints, whereas the other still contains a mix.

Am I correct, then, in understanding that, viewing strings as a vector/list of elementary chars, each elementary char would actually have to be a base codepoint plus an arbitrary number of additional combining codepoints in order to correspond well to the human notion of a character? This doesn't map well onto the existing vector/list style interfaces, because such an elementary char type is not a simple enumeration that can be treated atomically. Operations would frequently need to look inside it (e.g., replace base codepoints irrespective of the combining codepoints attached to them).

Cheers! -Tyson
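
As a concrete sketch of the kind of operation described in the reply above (replacing base codepoints while leaving whatever combining codepoints are attached to them), the following Python example uses only the standard unicodedata module; the clusters/replace_base helpers are invented for this illustration, and grouping by combining class is a simplification of full grapheme clustering (UAX #29):

    import unicodedata

    def clusters(s):
        # Start a new group at each non-combining codepoint; combining
        # marks are appended to the preceding group.  (A simplification
        # of the real grapheme-cluster rules in UAX #29.)
        groups = []
        for ch in s:
            if unicodedata.combining(ch) == 0 or not groups:
                groups.append(ch)
            else:
                groups[-1] += ch
        return groups

    def replace_base(s, old_base, new_base):
        # Swap the base codepoint of each cluster, keeping its marks.
        out = []
        for g in clusters(s):
            base, marks = g[0], g[1:]
            out.append((new_base if base == old_base else base) + marks)
        return "".join(out)

    text = "re\u0301sume\u0301"            # "résumé" in decomposed form
    print(clusters(text))                  # ['r', 'é', 's', 'u', 'm', 'é']
    print(replace_base(text, "e", "a"))    # "rásumá": accents stay attached

This is exactly the sort of "looking inside the elementary char" that a flat list-of-codepoints interface does not provide for free.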