
On Oct 20, 2010, at 19:44, Ian Lynagh wrote:
> Johan wrote:
>> If you process a string code point by code point you might mistakenly confuse a plain "a" (A) with a "å" (A-RING *or* A + COMBINING RING).
> But when characters and codepoints are 1:1, you /can/ process code point by code point.
> Am I missing something?

AFAIK there are scripts that have so many combinations that Unicode does not have a single codepoint for each character. In Arabic you can have one of 5 vowel signs on each of the 28 letters, but Unicode does not provide 5*28 codepoints for the combinations. That is probably the reason for having these combining characters.

Mac OS tries to split each character into as many codepoints as possible, whereas Windows tries to merge them as much as possible. I don't think there is a good semantics for replace without knowing which normal form you're working on.

Normally, search/replace and sorting on Unicode are specialized algorithms that cannot be reduced to simple substitutions or permutations. So I suggest just providing functions on codepoints and letting the user struggle with the rest.

Cheers,
Axel
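
To illustrate the point about normal forms, here is a minimal sketch in Python (not the language under discussion; chosen only because its standard `unicodedata` module exposes normalization). The variable names are made up for the example, and the Arabic pair (BEH + FATHA) is just one letter/vowel-sign combination picked as an assumption to show a case with no single codepoint.

```python
import unicodedata

# "å" as a single codepoint: LATIN SMALL LETTER A WITH RING ABOVE
precomposed = "\u00e5"
# "å" as two codepoints: "a" followed by COMBINING RING ABOVE
decomposed = "a\u030a"

# Same character on screen, different codepoint sequences.
same_codepoints = precomposed == decomposed            # False
lengths = (len(precomposed), len(decomposed))          # (1, 2)

# A naive codepoint-level replace of "a" leaves the precomposed form
# alone but turns the decomposed form into "o" + COMBINING RING ABOVE.
replaced_precomposed = precomposed.replace("a", "o")
replaced_decomposed = decomposed.replace("a", "o")

# Normalizing to one form first makes the two spellings comparable.
nfc = unicodedata.normalize("NFC", decomposed)         # composed form
nfd = unicodedata.normalize("NFD", precomposed)        # decomposed form

# Arabic BEH + FATHA (a vowel sign): no precomposed codepoint exists,
# so even the composed form (NFC) keeps two codepoints.
beh_fatha = "\u0628\u064e"
beh_fatha_nfc = unicodedata.normalize("NFC", beh_fatha)

if __name__ == "__main__":
    print(same_codepoints, lengths)
    print(replaced_precomposed == precomposed, replaced_decomposed == "o\u030a")
    print(nfc == precomposed, nfd == decomposed)
    print(len(beh_fatha_nfc))
```

The replace only behaves consistently once both strings are in the same normal form, and even then a "character" may span several codepoints, as the Arabic pair shows.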