
On October 20, 2010 15:45:44 Axel Simon wrote:
AFAIK there are scripts with so many combinations that Unicode does not have a single codepoint for each character. In Arabic you can have one of 5 vowel signs on each of the 28 letters, but Unicode does not provide 5*28 codepoints for the combinations. That is probably the reason for having these combining characters.
Mac OS tries to split characters into as many codepoints as possible (the decomposed form), whereas Windows tries to merge them into as few codepoints as possible (the composed form). I don't think there is a good semantics for replace without knowing which (normal) form you're working on. Normally, search/replace and sorting on Unicode are specialized algorithms that cannot be reduced to simple substitutions or permutations.
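
As a concrete illustration of the two conventions described above, here is a minimal Python sketch using only the standard unicodedata module (NFC is the merged/composed form roughly matching the Windows convention, NFD the split/decomposed form roughly matching Mac OS file names); the Arabic example shows a combination for which no precomposed codepoint exists:

    import unicodedata

    composed = "\u00E9"        # e-acute as one precomposed codepoint
    decomposed = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT

    # NFC merges base + mark into a precomposed codepoint where one exists.
    print(unicodedata.normalize("NFC", decomposed) == composed)   # True

    # NFD splits a precomposed codepoint back into base + combining mark.
    print(unicodedata.normalize("NFD", composed) == decomposed)   # True

    # Arabic BEH + FATHA has no precomposed codepoint, so even NFC
    # leaves it as two codepoints.
    arabic = "\u0628\u064E"
    print(len(unicodedata.normalize("NFC", arabic)))               # 2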
Thanks to everyone for the examples.

Given that not all combined characters can be reduced to a single codepoint (from your first paragraph), it would seem that the Mac OS normalization has a conceptual advantage over the Windows one. Specifically, it is appealing that the normalized string is in some sense less complex, in that it contains only elementary codepoints (ones that cannot be decomposed further) and combining codepoints, whereas the other still contains a mix.

Am I correct, then, in understanding that, viewing strings as a vector/list of elementary chars, each elementary char would actually have to be a base codepoint plus an arbitrary number of additional combining codepoints in order to correspond well to the human notion of a character? This doesn't map well onto the existing vector/list style interfaces, because such an elementary char type is not a simple enumeration that can be treated atomically. Operations would frequently need to look inside it (e.g., replace base codepoints irrespective of the combining codepoints attached to them).

Cheers! -Tyson
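
As a concrete sketch of the kind of operation described in the reply above (replacing base codepoints while leaving whatever combining codepoints are attached to them), the following Python example uses only the standard unicodedata module; the clusters/replace_base helpers are invented for this illustration, and grouping by combining class is a simplification of full grapheme clustering (UAX #29):

    import unicodedata

    def clusters(s):
        # Start a new group at each non-combining codepoint; combining
        # marks are appended to the preceding group.  (A simplification
        # of the real grapheme-cluster rules in UAX #29.)
        groups = []
        for ch in s:
            if unicodedata.combining(ch) == 0 or not groups:
                groups.append(ch)
            else:
                groups[-1] += ch
        return groups

    def replace_base(s, old_base, new_base):
        # Swap the base codepoint of each cluster, keeping its marks.
        out = []
        for g in clusters(s):
            base, marks = g[0], g[1:]
            out.append((new_base if base == old_base else base) + marks)
        return "".join(out)

    text = "re\u0301sume\u0301"            # "résumé" in decomposed form
    print(clusters(text))                  # ['r', 'é', 's', 'u', 'm', 'é']
    print(replace_base(text, "e", "a"))    # "rásumá": accents stay attached

This is exactly the sort of "looking inside the elementary char" that a flat list-of-codepoints interface does not provide for free.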