Re: Haskell Platform Proposal: add the 'text' library

21 Oct 2010

On Wed, 2010-10-20 at 11:11 -0400, Tyson Whitehead wrote:
...
On October 19, 2010 19:35:33 Duncan Coutts wrote:
...
Right, that's a very common misunderstanding of Unicode. A Unicode
code point (type Char) does not correspond 1:1 with the human notion
of a character. It would be nice if it did, but unfortunately the
mismatch is not something we can ignore. Because of this it is better
to think not of operations on individual Chars but of operations on
short sequences of Chars. In any case, when processing text (even ASCII,
where Chars do match characters), many of the most common operations
you want are substring-based, not element-based.

I read the Wikipedia article on code points, but still do not feel I have a
firm grasp of what exactly you are referring to.
If you have a few minutes, would you mind providing a short, specific example
to clarify this (e.g., a specific code point that causes problems for a 1:1
model, and what those problems are)?

Combining characters are the major issue. These are things like accents,
but there are many more of them in some other languages. Most of the
European languages also have all-in-one code points that combine
the base character with the extra mark (because those already existed in
previous character sets), but for many other languages the canonical
form is made up of multiple code points (and not necessarily just 2).

So if you're searching for a particular "character" then searching for a
single Char is not sufficient; you need to search for a short sequence
of Chars.
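
As a rough illustration, here is a minimal sketch using plain String and
only the base library. The names precomposed, decomposed, cafe1 and cafe2
are made up for this example; it assumes the accented letter is "é", which
can be written either as the single code point U+00E9 or as 'e' followed
by U+0301 (COMBINING ACUTE ACCENT).

```haskell
import Data.Char (GeneralCategory (NonSpacingMark), generalCategory)
import Data.List (isInfixOf)

-- "é" as one all-in-one (precomposed) code point, U+00E9.
precomposed :: String
precomposed = "\x00E9"

-- "é" as the base letter 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed :: String
decomposed = "e\x0301"

-- Two spellings of "café" that a human reader sees as the same word.
cafe1, cafe2 :: String
cafe1 = "caf" ++ precomposed
cafe2 = "caf" ++ decomposed

main :: IO ()
main = do
  -- The two spellings differ in their number of Chars.
  print (length cafe1, length cafe2)                  -- (4,5)
  -- Searching for the single Char misses the decomposed spelling.
  print ('\x00E9' `elem` cafe2)                       -- False
  -- Searching for the short sequence of Chars finds it.
  print (decomposed `isInfixOf` cafe2)                -- True
  -- U+0301 is a combining mark, not a stand-alone character.
  print (generalCategory '\x0301' == NonSpacingMark)  -- True
```

Note that this sketch only matches one spelling at a time; a real search
would presumably normalise both the text and the pattern first, which is
not shown here.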

Duncan