
On Wed, 2010-10-20 at 11:11 -0400, Tyson Whitehead wrote:
On October 19, 2010 19:35:33 Duncan Coutts wrote:
Right, that's a very common misunderstanding of Unicode. A Unicode code point (type Char) does not correspond 1:1 with the human notion of a character. It would be nice if it did, but unfortunately it is not something we can ignore. Because of this it is better not to think of operations on individual Chars but on short sequences of Chars. In any case, when processing text (even ASCII, where Chars do match characters) many of the most common operations that you want are substring-based, not element-based.
I read the Wikipedia article on code points, but still do not feel I have a firm grasp as to what exactly you are referring to.
If you have a few minutes, would you mind providing a short, specific example to clarify this (e.g., a particular code point that gives issues with a 1:1 model, and what those issues are)?
Combining characters are the major one. These are things like accents, but there are many more of them in some other languages. For most of the European languages there are also all-in-one code points that combine the base character with the extra mark (because those already existed in previous character sets), but for many other languages the canonical form is made up of multiple code points (and not necessarily just 2). So if you're searching for a particular "character", searching for a single Char is not sufficient; you need to search for a short sequence of Chars.

Duncan
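
A minimal sketch of the point about combining characters, using only base. The word "café" and the names precomposed and decomposed are chosen here purely for illustration. It spells the same word two ways: once with the all-in-one code point U+00E9, and once with a plain 'e' followed by the combining acute accent U+0301. Both render the same, but a search for the single Char only finds the first.

```haskell
import Data.List (isInfixOf)

-- "café" with the accented letter as one precomposed code point (U+00E9).
precomposed :: String
precomposed = "caf\x00E9"

-- "café" with the accented letter as 'e' plus combining acute accent (U+0301).
decomposed :: String
decomposed = "cafe\x0301"

main :: IO ()
main = do
  -- Same visible text, different numbers of Chars.
  print (length precomposed)                 -- 4
  print (length decomposed)                  -- 5
  -- The two spellings are not equal as Strings.
  print (precomposed == decomposed)          -- False
  -- Searching for the single Char misses the decomposed spelling.
  print ('\x00E9' `elem` precomposed)        -- True
  print ('\x00E9' `elem` decomposed)         -- False
  -- Searching for the short sequence of Chars finds it.
  print ("e\x0301" `isInfixOf` decomposed)   -- True
```

A robust search would have to handle both spellings, for instance by normalising the text first; base does not provide Unicode normalisation, so that step is left out of this sketch.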