
On 26 Sep 2007, at 7:05 pm, Johan Tibell wrote:
> If UTF-16 is what's used by everyone else (how about Java? Python?) I
> think that's a strong reason to use it. I don't know Unicode well enough
> to say otherwise.
Java uses 16-bit variables to hold characters. This is SOLELY for historical reasons, not because it is a good choice. The history is a bit funny: the ISO 10646 group were working away at defining a 31-bit character set, and the industry screamed blue murder about how this was going to ruin the economy, bring back the Dark Ages, &c, and promptly set up the Unicode consortium to define a 16-bit character set that could do the same job. Early versions of Unicode had only about 30,000 characters, after heroic (and not entirely appreciated) efforts at unifying Chinese characters as used in China with those used in Japan and those used in Korea. They also lumbered themselves (so that they would have a fighting chance of getting Unicode adopted) with a "round trip conversion" policy, namely that it should be possible to take characters in ANY current encoding standard, convert them to Unicode, and then convert back to the original encoding with no loss of information. This led to failure of unification: there are two versions of Å (one for ordinary use, one for Angstroms), two versions of mu (one for Greek, one for micron), three complete copies of ASCII, &c.

However, 16 bits really is not enough. Here's a table from
http://www.unicode.org/versions/Unicode5.0.0/

    Graphic         98,884
    Format             140
    Control             65
    Private Use    137,468
    Surrogate        2,048
    Noncharacter        66
    Reserved       875,441

Excluding Private Use and Reserved, I make that 101,203 currently defined codes (98,884 + 140 + 65 + 2,048 + 66). That's nearly 1.5 times the number that would fit in 16 bits.

Java has had to deal with this, don't think it hasn't. For example, where Java had one set of functions referring to characters in strings by position, it now has two complete sets: one that works in terms of *which 16-bit code* (which is fast) and one that works in terms of *which actual Unicode character* (which is slow). The key point is that the second set is *always* slow, even when there are no characters outside the Basic Multilingual Plane. (There is a small sketch of the two APIs at the end of this message.)

One Smalltalk system I sometimes use has three complete string implementations (all characters fit in a byte, all characters fit in 16 bits, some characters require more) and dynamically switches from narrow strings to wide strings behind your back. In a language with read-only strings, that makes a lot of sense; it's just a pity Smalltalk isn't one.

If you want to minimize conversion effort when talking to the operating system, files, and other programs, UTF-8 is probably the way to go. (That's on Unix. For Windows it might be different.) If you want to minimize the effort of recognising character boundaries while processing strings, 32-bit characters are the way to go. If you want to be able to index into a string efficiently, they are the *only* way to go (see the second sketch at the end). Solaris bit the bullet many years ago; Sun C compilers jumped straight from 8-bit wchar_t to 32-bit without ever stopping at 16.

16-bit characters *used* to be a reasonable compromise, but aren't any longer. Unicode keeps on growing. There were 1,349 new characters from Unicode 4.1 to Unicode 5.0 (IIRC). There are lots more scripts in the pipeline. (What the heck _is_ Tangut, anyway?)
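
To make the Java point concrete, here is a minimal sketch (plain Java SE; the class name and the example string are my own) of the two families of String accessors: the code-unit ones, which index in O(1) but see surrogate halves, and the code-point ones, which see real characters but have to walk the string to find them.

    // Sketch only.  U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic
    // Multilingual Plane, so UTF-16 must represent it as a surrogate pair.
    public class CodeUnitVsCodePoint {
        public static void main(String[] args) {
            String s = "a" + new String(Character.toChars(0x1D11E)) + "b";

            // The 16-bit-code view: O(1) indexing, but it counts and
            // returns surrogate halves, not characters.
            System.out.println(s.length());                    // 4 code units
            System.out.println(Integer.toHexString(s.charAt(1))); // d834, half a character

            // The Unicode-character view: counts real characters, but
            // finding the n-th one means scanning from the start.
            System.out.println(s.codePointCount(0, s.length()));  // 3 characters
            int i = s.offsetByCodePoints(0, 1);                // start of 2nd character
            System.out.println(Integer.toHexString(s.codePointAt(i))); // 1d11e
        }
    }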
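
And for the indexing point: with a fixed-width 32-bit representation, "give me the n-th character" is a plain array access, which is what neither UTF-8 nor UTF-16 can offer. A second sketch, again in plain Java and with the same example string (decoding into an int array of code points; the names are my own):

    public class CodePointIndexing {
        public static void main(String[] args) {
            String s = "a" + new String(Character.toChars(0x1D11E)) + "b";

            // Decode once into fixed-width 32-bit code points ...
            int[] wide = s.codePoints().toArray();

            // ... after which the n-th character is an O(1) array access,
            // at the cost of up to four bytes per character.
            System.out.println(wide.length);                    // 3 characters
            System.out.println(Integer.toHexString(wide[1]));   // 1d11e

            // The UTF-16 String itself can only answer the same question
            // by scanning code units from the front:
            System.out.println(Integer.toHexString(
                s.codePointAt(s.offsetByCodePoints(0, 1))));    // 1d11e
        }
    }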