
27 Sep
2007
27 Sep
'07
6:39 a.m.
On 2007-09-27, Deborah Goldsmith
On Sep 26, 2007, at 11:06 AM, Aaron Denney wrote:
UTF-16 has no advantage over UTF-8 in this respect, because of surrogate pairs and combining characters.
Good point.
Well, not so much. As Duncan mentioned, it's a matter of what the most common case is. UTF-16 is effectively fixed-width for the majority of text in the majority of languages. Combining sequences and surrogate pairs are relatively infrequent.
Infrequent, but they exist, which means you can't seek x/2 bytes ahead to seek x characters ahead. All such seeking must be linear for both UTF-16 *and* UTF-8. -- Aaron Denney -><-