
Bulat Ziganshin wrote:
> Johan wrote:
>> So it's not clear to me that using UTF-16 makes a real program noticeably slower or more memory-hungry.
>
> It's a clear misunderstanding. Of course, not every program holds much text data in memory. But some do, and there you will double memory usage.
I write programs that hold onto a good deal of natural language text; a few million words at least. Getting efficient Unicode for that is a high priority. However, all of that text is in Japanese, Chinese, Arabic, Hindi, Urdu, ... That's the reason I want Unicode. I'm pretty sure UTF-16 isn't going to cause any special problems here.

For NLP work, any language with a vaguely ASCII format isn't a problem. We've been shoving English and western European languages into a subset of ASCII for years (heck, we don't even allow real parentheses!).

For the mostly English files on my hard drive, UTF-8 is a clear win. But when it comes to programming, I'm not so sure. I'd like to see some good benchmarks and a clear explanation of where the costs are. Relying on intuitions is notoriously bad for these kinds of encoding issues.

-- 
Live well,
~wren
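
As a starting point for the kind of measurement asked for above, here is a minimal Python sketch that compares the encoded size of a few short strings in UTF-8 and UTF-16. The sample strings and the helper name `encoded_sizes` are made up for illustration; it only measures encoded byte counts, not the in-memory representation or speed of any particular string library, so it is no substitute for a real benchmark.

```python
def encoded_sizes(text):
    """Return code point count and encoded byte counts for text."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # "utf-16-le" avoids counting the 2-byte byte order mark.
        "utf16_bytes": len(text.encode("utf-16-le")),
    }


# Hypothetical sample strings, chosen only to cover different scripts.
SAMPLES = {
    "english": "Live well",
    "japanese": "日本語のテキスト",
    "hindi": "नमस्ते दुनिया",
}

if __name__ == "__main__":
    for name, text in SAMPLES.items():
        sizes = encoded_sizes(text)
        print(name, sizes)
```

On samples like these, the ASCII string doubles in size under UTF-16, while the Japanese and Hindi strings come out smaller in UTF-16 than in UTF-8, since their characters take three bytes each in UTF-8 and two in UTF-16. Real corpora mix scripts, markup, and whitespace, so the ratio for a given workload has to be measured on that workload.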