
----- Original Message -----
From: "Ketil Malde"
for a long time. 16-bit Unicode should be gotten rid of, being the worst of both worlds: not backwards compatible with ASCII, endianness issues, and no constant-length encoding.... UTF-8 externally and UTF-32 when working with individual characters is the way to go.
I totally agree with you.
Now, what are your technical arguments for this position? (By the way, UTF-16 isn't going to go away; it's very firmly established.)
What's wrong with the ones already mentioned?
You have endianness issues, and you need to explicitly type text files or insert BOMs.
You have to distinguish between the encoding form (what you use internally) and the encoding scheme (what is used externally). For the encoding form, there is no endian issue, just as there is no endian issue for an int inside your program. For the encoding form there is no BOM either (or rather, it should have been removed upon reading, if the data came from an external source). I agree that the BOM (for all of the Unicode encoding schemes) and the byte order issue (for the non-UTF-8, external encoding schemes) are a pain. But as I said: they will not go away now; they are too firmly established.
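A minimal sketch of that distinction in Python (the helper name read_utf16 is mine, purely for illustration): the BOM and the byte order are properties of the external encoding scheme, and both are gone once the bytes have been decoded into the internal form.

    import codecs

    def read_utf16(raw: bytes) -> str:
        # The BOM picks the byte order -- an encoding *scheme* concern --
        # and is stripped during decoding; the resulting string is the
        # encoding *form*, where neither BOM nor byte order exists.
        if raw.startswith(codecs.BOM_UTF16_LE):
            return raw[2:].decode("utf-16-le")
        if raw.startswith(codecs.BOM_UTF16_BE):
            return raw[2:].decode("utf-16-be")
        # No BOM: the Unicode standard says to assume big-endian.
        return raw.decode("utf-16-be")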
A UTF-8 stream limited to 7-bit ASCII simply is that ASCII stream.
Which is a large portion of the raison d'ĂȘtre for UTF-8.
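A one-line check of that compatibility, using Python's built-in codecs:

    # Pure ASCII text encodes to exactly the same bytes under UTF-8,
    # so any ASCII-clean tool can consume it unchanged.
    text = "hello, world"
    assert text.encode("utf-8") == text.encode("ascii")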
When not limited to ASCII, it at least avoids zero bytes and other potential problems. UTF-16 will, among other things, be full of NULLs.
Yes, and so what? So will a file filled with image data, video clips, or simply a list of raw integers dumped to a file (not formatted as strings). I know, many old utility programs choke on NULL bytes, but that's not Unicode's fault. Further, NULL (as a character) is a perfectly valid character code. It always was.
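To see where those zero bytes come from (a quick sketch with Python's codecs):

    # Each ASCII character occupies a full 16-bit code unit in UTF-16,
    # so its high byte is always zero.
    assert "AB".encode("utf-16-le") == b"A\x00B\x00"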
I can understand UCS-2 looking attractive when it looked like a fixed-length encoding, but that no longer applies.
So it is not surprising that most people involved do not consider UTF-16 a bad idea. The extra complexity is minimal, and it surfaces only rarely.
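That extra complexity is the surrogate mechanism: characters outside the Basic Multilingual Plane take two 16-bit code units instead of one. A small Python sketch:

    # U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so UTF-16
    # encodes it as a surrogate pair: two code units, four bytes.
    assert len("\U0001D11E".encode("utf-16-le")) == 4
    # A BMP character such as U+00E9 needs only a single code unit.
    assert len("\u00E9".encode("utf-16-le")) == 2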
But it needs to be there. It will introduce larger programs, more bugs
True. But implementing normalisation, or case mapping for that matter, is non-trivial too. In practice, the additional complexity with UTF-16 seems small.
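For comparison, here is what even the simplest use of normalisation looks like, sketched with Python's standard unicodedata module; the tables behind it dwarf the surrogate-pair logic above:

    import unicodedata

    # Composed U+00E9 and decomposed e + U+0301 are the same text to a
    # user, yet compare unequal until normalised.
    composed, decomposed = "\u00E9", "e\u0301"
    assert composed != decomposed
    assert unicodedata.normalize("NFC", decomposed) == composed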
, lower efficiency.
Debatable.
BMP characters are still (relatively) easy to process, and it saves memory and cache misses when large amounts of text data are processed (e.g. in databases).
I couldn't find anything about the relative efficiencies of UTF-8 and UTF-16 for various languages. Do you have any pointers? From a Scandinavian POV (using ASCII plus a handful of extra characters), UTF-8 should be a big win, but I'm sure there are counterexamples.
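In lieu of pointers, a rough way to measure it on sample text (the samples are mine):

    # Compare encoded sizes across scripts.  ASCII and Scandinavian text
    # favour UTF-8; CJK text (3 bytes/char in UTF-8) favours UTF-16.
    for sample in ("plain ASCII text", "blÄbĂŠrsyltetĂžy", "æ—„æœŹèȘžăźăƒ†ă‚­ă‚čト"):
        u8 = len(sample.encode("utf-8"))
        u16 = len(sample.encode("utf-16-le"))
        print(f"UTF-8: {u8:3d} bytes, UTF-16: {u16:3d} bytes  {sample}")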
So, how big is our personal hard disk now? 3 GiB? 10 GiB? How many images, MP3 files, and video clips do you have? (I'm sorry, but your argument here is getting old and stale. Very few worry about that aspect anymore, except when it comes to databases stored in RAM, and to UTF-16 vs. UTF-32, where UTF-32 is guaranteed to be wasteful.)

Kind regards
/kent k