
[Posted to haskell-cafe, since it's getting quite off topic]
"Kent Karlsson"
>>> for a long time. 16-bit Unicode should be gotten rid of, being the worst of both worlds: not backwards compatible with ASCII, endianness issues, and no constant-length encoding... UTF-8 externally and UTF-32 when working with individual characters is the way to go.
>> I totally agree with you.
> Now, what are your technical arguments for this position? (B.t.w., UTF-16 isn't going to go away, it's very firmly established.)
What's wrong with the ones already mentioned? You have endianness issues, and you need to explicitly type text files or insert BOMs. A UTF-8 stream limited to 7-bit ASCII simply is that ASCII stream. When not limited to ASCII, at least it avoids zero bytes and other potential problems. UTF-16 will, among other things, be full of NULLs. I can understand UCS-2 looking attractive when it looked like a fixed-length encoding, but that no longer applies.
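For concreteness, a minimal sketch of the byte-level difference, assuming GHC with the text and bytestring packages (an illustration, not something from the thread):

import qualified Data.ByteString as BS
import Data.Text (pack)
import Data.Text.Encoding (encodeUtf8, encodeUtf16LE)

main :: IO ()
main = do
  let s = pack "plain ASCII text"
  -- UTF-8 of pure ASCII is byte-for-byte the ASCII stream: no zero bytes.
  print (BS.unpack (encodeUtf8 s))
  -- UTF-16LE puts a 0x00 byte after every ASCII character.
  print (BS.unpack (encodeUtf16LE s))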
> So it is not surprising that most people involved do not consider UTF-16 a bad idea. The extra complexity is minimal, and furthermore surfaces rarely.
But it needs to be there. It will introduce larger programs, more bugs, lower efficiency.
> BMP characters are still (relatively) easy to process, and it saves memory space and cache misses when large amounts of text data are processed (e.g. databases).
I couldn't find anything about the relative efficiencies of UTF-8 and UTF-16 for various languages. Do you have any pointers? From a Scandinavian POV (using ASCII plus a handful of extra characters), UTF-8 should be a big win, but I'm sure there are counterexamples.

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants

----- Original Message -----
From: "Ketil Malde"
>>>> for a long time. 16-bit Unicode should be gotten rid of, being the worst of both worlds: not backwards compatible with ASCII, endianness issues, and no constant-length encoding... UTF-8 externally and UTF-32 when working with individual characters is the way to go.
>>> I totally agree with you.
>> Now, what are your technical arguments for this position? (B.t.w., UTF-16 isn't going to go away, it's very firmly established.)
> What's wrong with the ones already mentioned?
> You have endianness issues, and you need to explicitly type text files or insert BOMs.
You have to distinguish between the encoding form (what you use internally) and encoding scheme (externally). For the encoding form, there is no endian issue, just like there is no endian issue for an int internally in your program. For the encoding form there is no BOM either (or rather, it should have been removed upon reading, if the data is taken in from an external source).

But I agree that the BOM (for all of the Unicode encoding schemes) and the byte-order issue (for the non-UTF-8 encoding schemes; the external ones) are a pain. But as I said: they will not go away now; they are too firmly established.
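As a rough sketch of that reading step (assuming the text and bytestring packages; the helper name is made up for illustration), the external encoding scheme is bytes plus an optional BOM, and the BOM is consumed on the way to the internal encoding form:

import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf16BE, decodeUtf16LE)

-- Hypothetical helper: external UTF-16 encoding scheme (bytes, optional BOM)
-- to internal encoding form (Text); the BOM is dropped, not kept.
decodeUtf16WithBom :: BS.ByteString -> T.Text
decodeUtf16WithBom bs
  | BS.take 2 bs == BS.pack [0xFF, 0xFE] = decodeUtf16LE (BS.drop 2 bs)  -- little-endian BOM
  | BS.take 2 bs == BS.pack [0xFE, 0xFF] = decodeUtf16BE (BS.drop 2 bs)  -- big-endian BOM
  | otherwise                            = decodeUtf16BE bs              -- no BOM: big-endian by default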
> A UTF-8 stream limited to 7-bit ASCII simply is that ASCII stream.
Which is a large portion of the raison d'ĂȘtre for UTF-8.
> When not limited to ASCII, at least it avoids zero bytes and other potential problems. UTF-16 will, among other things, be full of NULLs.
Yes, and so what? So will a file filled with image data, video clips, or simply a list of raw integers dumped to a file (not formatted as strings). I know, many old utility programs choke on NULL bytes, but that's not Unicode's fault. Further, NULL (as a character) is a perfectly valid character code. Always was.
> I can understand UCS-2 looking attractive when it looked like a fixed-length encoding, but that no longer applies.
>> So it is not surprising that most people involved do not consider UTF-16 a bad idea. The extra complexity is minimal, and furthermore surfaces rarely.
> But it needs to be there. It will introduce larger programs, more bugs
True. But implementing normalisation, or case mapping for that matter, is non-trivial too. In practice, the additional complexity with UTF-16 seems small.
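The extra step UTF-16 adds over a fixed-width form is small enough to show in full; a sketch of the surrogate-pair arithmetic (the function name is invented for illustration):

import Data.Bits (shiftL, (.&.))
import Data.Char (chr)
import Data.Word (Word16)

-- Combine a high and a low surrogate into the code point they encode.
fromSurrogatePair :: Word16 -> Word16 -> Char
fromSurrogatePair hi lo =
  chr (0x10000
       + (fromIntegral (hi .&. 0x3FF) `shiftL` 10)
       + fromIntegral (lo .&. 0x3FF))

-- e.g. fromSurrogatePair 0xD834 0xDD1E == '\x1D11E' (MUSICAL SYMBOL G CLEF)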
> , lower efficiency.
Debatable.
>> BMP characters are still (relatively) easy to process, and it saves memory space and cache misses when large amounts of text data are processed (e.g. databases).
> I couldn't find anything about the relative efficiencies of UTF-8 and UTF-16 for various languages. Do you have any pointers? From a Scandinavian POV (using ASCII plus a handful of extra characters), UTF-8 should be a big win, but I'm sure there are counterexamples.
So, how big is our personal hard disk now? 3 GiB? 10 GiB? How many images, mp3 files and video clips do you have? (I'm sorry, but your argument here is getting old and stale. Very few worry about that aspect anymore, except when it comes to databases stored in RAM, and to UTF-16 vs. UTF-32, where UTF-32 is guaranteed to be wasteful.)

Kind regards
/kent k
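Since the size question keeps coming up, it can be measured rather than argued about; a sketch, assuming the text and bytestring packages, with sample strings chosen only for illustration:

import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8, encodeUtf16LE, encodeUtf32LE)

-- Byte counts of the same text under the three encoding forms.
sizes :: T.Text -> (Int, Int, Int)
sizes t = ( BS.length (encodeUtf8 t)     -- 1 byte/char for ASCII, 3 for most BMP CJK
          , BS.length (encodeUtf16LE t)  -- 2 bytes/char for the whole BMP
          , BS.length (encodeUtf32LE t)  -- 4 bytes/char, always
          )

main :: IO ()
main = do
  print (sizes (T.pack "blÄbÊrsyltetÞy pÄ brÞdskiva"))  -- mostly ASCII: UTF-8 smallest
  print (sizes (T.pack "æ—„æœŹèȘžăźăƒ†ă‚­ă‚čト"))               -- BMP CJK: UTF-16 smallest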

"Kent Karlsson"
>> You have endianness issues, and you need to explicitly type text files or insert BOMs.
> You have to distinguish between the encoding form (what you use internally) and encoding scheme (externally).
Good point, of course. Most of the arguments apply to the external encoding scheme, but I suppose it wasn't clear which of them we were discussing.
> But as I said: they will not go away now; they are too firmly established.
Yep. But it appears that the "right" choice for the external encoding scheme would be UTF-8.
>> When not limited to ASCII, at least it avoids zero bytes and other potential problems. UTF-16 will, among other things, be full of NULLs.
> Yes, and so what?
So, I can use it for file names, in regular expressions, and in legacy applications that expect textual data. That may be worthless to you, but it isn't to me.
> So will a file filled with image data, video clips, or simply a list of raw integers dumped to a file (not formatted as strings).
But none of these pretend to be text!
> True. But implementing normalisation, or case mapping for that matter, is non-trivial too. In practice, the additional complexity with UTF-16 seems small.
All right, but if there are no real advantages, why bother?
>> I couldn't find anything about the relative efficiencies of UTF-8 and UTF-16 for various languages.
> So, how big is our personal hard disk now? 3 GiB? 10 GiB? How many images, mp3 files and video clips do you have? (I'm sorry, but your argument here is getting old and stale.
Don't be sorry. I'm just looking for a good argument in favor of UTF-16 instead of UTF-8, and size was the only possibility I could think of offhand. (And apparently, the Japanese are unhappy with the 50% increase of UTF-8's three-byte encoding over UTF-16's two-byte one.)

You could run the same argument against UTF-16 vs. UTF-32 as the internal encoding form: memory and memory bandwidth are getting cheap these days too, although memory is still a more expensive resource than disk. But (I assume) the internal encoding form shouldn't matter as much, as it would be hidden from everybody but the Unicode library implementor. It boils down to performance, which can be measured.

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants
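One way to measure it, sketched with the criterion package (the package choice and the sample text are assumptions of this illustration, not anything from the thread):

import Criterion.Main (bench, bgroup, defaultMain, nf)
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8, encodeUtf16LE)

-- Sample text chosen only for illustration; substitute a real corpus.
sample :: T.Text
sample = T.replicate 10000 (T.pack "blÄbÊrsyltetÞy pÄ brÞdskiva ")

main :: IO ()
main = defaultMain
  [ bgroup "encode"
      [ bench "utf8"  (nf encodeUtf8    sample)
      , bench "utf16" (nf encodeUtf16LE sample)
      ]
  ]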