
"Kent Karlsson"
You have endianness issues, and you need to explicitly type text files or insert BOMs.
You have to distinguish between the encoding form (what you use internally) and encoding scheme (externally).
Good point, of course. Most of the arguments apply to the external encoding scheme, but I suppose it wasn't clear which of them we were discussing.
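To make the endianness and BOM point concrete, here is a small Python sketch (my own illustration, not something posted in the thread). It uses only the standard codecs; the variable names are arbitrary.

```python
text = "A"

# Encoding schemes with an explicit byte order: no BOM is written.
be = text.encode("utf-16-be")    # high byte first
le = text.encode("utf-16-le")    # low byte first

# The plain "utf-16" scheme prepends a BOM so a reader can tell which
# byte order follows; Python picks the platform's native order here.
with_bom = text.encode("utf-16")

# UTF-8 is a byte sequence with only one possible order, so the same
# text has exactly one serialization and needs no BOM.
utf8 = text.encode("utf-8")

print(be, le, with_bom, utf8)
```

The same single character comes out as three different byte sequences under UTF-16, which is why the file either has to be explicitly typed or has to carry a BOM.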
But as I said: they will not go away now, they are too firmly established.
Yep. But it appears that the "right" choice for the external encoding scheme would be UTF-8.
When not limited to ASCII, at least it avoids zero bytes and other potential problems. UTF-16 will, among other things, be full of NULs.
Yes, and so what?
So, I can use it for file names, in regular expressions, and in whatever legacy applications expect textual data. That may be worthless to you, but it isn't to me.
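A rough Python sketch of the zero-byte issue, assuming a hypothetical file name and a legacy consumer that treats the first zero byte as end-of-string (as C string functions do):

```python
# Hypothetical file name containing one non-ASCII character (i with diaeresis).
name = "na\u00efve file.txt"

utf8 = name.encode("utf-8")
utf16 = name.encode("utf-16-le")

# UTF-8 only produces a zero byte for U+0000 itself.
print(0 in utf8)     # False
# In UTF-16, every character below U+0100 carries a zero byte.
print(0 in utf16)    # True

# What a NUL-terminated reader would see of each encoding:
seen_utf8 = utf8.split(b"\x00")[0]
seen_utf16 = utf16.split(b"\x00")[0]
print(seen_utf8 == utf8)    # True: the whole name survives
print(seen_utf16)           # b'n': truncated after the first character
```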
So will a file filled with image data, video clips, or plainly a list of raw integers dumped to file (not formatted as strings).
But none of these pretend to be text!
True. But implementing normalisation, or case mapping for that matter, is non-trivial too. In practice, the additional complexity with UTF-16 seems small.
All right, but if there are no real advantages, why bother?
I couldn't find anything about the relative efficiencies of UTF-8 and UTF-16 on various languages.
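Absent published figures, a quick way to get a feel for the relative sizes is to encode a few samples and compare. This is only a sketch with short sample strings chosen for illustration; real text mixes in ASCII markup, spaces and digits, so actual ratios will differ.

```python
# Illustrative sample strings, not a representative corpus.
samples = {
    "English": "Hello, world",
    "Japanese": "こんにちは世界",
}

sizes = {}
for language, text in samples.items():
    n8 = len(text.encode("utf-8"))
    n16 = len(text.encode("utf-16-le"))   # explicit byte order, so no BOM is counted
    sizes[language] = (n8, n16)
    print(f"{language}: UTF-8 {n8} bytes, UTF-16 {n16} bytes, ratio {n8 / n16:.2f}")
```

For pure ASCII, UTF-8 is half the size of UTF-16; for text made up entirely of kana and kanji it is 1.5 times the size.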
So, how big is our personal hard disk now? 3GiB? 10GiB? How many images, mp3 files and video clips do you have? (I'm sorry, but your argument here is getting old and stale.)
Don't be sorry. I'm just looking for a good argument in favor of UTF-16 instead of UTF-8, and size was the only possibility I could think of offhand. (And apparently, the Japanese are unhappy with the 50% size increase of UTF-8's three-byte encoding over UTF-16's two-byte one.) You could run the same argument against UTF-16 vs UTF-32 as internal encoding form: memory and memory bandwidth are getting cheap these days, too, although memory is still a more expensive resource than disk. But (I assume) the internal encoding form shouldn't matter as much, as it would be hidden from everybody but the Unicode library implementor. It boils down to performance, which can be measured.
-kzm
--
If I haven't seen further, it is by standing in the footprints of giants