
[Posted to haskell-cafe, since it's getting quite off topic]
"Kent Karlsson"
>>> for a long time. 16-bit Unicode should be gotten rid of, being the worst of both worlds: not backwards compatible with ASCII, endianness issues, and no constant-length encoding... UTF-8 externally and UTF-32 when working with individual characters is the way to go.
>> I totally agree with you.
> Now, what are your technical arguments for this position? (B.t.w., UTF-16 isn't going to go away, it's very firmly established.)
What's wrong with the ones already mentioned? You have endianness issues, and you need to explicitly type text files or insert BOMs. A UTF-8 stream limited to 7-bit ASCII simply is that ASCII stream. When not limited to ASCII, at least it avoids zero bytes and other potential problems. UTF-16 will, among other things, be full of NULLs. I can understand UCS-2 looking attractive when it looked like a fixed-length encoding, but that no longer applies.
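For concreteness, a minimal sketch of the byte-level difference, assuming GHC with the text and bytestring packages (an illustration, not something from the thread):

import qualified Data.ByteString as BS
import Data.Text (pack)
import Data.Text.Encoding (encodeUtf8, encodeUtf16LE)

main :: IO ()
main = do
  let s = pack "plain ASCII text"
  -- UTF-8 of pure ASCII is byte-for-byte the ASCII stream: no zero bytes.
  print (BS.unpack (encodeUtf8 s))
  -- UTF-16LE puts a 0x00 byte after every ASCII character.
  print (BS.unpack (encodeUtf16LE s))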
> So it is not surprising that most people involved do not consider UTF-16 a bad idea. The extra complexity is minimal, and furthermore surfaces rarely.
But it needs to be there. It will introduce larger programs, more bugs, lower efficiency.
> BMP characters are still (relatively) easy to process, and it saves memory space and cache misses when large amounts of text data are processed (e.g. databases).
I couldn't find anything about the relative efficiencies of UTF-8 and UTF-16 for various languages. Do you have any pointers? From a Scandinavian POV (using ASCII plus a handful of extra characters), UTF-8 should be a big win, but I'm sure there are counterexamples.

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants

----- Original Message -----
From: "Ketil Malde"
>>>> for a long time. 16-bit Unicode should be gotten rid of, being the worst of both worlds: not backwards compatible with ASCII, endianness issues, and no constant-length encoding... UTF-8 externally and UTF-32 when working with individual characters is the way to go.
>>> I totally agree with you.
>> Now, what are your technical arguments for this position? (B.t.w., UTF-16 isn't going to go away, it's very firmly established.)
> What's wrong with the ones already mentioned?
> You have endianness issues, and you need to explicitly type text files or insert BOMs.
You have to distinguish between the encoding form (what you use internally) and encoding scheme (externally). For the encoding form, there is no endian issue, just like there is no endian issue for an int internally in your program. For the encoding form there is no BOM either (or rather, it should have been removed upon reading, if the data is taken in from an external source).

But I agree that the BOM (for all of the Unicode encoding schemes) and the byte-order issue (for the non-UTF-8 encoding schemes; the external ones) are a pain. But as I said: they will not go away now; they are too firmly established.
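As a rough sketch of that reading step (assuming the text and bytestring packages; the helper name is made up for illustration), the external encoding scheme is bytes plus an optional BOM, and the BOM is consumed on the way to the internal encoding form:

import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf16BE, decodeUtf16LE)

-- Hypothetical helper: external UTF-16 encoding scheme (bytes, optional BOM)
-- to internal encoding form (Text); the BOM is dropped, not kept.
decodeUtf16WithBom :: BS.ByteString -> T.Text
decodeUtf16WithBom bs
  | BS.take 2 bs == BS.pack [0xFF, 0xFE] = decodeUtf16LE (BS.drop 2 bs)  -- little-endian BOM
  | BS.take 2 bs == BS.pack [0xFE, 0xFF] = decodeUtf16BE (BS.drop 2 bs)  -- big-endian BOM
  | otherwise                            = decodeUtf16BE bs              -- no BOM: big-endian by default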
> A UTF-8 stream limited to 7-bit ASCII simply is that ASCII stream.
Which is a large portion of the raison d'ĂȘtre for UTF-8.
> When not limited to ASCII, at least it avoids zero bytes and other potential problems. UTF-16 will, among other things, be full of NULLs.
Yes, and so what? So will a file filled with image data, video clips, or simply a list of raw integers dumped to a file (not formatted as strings). I know, many old utility programs choke on NULL bytes, but that's not Unicode's fault. Further, NULL (as a character) is a perfectly valid character code. Always was.
> I can understand UCS-2 looking attractive when it looked like a fixed-length encoding, but that no longer applies.
>> So it is not surprising that most people involved do not consider UTF-16 a bad idea. The extra complexity is minimal, and furthermore surfaces rarely.
> But it needs to be there. It will introduce larger programs, more bugs
True. But implementing normalisation, or case mapping for that matter, is non-trivial too. In practice, the additional complexity with UTF-16 seems small.
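The extra step UTF-16 adds over a fixed-width form is small enough to show in full; a sketch of the surrogate-pair arithmetic (the function name is invented for illustration):

import Data.Bits (shiftL, (.&.))
import Data.Char (chr)
import Data.Word (Word16)

-- Combine a high and a low surrogate into the code point they encode.
fromSurrogatePair :: Word16 -> Word16 -> Char
fromSurrogatePair hi lo =
  chr (0x10000
       + (fromIntegral (hi .&. 0x3FF) `shiftL` 10)
       + fromIntegral (lo .&. 0x3FF))

-- e.g. fromSurrogatePair 0xD834 0xDD1E == '\x1D11E' (MUSICAL SYMBOL G CLEF)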
> , lower efficiency.
Debatable.
>> BMP characters are still (relatively) easy to process, and it saves memory space and cache misses when large amounts of text data are processed (e.g. databases).
> I couldn't find anything about the relative efficiencies of UTF-8 and UTF-16 for various languages. Do you have any pointers? From a Scandinavian POV (using ASCII plus a handful of extra characters), UTF-8 should be a big win, but I'm sure there are counterexamples.
So, how big is our personal hard disk now? 3 GiB? 10 GiB? How many images, mp3 files and video clips do you have? (I'm sorry, but your argument here is getting old and stale. Very few worry about that aspect anymore, except when it comes to databases stored in RAM, and to UTF-16 vs. UTF-32, where UTF-32 is guaranteed to be wasteful.)

Kind regards
/kent k
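Since the size question keeps coming up, it can be measured rather than argued about; a sketch, assuming the text and bytestring packages, with sample strings chosen only for illustration:

import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8, encodeUtf16LE, encodeUtf32LE)

-- Byte counts of the same text under the three encoding forms.
sizes :: T.Text -> (Int, Int, Int)
sizes t = ( BS.length (encodeUtf8 t)     -- 1 byte/char for ASCII, 3 for most BMP CJK
          , BS.length (encodeUtf16LE t)  -- 2 bytes/char for the whole BMP
          , BS.length (encodeUtf32LE t)  -- 4 bytes/char, always
          )

main :: IO ()
main = do
  print (sizes (T.pack "blÄbÊrsyltetÞy pÄ brÞdskiva"))  -- mostly ASCII: UTF-8 smallest
  print (sizes (T.pack "æ—„æœŹèȘžăźăƒ†ă‚­ă‚čト"))               -- BMP CJK: UTF-16 smallest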

"Kent Karlsson"
>> You have endianness issues, and you need to explicitly type text files or insert BOMs.
> You have to distinguish between the encoding form (what you use internally) and encoding scheme (externally).
Good point, of course. Most of the arguments apply to the external encoding scheme, but I suppose it wasn't clear which of them we were discussing.
> But as I said: they will not go away now; they are too firmly established.
Yep. But it appears that the "right" choice for the external encoding scheme would be UTF-8.
>> When not limited to ASCII, at least it avoids zero bytes and other potential problems. UTF-16 will, among other things, be full of NULLs.
> Yes, and so what?
So, I can use it for file names, in regular expressions, and in legacy applications that expect textual data. That may be worthless to you, but it isn't to me.
> So will a file filled with image data, video clips, or simply a list of raw integers dumped to a file (not formatted as strings).
But none of these pretend to be text!
> True. But implementing normalisation, or case mapping for that matter, is non-trivial too. In practice, the additional complexity with UTF-16 seems small.
All right, but if there are no real advantages, why bother?
>> I couldn't find anything about the relative efficiencies of UTF-8 and UTF-16 for various languages.
> So, how big is our personal hard disk now? 3 GiB? 10 GiB? How many images, mp3 files and video clips do you have? (I'm sorry, but your argument here is getting old and stale.
Don't be sorry. I'm just looking for a good argument in favor of UTF-16 instead of UTF-8, and size was the only possibility I could think of offhand. (And apparently, the Japanese are unhappy with the 50% increase of UTF-8's three-byte encoding over UTF-16's two-byte one.)

You could run the same argument against UTF-16 vs. UTF-32 as the internal encoding form: memory and memory bandwidth are getting cheap these days too, although memory is still a more expensive resource than disk. But (I assume) the internal encoding form shouldn't matter as much, as it would be hidden from everybody but the Unicode library implementor. It boils down to performance, which can be measured.

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants
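One way to measure it, sketched with the criterion package (the package choice and the sample text are assumptions of this illustration, not anything from the thread):

import Criterion.Main (bench, bgroup, defaultMain, nf)
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8, encodeUtf16LE)

-- Sample text chosen only for illustration; substitute a real corpus.
sample :: T.Text
sample = T.replicate 10000 (T.pack "blÄbÊrsyltetÞy pÄ brÞdskiva ")

main :: IO ()
main = defaultMain
  [ bgroup "encode"
      [ bench "utf8"  (nf encodeUtf8    sample)
      , bench "utf16" (nf encodeUtf16LE sample)
      ]
  ]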