> -----Original Message-----
> From: Ketil Malde [mailto:ketil@ii.uib.no]
...
> > But as I said: they will not go away now, they are too
> firmly established.
>
> Yep. But it appears that the "right" choice for external encoding
> scheme would be UTF-8.
You're free to use any one of UTF-8, UTF-16 (BE/LE), or UTF-32 (BE/LE), and
you should be prepared to receive any of those encodings from anyone else.
In some cases, such as e-mail, only UTF-8 can be used for Unicode (unless
the text is wrapped in a transfer encoding such as base64).
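To make that concrete, here is a minimal Java sketch (class name and sample
string are mine, purely illustrative; it assumes the runtime provides a
UTF-32 charset, which modern JREs do) showing the same text round-tripped
through the different encoding schemes:

    import java.io.UnsupportedEncodingException;

    public class EncodingRoundTrip {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String text = "Grüße, 世界";   // any Unicode text
            // The same characters in three interchange encodings.
            for (String cs : new String[] { "UTF-8", "UTF-16BE", "UTF-32LE" }) {
                byte[] bytes = text.getBytes(cs);
                // Decoding with the *same* charset restores the text;
                // decoding with a different one silently does not.
                System.out.println(cs + ": " + bytes.length + " bytes, round-trip ok: "
                        + text.equals(new String(bytes, cs)));
            }
        }
    }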
> >> When not limited to ASCII, at least it avoids zero bytes and other
> >> potential problems. UTF-16 will among other things, be full of
> >> NULLs.
>
> > Yes, and so what?
>
> So, I can use it for file names,
Millions of people do already, including me. Most of them don't even know
about it. (The file system must support it, of course, but at least two
commonly used file systems use UTF-16 for *all* [long] file names: NTFS and
HFS+. Another file system, UFS, uses UTF-8, with the names in normalization
form D(!), for all file names. (If the *standard* C file API is used, some
kind of conversion is triggered.) Those file systems got it right; many
other file systems are at a loss even at that simple level of I18N,
rendering non-pure-ASCII file names essentially useless, or at least
unreliable.)
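As a small illustration (in Java; the file name is an arbitrary example of
mine, and how faithfully this works on a given platform depends on the file
system and the locale/encoding settings of the runtime and OS):

    import java.io.File;
    import java.io.IOException;

    public class UnicodeFileName {
        public static void main(String[] args) throws IOException {
            // A non-pure-ASCII file name; the runtime and OS convert it to
            // whatever the file system actually stores (UTF-16 on NTFS/HFS+).
            File f = new File("exempel-åäö.txt");
            System.out.println("created: " + f.createNewFile());
            // List the directory to see the name come back intact.
            String[] names = new File(".").list();
            if (names != null) {
                for (String name : names) {
                    System.out.println(name);
                }
            }
        }
    }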
> in regular expressions,
If the system interpreting the RE is UTF-16-enabled, yes, of course.
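Java's regex package is one such UTF-16-enabled engine; a minimal sketch
(the sample text is mine):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class UnicodeRegex {
        public static void main(String[] args) {
            // java.util.regex works directly on Java's UTF-16 strings,
            // so non-ASCII letters match \p{L} like any other letter.
            Pattern word = Pattern.compile("\\p{L}+");
            Matcher m = word.matcher("naïve café 東京");
            while (m.find()) {
                System.out.println(m.group());   // naïve, café, 東京
            }
        }
    }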
> and in
> whatever legacy
No: in modern systems. One of the side effects of the popularity of XML is
that support for both UTF-8 and UTF-16 (also as external encodings) is
growing...
B.t.w., Java source code can be in UTF-8 or in UTF-16, as well as in legacy
encodings. Unfortunately the compiler has to be told the encoding via a
command-line parameter; having the source files declare their own encoding
would be much better (compare XML).
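For example (the file name is hypothetical; -encoding is the standard javac
option for this):

    javac -encoding UTF-8 MyProgram.java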
> applications that expect textual data.
> > So will a file filled with image data, video clips, or plainly a
> > list of raw integers dumped to file (not formatted as strings).
>
> But none of these pretend to be text!
How is that relevant? If you're going to do anything "higher-level" with
text, you have to know the encoding, otherwise you'll get lots of more or
less hidden bugs. Have you ever had any experience with any of the legacy
"multibyte" encodings used for Chinese/Japanese/etc.? In many of them, a
byte that looks like an ASCII letter need not be one at all; it may just be
the second byte of the representation of a non-ASCII character. If you
assume every "A" byte is an "A" (and interpret it in some special way, say
as (part of) a command name), you're in trouble! Often hard-to-find trouble.
No one who argues that one can take text in any "ASCII extension", look only
at the (apparent) ASCII, and treat everything else as some arbitrary
extension that never affects the processing, seems to be aware of the
details of those encodings.
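A concrete instance of that trap, sketched in Java (the class name is mine;
the byte values are the standard Shift_JIS encoding of U+8868, the
well-known "0x5C problem"):

    import java.io.UnsupportedEncodingException;

    public class FalseAscii {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // U+8868 (a common CJK character) is encoded in Shift_JIS as the
            // byte pair 0x95 0x5C; the second byte equals ASCII '\' (0x5C).
            byte[] bytes = "\u8868".getBytes("Shift_JIS");
            for (byte b : bytes) {
                if (b == '\\') {
                    // A byte-oriented scanner "finds" a backslash that does
                    // not exist in the decoded text.
                    System.out.println("false ASCII hit: 0x" + Integer.toHexString(b & 0xFF));
                }
            }
            // Decoding with the right charset yields one character, no backslash.
            System.out.println(new String(bytes, "Shift_JIS").indexOf('\\'));   // prints -1
        }
    }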
B.t.w., video clips (and images) can and do carry Unicode (UTF-16?) text as
components (e.g. subtitles).
> > True. But implementing normalisation, or case mapping for
> that matter,
> > is non-trivial too. In practice, the additional complexity with
> > UTF-16 seems small.
>
> All right, but if there are no real advantages, why bother?
Efficiency (and backwards compatibility) is claimed as an advantage by
people who work much more "in the trenches" with this than I do, and I have
no quarrel with that.
Kind regards
/kent k