
Marcin 'Qrczak' Kowalczyk wrote:
When I switch my environment to UTF-8, which may happen in a few years, I will convert filenames to UTF-8 and set up mount options to translate vfat filenames to/from UTF-8 instead of to ISO-8859-2.
But what about files which were created by other people who don't use UTF-8?
All people sharing a filesystem should use the same encoding.
Again, this is just "hand waving" the issues away.
BTW, when ftping files between Windows and Unix, a good ftp client should convert filenames to keep the same characters rather than bytes, so CP-1250-encoded names don't come out as garbage in the encoding used on Unix, which is definitely different (ISO-8859-2 or UTF-8), or vice versa.
Which is fine if the FTP client can figure out which encoding is used on the remote end. In practice, you have to tell it, i.e. have a list of which servers (or even which directories on which servers) use which encoding.
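To make that concrete, the sort of table such a client ends up carrying looks something like this (a sketch in Haskell; the host names are hypothetical):

    -- Hypothetical per-server filename-encoding table. The FTP protocol
    -- gives the client no way to discover this, so it has to be configured.
    serverEncoding :: [(String, String)]
    serverEncoding =
      [ ("ftp.example.pl",  "ISO-8859-2")
      , ("ftp.example.jp",  "ISO-2022-JP")
      , ("ftp.example.com", "UTF-8")
      ]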
I expect good programs to understand that and display them correctly no matter what technique they are using for the display.
When it comes to display, you have to deal with encoding issues one way or another. But not all programs deal with display.
So you advocate using multiple encodings internally. This is in general more complicated than what I advocate: using only Unicode internally, limiting other encodings to the I/O boundary.
How do you draw that conclusion from what I wrote here? There are cases where it's advantageous to use multiple encodings, but I wasn't suggesting that in the above. What I'm suggesting in the above is to sidestep the encoding issue by keeping filenames as byte strings wherever possible.
The core OS and network server applications essentially remain encoding-agnostic.
Which is a problem when they generate an email, e.g. to send the non-empty output of a cron job, or to report unauthorized use of sudo. If the data involved is not pure ASCII, it will often be mangled.
It only gets mangled if you feed it to a program which is making assumptions about the encoding. Non-MIME messages neither specify nor imply an encoding. MIME messages can use either "text/plain; charset=x-unknown" or application/octet-stream if they don't understand the encoding. And program-generated email notifications frequently include text with no known encoding (i.e. binary data). Or are you going to demand that anyone who tries to hack into your system only sends it UTF-8 data so that the alert messages are displayed correctly in your mail program?
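To be concrete about the MIME case: a notification mailer that doesn't know the encoding of the data it is quoting can label it honestly along these lines (a sketch; the name unknownCharsetHeaders is just illustrative):

    -- MIME headers for a text body whose encoding is unknown, built as a
    -- plain list of header lines.
    unknownCharsetHeaders :: [String]
    unknownCharsetHeaders =
      [ "MIME-Version: 1.0"
      , "Content-Type: text/plain; charset=x-unknown"
      , "Content-Transfer-Encoding: 8bit"
      ]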
It's rarely a problem in practice because filenames, command arguments, error messages, user full names etc. are usually pure ASCII. But this is slowly changing.
To the extent that non-ASCII filenames are used, I've encountered far more filenames in both Latin1 and ISO-2022 than in UTF-8. Japanese FTP sites typically use ISO-2022 for everything; even ASCII names may have "\e(B" prepended to them.
But, as I keep pointing out, filenames are byte strings, not character strings. You shouldn't be converting them to character strings unless you have to.
Processing data in their original byte encodings makes supporting multiple languages harder. Filenames which are inexpressible as character strings get in the way of clean APIs. When considering only filenames, using bytes would be sufficient, but overall it's more convenient to Unicodize them like other strings.
It also harms reliability. Depending upon the encoding, two distinct byte strings may have the same Unicode representation. E.g. if you are interfacing to a server which uses ISO-2022 for filenames, you have to get the escapes correct even when they are no-ops in terms of the string representation. If you obtain a directory listing, receive the filename "\e(Bfoo.txt", and convert it to Unicode, you get "foo.txt". If you then convert it back without the leading escape, the server is going to say "file not found".
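A toy sketch of just the "\e(B" case (this is not a real ISO-2022 codec) shows how the round trip loses the escape:

    -- "\ESC(B" designates ASCII in ISO-2022 and is a no-op at the character
    -- level, so decoding drops it; re-encoding plain ASCII emits no escapes,
    -- so the byte string the server sees has changed.
    decode :: String -> String
    decode ('\ESC':'(':'B':rest) = decode rest
    decode (c:rest)              = c : decode rest
    decode []                    = []

    encode :: String -> String
    encode = id   -- plain ASCII needs no designation escapes

    main :: IO ()
    main = print (encode (decode "\ESC(Bfoo.txt") == "\ESC(Bfoo.txt")   -- False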
The term "mismatch" implies that there have to be at least two things. If they don't match, which one is at fault? If I make a tar file available for you to download, and it contains non-UTF-8 filenames, is that my fault or yours?
Such tarballs are not portable across systems using different encodings.
Well, programs which treat filenames as byte strings to be read from argv[] and passed directly to open() won't have any problems with this. It's only a problem if you make it a problem.
If I tar a subdirectory stored on an ext2 partition, and you untar it on a vfat partition, whose fault is it that files which differ only in case are conflated?
Arguably, it's Microsoft's fault for not considering the problems caused by multiple encodings when they decided that filenames were going to be case-folded.
In any case, if a program refuses to deal with a file because it cannot convert the filename to characters, even when it doesn't have to, it's the program which is at fault.
Only if it's a low-level utility, to be used in an unfriendly environment.
A Haskell program in my world can do that too. Just set the encoding to Latin1.
But programs should handle this by default, IMHO. Filenames are, for the most part, just "tokens" to be passed around. You get a value from argv[], and pass it to open() or whatever. It doesn't need to have any meaning.
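In Haskell terms, the pattern I mean is simply this (a sketch; under the ISO-8859-1 convention the name survives byte-for-byte):

    import System.Environment (getArgs)
    import System.IO (IOMode(ReadMode), openFile, hClose)

    -- The filename is an opaque token: taken from the command line and
    -- handed straight to openFile, with no attempt to interpret it.
    main :: IO ()
    main = do
      [name] <- getArgs
      h <- openFile name ReadMode
      hClose h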
My specific point is that the Haskell98 API has a very big problem due to the assumption that the encoding is always known. Existing implementations work around the problem by assuming that the encoding is always ISO-8859-1.
The API is incomplete and needs to be enhanced. Programs written using the current API will be limited to using the locale encoding.
That just adds unnecessary failure modes.
Just as readFile is limited to text files because of line endings. What do you prefer: to provide a non-Haskell98 API for binary files, or to "fix" the current API by forcing programs to handle "\r\n" on Windows and "\n" on Unix manually?
That's a harder case. There is a good reason for auto-converting EOL, as most programs actually process file contents. Most programs don't "process" filenames; they just pass them around.
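For reference, the contrast looks like this in code; openBinaryFile comes from the hierarchical System.IO, outside Haskell98:

    import System.IO (IOMode(ReadMode), openBinaryFile, hGetContents)

    -- Haskell98 text I/O: "\r\n" is converted to "\n" on Windows.
    readText :: FilePath -> IO String
    readText = readFile

    -- Non-Haskell98 binary I/O: the bytes come back as stored in the file.
    readBinary :: FilePath -> IO String
    readBinary path = openBinaryFile path ReadMode >>= hGetContents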
If filenames were expressed as bytes in the Haskell program, how would you map them to WinAPI? If you use the current Windows code page, the set of valid characters is limited without a good reason.
Windows filenames are arguably characters rather than bytes. However, if you want to present a common API, you can just use a fixed encoding on Windows (either UTF-8 or UTF-16).
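By "fixed encoding" I mean one lossless mapping from characters to bytes that doesn't depend on the current code page, e.g. the standard UTF-8 encoding scheme, sketched below (the name encodeUtf8 is just illustrative, and no validation is done):

    import Data.Bits ((.&.), (.|.), shiftR)
    import Data.Char (ord)
    import Data.Word (Word8)

    -- Standard UTF-8 encoding of a Haskell string, independent of the
    -- Windows code page. A sketch: surrogates and other oddities are not
    -- checked for.
    encodeUtf8 :: String -> [Word8]
    encodeUtf8 = concatMap enc
      where
        enc c
          | n < 0x80    = [fromIntegral n]
          | n < 0x800   = [0xC0 .|. top 6,  cont 0]
          | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
          | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
          where
            n      = ord c
            top s  = fromIntegral (n `shiftR` s)
            cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)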
This encoding would be incompatible with most other texts seen by the program. In particular reading a filename from a file would not work without manual recoding.
We already have that problem; you can't read non-Latin1 strings from files. In some regards, the problem is worse on Windows, because of the prevalence of non-ASCII text (Windows 12xx and "smart" quotes), so using UTF-8 for file contents on Windows is even harder.
Which is a pity. ISO-2022 is brain-damaged because of enormous complexity,
Or, depending upon one's perspective, Unicode is brain-damaged because, for the sake of simplicity, it over-simplifies the situation. The over-simplification is one reason for its lack of adoption in the CJK world.
It's necessary to simplify things in order to make them usable by ordinary programs. People reject overly complicated designs even if they are in some respects more general.
ISO-2022 didn't catch on; about the only program I've seen which tries to fully support it is Emacs.
And X. Compound text is ISO-2022. For commercial X software, Motif (which uses compound text) is still the most widely-used toolkit. But, then, the fact that you haven't seen many ISO-2022 programs is probably because you're used to using programs developed by and for Westerners. In the far East, ISO-2022 is by far the most popular encoding. There, you could realistically ignore all other encodings. BTW, that's why Emacs (and XEmacs) support ISO-2022 much better than they do UTF-8. Because MuLE was written by Japanese developers.
Multi-lingual text consists of distinct sections written in distinct languages with distinct "alphabets". It isn't actually one big chunk in a single global language with a single massive alphabet.
Multi-lingual text is almost context-insensitive. You can copy a part of it into another text, even written in another language, and it will retain its alphabet - this is much harder with stateful ISO-2022.
ISO-2022 is wrong not by distinguishing alphabets but by being stateful.
Sure, the statefulness adds complexity (which is one of the reasons so many people prefer to work with UTF-8), but it has the benefit of providing distinct markers to indicate where the character set is being switched (that isn't a compelling advantage; you could reconstruct the markers if you could uniquely determine the character set for each character). OTOH, Unicode is wrong by not distinguishing character sets. This is a significant reason why it hasn't been adopted in the far East (specifically, Han unification).
and ISO-8859-x have small repertoires.
Which is one of the reasons why they are likely to persist for longer than UTF-8 "true believers" might like.
My I/O design doesn't force UTF-8, it works with ISO-8859-x as well.
But I was specifically addressing Unicode versus multiple encodings internally. The size of the Unicode "alphabet" effectively prohibits using codepoints as indices.
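To put a number on it: a dense table indexed directly by code point is fine for an 8-bit repertoire but not for all of Unicode. A sketch:

    import Data.Array (Array, listArray)

    -- 256 entries: perfectly reasonable for Latin1.
    latin1Table :: Array Int Int
    latin1Table = listArray (0x00, 0xFF) (repeat 0)

    -- The same idea indexed by Unicode code points would need 1,114,112
    -- entries, so in practice you reach for a sparse structure instead:
    -- unicodeTable = listArray (0x00, 0x10FFFF) (repeat 0)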
--
Glynn Clements