
Glynn Clements wrote:
Unless you are the sole user of a system, you have no control over what filenames may occur on it (and even if you are the sole user, you may wish to use packages which don't conform to your rules).
For these occasions you may set the encoding to ISO-8859-1. But then you can't sensibly show the filenames to the user in a GUI, nor in ncurses using the wide character API, nor can you sensibly store them in a file which must always be encoded in UTF-8 (e.g. an XML file, where you can't put raw bytes without knowing their encoding). There are two paradigms: manipulating bytes without knowing their encoding, and manipulating characters explicitly encoded in various encodings (possibly UTF-8). The world is slowly migrating from the first to the second.
There are limits to the extent to which this can be achieved. E.g. what happens if you set the encoding to UTF-8, then call getDirectoryContents for a directory which contains filenames which aren't valid UTF-8 strings?
The library fails. Don't do that: such an environment is internally inconsistent.
Call it what you like, it's a reality, and one which programs need to deal with.
The reality is that filenames are encoded differently depending on the system: sometimes ISO-8859-1, sometimes ISO-8859-2, sometimes UTF-8. We should not ignore the possibility of UTF-8-encoded filenames. In CLisp the corresponding call fails silently (undecodable filenames are simply skipped), which is bad. It should fail loudly.
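To make "fail loudly" concrete, here is a minimal sketch of a strict UTF-8 decoder in Haskell. The name decodeUtf8 is invented for illustration, and the validation is simplified: overlong forms and surrogate code points are not rejected.

    import Data.Word (Word8)
    import Data.Char (chr)

    -- Hypothetical strict decoder: returns Nothing on ill-formed input
    -- instead of silently skipping it.  (Simplified: overlong forms and
    -- surrogates are not rejected.)
    decodeUtf8 :: [Word8] -> Maybe String
    decodeUtf8 [] = Just []
    decodeUtf8 (b:bs)
      | b < 0x80              = (chr (fromIntegral b) :) <$> decodeUtf8 bs
      | b >= 0xC2 && b < 0xE0 = multi 1 (fromIntegral b - 0xC0) bs
      | b >= 0xE0 && b < 0xF0 = multi 2 (fromIntegral b - 0xE0) bs
      | b >= 0xF0 && b < 0xF5 = multi 3 (fromIntegral b - 0xF0) bs
      | otherwise             = Nothing  -- invalid leading byte: fail loudly
      where
        multi :: Int -> Int -> [Word8] -> Maybe String
        multi 0 acc rest
          | acc <= 0x10FFFF = (chr acc :) <$> decodeUtf8 rest
          | otherwise       = Nothing    -- beyond the Unicode range
        multi n acc (c:rest)
          | c >= 0x80 && c < 0xC0 =
              multi (n - 1) (acc * 64 + fromIntegral c - 0x80) rest
        multi _ _ _ = Nothing            -- truncated or malformed sequence

A directory listing obtained as raw bytes could then be decoded name by name: an ill-formed name yields Nothing instead of disappearing from the result.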
Most programs don't care whether any filenames which they deal with are valid in the locale's encoding (or any other encoding). They just receive lists (i.e. NUL-terminated arrays) of bytes and pass them directly to the OS or to libraries.
And this is why I can't switch my home environment to UTF-8 yet. Too many programs are broken: almost all terminal programs that use more than stdin and stdout in their default modes, i.e. that do line editing or work in full screen. How would you display a filename in a full-screen text editor such that it works in a UTF-8 environment?
If the assumed encoding is ISO-8859-*, this program will work regardless of the filenames which it is passed or the contents of the file (modulo the EOL translation on Windows). OTOH, if it were to use UTF-8 (e.g. because that was the locale's encoding), it wouldn't work correctly if either filename or the file's contents weren't valid UTF-8.
A program is not supposed to encounter filenames which are not representable in the locale's encoding. In your setting it's impossible to display a filename in any way other than writing its raw bytes to stdout.
More accurately, it specifies which encoding to assume when you *need* to know the encoding (i.e. for ctype.h etc.) and can't obtain that information from a more reliable source.
In the case of filenames there is no more reliable source.
My central point is that the existing API forces the encoding to be an issue when it shouldn't be.
It is an unavoidable issue because not every interface in a given computer system uses the same encoding. Gtk+ uses UTF-8; you must convert text to UTF-8 in order to display it, and in order to convert you must know its encoding.
Well, to an extent it is an implementation issue. Historically, curses never cared about encodings. A character is a byte, you draw bytes on the screen, curses sends them directly to the terminal.
This is the old API. But the newer ncurses API is prepared even for combining accents: a character is coded as a sequence of wchar_t values, where all but the first are combining characters.
Furthermore, the curses model relies upon monospaced fonts, and falls down once you encounter CJK text (where a "monospaced" font means one whose glyphs are an integer multiple of the cell size, not necessarily a single cell).
It doesn't fall down. Characters may span several columns: there is wcwidth(), and the X/Open curses specification says how it should behave for wide CJK characters. I haven't tested it, but I believe ncurses supports them.
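For illustration, wcwidth(3) can be reached from Haskell through the FFI. This sketch assumes a platform where wchar_t holds a Unicode code point (e.g. glibc) and that the locale has already been set up with setlocale:

    {-# LANGUAGE ForeignFunctionInterface #-}
    import Foreign.C.Types (CInt, CWchar)
    import Data.Char (ord)

    -- wcwidth reports how many terminal columns a character occupies:
    -- 1 for most characters, 2 for wide CJK characters, 0 for combining
    -- marks, -1 for non-printable characters.  The result depends on the
    -- current locale, hence the IO import.
    foreign import ccall unsafe "wchar.h wcwidth"
      c_wcwidth :: CWchar -> IO CInt

    charColumns :: Char -> IO Int
    charColumns c = fromIntegral <$> c_wcwidth (fromIntegral (ord c))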
Extending something like curses to handle encoding issues is far from trivial; which is probably why it hasn't been finished yet.
It's almost finished. The API specification was ready in 1997. It works in ncurses modulo unfixed bugs. But programs can't use it unless they use Unicode internally.
Although, if you're going to have implicit String -> [Word8] converters, there's no reason why you can't do the reverse and have isAlpha :: Word8 -> IO Bool. Though, like ctype.h, this will only work for single-byte encodings.
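A minimal sketch of what such a byte-level classifier could look like, binding directly to ctype.h. The name isAlphaByte is invented; the IO result reflects the fact that the answer depends on the current locale:

    {-# LANGUAGE ForeignFunctionInterface #-}
    import Foreign.C.Types (CInt)
    import Data.Word (Word8)

    -- isalpha from ctype.h classifies a single byte according to the
    -- current locale, which is global mutable state -- hence IO.
    foreign import ccall unsafe "ctype.h isalpha"
      c_isalpha :: CInt -> IO CInt

    isAlphaByte :: Word8 -> IO Bool
    isAlphaByte b = (/= 0) <$> c_isalpha (fromIntegral b)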
We should not ignore multibyte encodings like UTF-8, which means that Haskell should have a Unicode character type. And it's already specified in Haskell 98 that Char is such a type! What is missing is an API for manipulating binary files, and for converting between byte streams and character streams using particular text encodings.
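Both missing pieces did eventually appear: modern GHC lets a handle be switched to binary mode or to an explicit text encoding. A small sketch using today's System.IO (this thread predates that API):

    import System.IO

    -- Recode a file from ISO-8859-1 to UTF-8 by giving each handle an
    -- explicit text encoding instead of relying on the locale default.
    recodeLatin1ToUtf8 :: FilePath -> FilePath -> IO ()
    recodeLatin1ToUtf8 from to =
      withFile from ReadMode $ \hIn ->
        withFile to WriteMode $ \hOut -> do
          hSetEncoding hIn  latin1   -- bytes 0-255 map directly to Chars
          hSetEncoding hOut utf8     -- Chars are written out as UTF-8
          hGetContents hIn >>= hPutStr hOut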
A mail client is expected to respect the encoding set in headers.
A client typically needs to know the encoding in order to display the text.
This is easier to handle when the String type means Unicode.
As a counter-example, a mail *server* can do its job without paying any attention to the encodings used. It can also handle non-MIME email (which doesn't specify any encoding at all), regardless of the encoding actually used.
So it should push bytes, not characters.
This is why I said "1. API for manipulating byte sequences in I/O (without representing them in String type)".
Yes. But that API also needs to include functions such as those in the Directory and System modules.
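What such byte-level counterparts might look like, sketched here as signatures with invented names and stub bodies (the unix package later grew ByteString-based variants along these lines):

    import Data.Word (Word8)
    import System.IO (Handle, IOMode)

    -- Invented names, for illustration only: byte-level counterparts to
    -- the Directory/System functions, taking filenames as raw bytes.
    type RawFilePath = [Word8]

    getDirectoryContentsRaw :: RawFilePath -> IO [RawFilePath]
    getDirectoryContentsRaw = undefined   -- would call readdir(3) directly

    openFileRaw :: RawFilePath -> IOMode -> IO Handle
    openFileRaw = undefined               -- would call open(2) directly

    removeFileRaw :: RawFilePath -> IO ()
    removeFileRaw = undefined             -- would call unlink(2)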
If deemed really necessary, I will not fight against them.
It isn't just about reading and writing streams. Most of the Unix API (kernel, libc, and many standard libraries) is byte-oriented rather than character-oriented.
Because they are primarily used from C, which uses the older paradigm of handling text: represent it in an unspecified external encoding rather than in Unicode. OTOH, newer Windows APIs use Unicode. Haskell aims at being portable. It's easier to emulate the traditional C paradigm on top of the Unicode paradigm than vice versa, and Haskell already tries to specify that it uses Unicode internally.
2. If you assume ISO-8859-1, you can always convert back to Word8 and then re-decode as UTF-8. If you assume UTF-8, anything which is neither UTF-8 nor ASCII will fail far more severely than just getting the collation order wrong.
If I know the encoding, I should set the I/O handle to that encoding in the first place instead of reinterpreting characters which have been read using the default.
And if you don't know the encoding?
Then it's not possible to recode it to something else. But when it is possible because the encoding is known, it's easier to use a single internal encoding everywhere than to determine two encodings on each transition.
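A sketch of the round trip described in point 2 above: characters read under the ISO-8859-1 assumption preserve the original bytes exactly, so they can always be recovered and re-decoded later, whereas the opposite direction is lossy:

    import Data.Char (ord, chr)
    import Data.Word (Word8)

    -- Under ISO-8859-1, each Char is exactly one byte, so decoding is
    -- invertible and the raw bytes can always be recovered:
    latin1Decode :: [Word8] -> String
    latin1Decode = map (chr . fromIntegral)

    latin1Bytes :: String -> [Word8]
    latin1Bytes = map (fromIntegral . ord)

    -- latin1Bytes (latin1Decode bs) == bs for every bs; the recovered
    -- bytes can then be re-decoded, e.g. as UTF-8.  Reading as UTF-8 in
    -- the first place would have destroyed any ill-formed sequences.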
Agreed. But writing programs which support I18N, multi-byte encodings, wide character sets (>256 codepoints) and the like on an OS whose core API is byte-oriented involves work.
It's not that hard if you are willing to sacrifice support for every broken configuration. I did it myself, albeit without serious testing in real-world situations and without trying to interface with too many libraries.
And it can't all be hidden within a library. Some of the work falls on the application programmers, who have to deal with determining the correct encoding in each situation, converting between encodings, handling encoding and decoding failures (e.g. when you encounter a Unicode filename but the terminal only has Latin1), and so on.
Indeed.
My view is that, right now, we have the worst of both worlds, and taking a short step backwards (i.e. narrow the Char type and leave the rest alone) is a lot simpler (and more feasible) than the long journey towards real I18N.
It would bury any hope of supporting a UTF-8 environment. I've heard that RedHat tried to impose UTF-8 by default. It was mostly a failure because it was too early: too many programs are not ready for it. I guess the RedHat move helped to identify some of them. But UTF-8 will inevitably become usable in the future. It would be great if Haskell programs were in the group which can support it, instead of being abandoned for lack of Unicode support in the language they are written in.

--
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/