
Marcin 'Qrczak' Kowalczyk wrote:
But the default encoding should come from the locale instead of being ISO-8859-1.
The problem with that is that, if the locale's encoding is UTF-8, a lot of stuff is going to break (i.e. anything in ISO-8859-* which isn't limited to the 7-bit ASCII subset).
What about this transition path:
1. API for manipulating byte sequences in I/O (without representing them in String type).
Note that this needs to include all of the core I/O functions, not just reading/writing streams. E.g. FilePath is currently an alias for String, but (on Unix, at least) filenames are strings of bytes, not characters. Ditto for argv, environment variables, possibly other cases which I've overlooked.
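Such an API might look roughly like the following sketch. All the names here are purely illustrative (nothing like this exists in the standard libraries); the point is only that filenames and argv become byte sequences rather than Strings:

```haskell
import Data.Word (Word8)

-- A raw filename is a sequence of bytes, not a String of characters.
type RawFilePath = [Word8]

-- Byte-level counterparts of the core I/O functions would be needed;
-- the bodies here are stubs, since this is a sketch of a proposed API.
rawGetDirectoryContents :: RawFilePath -> IO [RawFilePath]
rawGetDirectoryContents _ = ioError (userError "sketch only")

rawGetArgs :: IO [RawFilePath]
rawGetArgs = ioError (userError "sketch only")
```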
2. API for conversion between explicitly specified encodings and byte sequences, including attaching converters to Handles. There is also a way to obtain the locale encoding.
3. The default encoding is settable from Haskell, defaults to ISO-8859-1.
Agreed.
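(For what it's worth, recent GHC versions do provide something of this shape in System.IO: a TextEncoding can be attached to a Handle explicitly, independent of the locale. A minimal sketch:

```haskell
import System.IO

-- Write and then read a file through Handles with an explicitly
-- attached UTF-8 converter, regardless of the locale's default.
main :: IO ()
main = do
  h <- openFile "enc-test.txt" WriteMode
  hSetEncoding h utf8            -- attach a converter to the Handle
  hPutStr h "z\x142oty"          -- "zloty" with U+0142
  hClose h
  h' <- openFile "enc-test.txt" ReadMode
  hSetEncoding h' utf8
  s <- hGetContents h'
  putStrLn s
  hClose h'
```

System.IO also exposes latin1, localeEncoding and mkTextEncoding for choosing other converters.)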
4. Libraries are reviewed to ensure that they work with various encoding settings.
There are limits to the extent to which this can be achieved. E.g. what happens if you set the encoding to UTF-8, then call getDirectoryContents for a directory which contains filenames which aren't valid UTF-8 strings?
5. The default encoding is settable from Haskell, defaults to the locale encoding.
I feel that the default encoding should be one whose decoder cannot fail, e.g. ISO-8859-1. You should have to explicitly request the use of the locale's encoding (analogous to calling setlocale(LC_CTYPE, "") at the start of a C program; there's a good reason why C doesn't do this without being explicitly told to).

Actually, the more I think about it, the more I think that "simple, stupid programs" probably shouldn't be using Unicode at all. I.e. Char, String, string literals, and the I/O functions in Prelude, IO etc should all be using bytes, with a distinct wide-character API available for people who want to make the (substantial) effort involved in writing (genuinely) internationalised programs.

Right now, the attempt at providing I18N "for free", by defining Char to mean Unicode, has essentially backfired, IMHO. Anything that isn't ISO-8859-1 just doesn't work for the most part, and anyone who wants to provide real I18N first has to work around the pseudo-I18N that's already there (e.g. convert Chars back into Word8s so that they can decode them into real Chars). Oh, and because bytes are being stored in Chars, the type system won't help if you neglect to decode a string, or if you decode it twice.
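The work-around described above (pulling the bytes back out of a latin-1-decoded String and re-decoding them) can be sketched with today's bytestring and text packages; the function names here are illustrative, not standard:

```haskell
import Data.Char (ord)
import Data.Word (Word8)
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- A String read under the ISO-8859-1 assumption carries one byte per
-- Char; recover those bytes, then decode them as the encoding that was
-- really meant.
stringToBytes :: String -> [Word8]
stringToBytes = map (fromIntegral . ord)

-- Note: decodeUtf8 throws an exception on invalid input.
reinterpretAsUtf8 :: String -> T.Text
reinterpretAsUtf8 = TE.decodeUtf8 . B.pack . stringToBytes
```

Note that nothing in the types stops you from applying this to a String that was already properly decoded, which is exactly the double-decoding hazard mentioned above.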
The advantage of assuming ISO-8859-* is that the decoder can't fail; every possible stream of bytes is valid.
Only under that assumption, and UTF-8 is not ISO-8859-*. When I someday change most of my files and filenames from ISO-8859-2 to UTF-8, and change the locale to match, the assumption will be wrong. I can't make that change now, because too many programs would break.
The current ISO-8859-1 assumption is also wrong. A program written in Haskell which sorts strings gives the wrong order for non-ASCII letters even now, when the strings are actually ISO-8859-2, unless the encoding is specified otherwise.
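A concrete instance of that breakage: in ISO-8859-2 the byte 0xB6 is U+015B (LATIN SMALL LETTER S WITH ACUTE), but read under the latin-1 assumption it becomes U+00B6 (PILCROW SIGN), so character classification, and hence collation, goes wrong:

```haskell
import Data.Char (isAlpha)

misread, intended :: Bool
misread  = isAlpha '\xB6'   -- False: the letter has become punctuation
intended = isAlpha '\x15B'  -- True: the character the byte actually meant
```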
1. In that situation, you can't avoid the encoding issues. It doesn't matter what the default is, because you're going to have to set the encoding anyhow.
2. If you assume ISO-8859-1, you can always convert back to Word8 then re-decode as UTF-8. If you assume UTF-8, anything which is neither UTF-8 nor ASCII will fail far more severely than just getting the collation order wrong.
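The asymmetry can be seen with the text package's decoders (the byte values below are ISO-8859-2, chosen so that they are not valid UTF-8):

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (UnicodeException)

-- 0xEA is a letter in ISO-8859-2, but as UTF-8 it announces a
-- three-byte sequence that never arrives, so it is invalid UTF-8.
latin2Bytes :: B.ByteString
latin2Bytes = B.pack [0x77, 0xEA]

-- Every byte sequence is valid latin-1, so this decoder is total:
asLatin1 :: T.Text
asLatin1 = TE.decodeLatin1 latin2Bytes

-- The UTF-8 decoder has to be able to report failure:
asUtf8 :: Either UnicodeException T.Text
asUtf8 = TE.decodeUtf8' latin2Bytes   -- a Left: decoding fails
```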
The key problem with using the locale is that you frequently encounter files which aren't in the locale's encoding, and for which the encoding can't easily be deduced.
Programs should either explicitly set the encoding for I/O on these files to ISO-8859-1, or manipulate them as binary data.
Well, my view is essentially that files should be treated as containing bytes unless you explicitly choose to decode them, at which point you have to specify the encoding.
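That view can be sketched as follows; readTextFile is an illustrative name, not an existing function:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Files are read as bytes; decoding is a separate, explicit step in
-- which the caller names the encoding.
readTextFile :: (B.ByteString -> T.Text) -> FilePath -> IO T.Text
readTextFile decode path = fmap decode (B.readFile path)
```

e.g. readTextFile TE.decodeUtf8 "notes.txt" when the caller asserts the file is UTF-8, or readTextFile TE.decodeLatin1 "legacy.txt" when any byte sequence must be accepted.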
The problem is that the API for this hasn't even been designed yet, so programs can't be written today in a way that will keep working once the default encoding changes.
Personally, I would take the C approach: redefine Char to mean a byte
(i.e. CChar), treat string literals as bytes, keep the existing type
signatures on all of the existing Haskell98 functions, and provide a
completely new wide-character API for those who wish to use it.
That gets the failed attempt at I18N out of everyone's way with a
minimum of effort and with maximum backwards compatibility for
existing code.
Given the frequency with which this issue crops up, and the associated
lack of action to date, I'd rather not have to wait until someone
finally gets around to designing the new, improved,
genuinely-I18N-ised API before we can read/write arbitrary files
without too much effort.
My main concern is that someone will get sick of waiting and make the
wrong "fix", i.e. keep the existing API but default to the locale's
encoding, so that every simple program then has to explicitly set it
back to ISO-8859-1 to get reasonable worst-case behaviour.
--
Glynn Clements