
Glynn Clements
3. The default encoding is settable from Haskell, defaults to ISO-8859-1.
Agreed.
So every Haskell program that does more than just pass raw bytes from stdin to stdout should decode the appropriate environment variables and set the encoding by itself? IMO that's too much redundancy; the RTS should actually do that.
There are limits to the extent to which this can be achieved. E.g. what happens if you set the encoding to UTF-8, then call getDirectoryContents for a directory which contains filenames which aren't valid UTF-8 strings?
Then you've _seriously_ messed up. Your terminal would produce garbage, Nautilus would break, ...
5. The default encoding is settable from Haskell, defaults to the locale encoding.
I feel that the default encoding should be one whose decoder cannot fail, e.g. ISO-8859-1. You should have to explicitly request the use of the locale's encoding (analogous to calling setlocale(LC_CTYPE, "") at the start of a C program; there's a good reason why C doesn't do this without being explicitly told to).
So that any Haskell program that doesn't call setlocale and outputs anything other than US-ASCII will produce garbage on a UTF-8 system?
Actually, the more I think about it, the more I think that "simple, stupid programs" probably shouldn't be using Unicode at all.
Care to give any examples? Everything that has been mentioned so far would break with a UTF-8 locale:
- ls (sorting would break)
- env (sorting too)
I.e. Char, String, string literals, and the I/O functions in Prelude, IO etc should all be using bytes,
I don't want the same mess as in C, where strings and raw data are the very same thing. Haskell has a nice type system and nicely defined types for binary data ([Word8]) and for strings (String), so why not use them?
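For what it's worth, the distinction the type system already offers can be sketched like this (`decodeLatin1` is a hypothetical helper for illustration, not an existing library function):

```haskell
import Data.Word (Word8)
import Data.Char (chr)

-- Raw, undecoded bytes: no pretence that they are text.
type Bytes = [Word8]

-- ISO-8859-1 decoding is total: every byte is a valid code point,
-- so this conversion can never fail.
decodeLatin1 :: Bytes -> String
decodeLatin1 = map (chr . fromIntegral)
```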
with a distinct wide-character API available for people who want to make the (substantial) effort involved in writing (genuinely) internationalised programs.
If you introduce an entirely new "i18n-only" API, then it'll surely become difficult. :-)
Anything that isn't ISO-8859-1 just doesn't work for the most part, and anyone who wants to provide real I18N first has to work around the pseudo-I18N that's already there (e.g. convert Chars back into Word8s so that they can decode them into real Chars).
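The workaround alluded to here looks roughly like this (a sketch; the bytes recovered this way would then be handed to a real UTF-8 decoder):

```haskell
import Data.Word (Word8)
import Data.Char (ord)

-- Undo the pseudo-decoding: each Char in the String actually
-- carries one raw byte, so truncating back to Word8 is lossless
-- as long as the original decoding was ISO-8859-1.
charsToBytes :: String -> [Word8]
charsToBytes = map (fromIntegral . ord)
```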
One more reason to fix the I/O functions to handle encodings and have a separate/underlying binary I/O API.
Oh, and because bytes are being stored in Chars, the type system won't help if you neglect to decode a string, or if you decode it twice.
Yes, that's the problem with the current approach, i.e. that there's no easy way to get a list of Word8s out of a handle.
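Today the closest thing is going through a foreign buffer; a sketch of how awkward that is (using System.IO.hGetBuf, which does exist, plus a hypothetical wrapper name):

```haskell
import System.IO (Handle, hGetBuf)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Array (peekArray)
import Data.Word (Word8)

-- Read up to n raw bytes from a handle, bypassing Char decoding.
hGetBytes :: Handle -> Int -> IO [Word8]
hGetBytes h n = allocaBytes n $ \buf -> do
  got <- hGetBuf h buf n        -- number of bytes actually read
  peekArray got buf
```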
The advantage of assuming ISO-8859-* is that the decoder can't fail; every possible stream of bytes is valid.
Assuming. But UTF-8 is not ISO-8859-*. When someday I change most of my files and filenames from ISO-8859-2 to UTF-8, and change the locale, the assumption will be wrong. I can't change that now, because too many programs would break.
The current ISO-8859-1 assumption is also wrong. A program written in Haskell which sorts strings would break for non-ASCII letters even now, when they are ISO-8859-2, unless told otherwise.
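To make the contrast concrete: unlike an ISO-8859-* decoder, a UTF-8 decoder has a failure mode. A minimal sketch covering only 1- and 2-byte sequences (`decodeUtf8` is illustrative, not a library function):

```haskell
import Data.Word (Word8)
import Data.Bits ((.&.), (.|.), shiftL)
import Data.Char (chr)

-- Nothing models the case ISO-8859-* decoding can never hit:
-- a byte sequence that simply isn't valid in the encoding.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just ""
decodeUtf8 (b:bs)
  | b < 0x80 = (chr (fromIntegral b) :) <$> decodeUtf8 bs
  | b .&. 0xE0 == 0xC0 = case bs of
      (c:rest) | c .&. 0xC0 == 0x80 ->
        let cp = (fromIntegral (b .&. 0x1F) `shiftL` 6)
                   .|. fromIntegral (c .&. 0x3F)
        in (chr cp :) <$> decodeUtf8 rest
      _ -> Nothing              -- truncated or malformed sequence
  | otherwise = Nothing         -- 3-/4-byte forms omitted here
```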
1. In that situation, you can't avoid the encoding issues. It doesn't matter what the default is, because you're going to have to set the encoding anyhow.
Why do you always want me to set the encoding? That should be the job of the RTS. It's OK to use a different API to get Strings instead of Word8s out of a handle, but _manually_ having to set the encoding? IIRC, Haskell is meant to be portable, and locale handling is pretty platform-dependent.
2. If you assume ISO-8859-1, you can always convert back to Word8
If I want a list of Word8s, then I should be able to get them without extracting them from a string.
then re-decode as UTF-8. If you assume UTF-8, anything which is neither UTF-8 nor ASCII will fail far more severely than just getting the collation order wrong.
If I use Strings to handle binary data, then I should expect things to break. If I want to get text and it's not in the expected encoding, then the user has messed up.
Well, my view is essentially that files should be treated as containing bytes unless you explicitly choose to decode them, at which point you have to specify the encoding.
Why do you always want me to _manually_ specify an encoding? If I want bytes, I'll use the binary I/O API (currently being discussed; see the beginning of this thread); if I want Strings (i.e. text), I'll use the current I/O API (which is pretty text-oriented anyway, see hPutStrLn, hGetLine, ...).
completely new wide-character API for those who wish to use it.
Which would make it horrendously difficult to do even basic I18N.
That gets the failed attempt at I18N out of everyone's way with a minimum of effort and with maximum backwards compatibility for existing code.
If existing code expects Strings to be just a list of bytes, it's _broken_. A String is a list of Unicode characters; [Word8] is a list of bytes.
My main concern is that someone will get sick of waiting and make the wrong "fix", i.e. keep the existing API but default to the locale's encoding,
That would be my choice and is in line with the Haskell spec. Binary I/O should have a completely different API.
so that every simple program then has to explicitly set it back to ISO-8859-1 to get reasonable worst-case behaviour.
Which would be just as bad as your "fix", which would require many programs to set the locale back to the environment setting just to get sorting, accents, etc. right.

Gabriel