
Gabriel Ebner wrote:
3. The default encoding is settable from Haskell, defaults to ISO-8859-1.
Agreed.
So every Haskell program that did more than just passing raw bytes from stdin to stdout should decode the appropriate environment variables, and set the encoding by itself?
This statement is too restrictive. Passing bytes isn't limited to stdin->stdout, and there's no reason why setting the encoding needs to be any more involved than e.g. "setLocaleEncoding". If you change it to:
So every haskell program that did more than just passing raw bytes ... should ... set the encoding by itself?
then the answer is yes.
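To make "set the encoding by itself" concrete, here's a minimal sketch, assuming an encoding API along the lines of GHC's System.IO and GHC.IO.Encoding (hSetEncoding, utf8, setLocaleEncoding); treat the specifics as illustrative:

    import System.IO
    import GHC.IO.Encoding (setLocaleEncoding)

    main :: IO ()
    main = do
      setLocaleEncoding utf8    -- default for Handles opened from here on
      hSetEncoding stdout utf8  -- stdout already exists, so set it directly
      putStrLn "déjà vu"        -- leaves the program as UTF-8 bytes

That's a line or two at startup, not a per-call burden.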
IMO that's too much redundancy; the RTS should actually do that.
The RTS doesn't know the encoding. Assuming that the data will use the locale's encoding will be wrong too often.
There are limits to the extent to which this can be achieved. E.g. what happens if you set the encoding to UTF-8, then call getDirectoryContents for a directory which contains filenames which aren't valid UTF-8 strings?
Then you _seriously_ messed up. Your terminal would produce garbage, Nautilus would break, ...
Like so many other people, you're making an argument based upon fiction (specifically, that you have a closed world where everything always uses the same encoding) then deeming anyone who is unable to maintain the fiction to be "wrong".
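The failure mode in question, sketched against GHC's Handle encodings (the file name and contents here are made up):

    {-# LANGUAGE ScopedTypeVariables #-}
    import Control.Exception (IOException, try)
    import qualified Data.ByteString as B
    import System.IO

    main :: IO ()
    main = do
      -- 0xE9 is e-acute in ISO-8859-1, but not valid UTF-8 on its own.
      B.writeFile "latin1.txt" (B.pack [0xE9])
      h <- openFile "latin1.txt" ReadMode
      hSetEncoding h utf8
      r <- try (hGetContents h >>= \s -> length s `seq` return s)
      case r of
        Left (e :: IOException) -> putStrLn ("decode failed: " ++ show e)
        Right s                 -> putStrLn ("decoded fine: " ++ show s)

The point isn't that such data is well-formed; it's that it exists, and a decoder which refuses it takes the whole operation down with it.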
5. The default encoding is settable from Haskell, defaults to the locale encoding.
I feel that the default encoding should be one whose decoder cannot fail, e.g. ISO-8859-1. You should have to explicitly request the use of the locale's encoding (analogous to calling setlocale(LC_CTYPE, "") at the start of a C program; there's a good reason why C doesn't do this without being explicitly told to).
So that any Haskell program that doesn't call setlocale and outputs anything other than US-ASCII will produce garbage on a UTF-8 system?
No. If a program just passes bytes around, everything will work so long as the inputs use the encoding which the outputs are assumed to use. And if the inputs aren't in the "correct" encoding, then you have to deal with encodings manually regardless of the default behaviour.
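A byte pipe in that sense is trivial, assuming a byte-transparent encoding such as GHC's char8 (a sketch, not a prescription):

    import System.IO

    main :: IO ()
    main = do
      hSetEncoding stdin  char8  -- char8 maps byte n to Char n and back
      hSetEncoding stdout char8
      getContents >>= putStr     -- faithful pass-through, whatever the data

No decoding happens, so no decoding can fail, and the output bytes equal the input bytes.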
Actually, the more I think about it, the more I think that "simple, stupid programs" probably shouldn't be using Unicode at all.
Care to give any examples? Everything that has been mentioned until now would break with a UTF-8 locale:
- ls (sorting would break),
- env (sorting too).
Sorting according to codepoints inevitably involves decoding. However, getting the order wrong is usually considered less problematic than failing outright.
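A tiny illustration of "wrong order, no failure", in plain Haskell:

    import Data.List (sort)

    main :: IO ()
    main = mapM_ putStrLn (sort ["zebra", "Äpfel"])
    -- prints "zebra" before "Äpfel": ord 'Ä' = 196 > ord 'z' = 122,
    -- which is the wrong collation for a German reader, but nothing crashes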
I.e. Char, String, string literals, and the I/O functions in Prelude, IO etc should all be using bytes,
I don't want the same mess as in C, where strings and raw data are the very same thing.
Tough. You already have it, and will do for the foreseeable future. Many existing APIs (including the core Unix API), protocols and file formats are defined in terms of byte strings with no encoding specified or implied.
Haskell has a nice type system and nicely defined types for binary data ([Word8]) and for strings (String), so why not use them?
I'd like to. But many of the functions which provide or accept binary data (e.g. FilePath) insist on representing it using Strings.
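A byte-level path API is perfectly expressible; the unix package's System.Posix.ByteString layer shows what that looks like (a sketch; the path is made up, and RawFilePath is just ByteString):

    import qualified Data.ByteString.Char8 as B
    import System.Posix.ByteString (RawFilePath)
    import System.Posix.Files.ByteString (fileExist)

    main :: IO ()
    main = do
      let p :: RawFilePath
          p = B.pack "/tmp/some\xFFname"  -- bytes, not valid UTF-8
      fileExist p >>= print               -- no decoding step anywhere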
with a distinct wide-character API available for people who want to make the (substantial) effort involved in writing (genuinely) internationalised programs.
If you introduce an entirely new "i18n-only" API, then it'll surely become difficult. :-)
I18N is inherently difficult. Lots of textual data exists in lots of different encodings, and the encoding is frequently unspecified. It would be easier if we had a closed world where only one encoding was ever used. But we don't, and pretending that we do doesn't make it so.
Anything that isn't ISO-8859-1 just doesn't work for the most part, and anyone who wants to provide real I18N first has to work around the pseudo-I18N that's already there (e.g. convert Chars back into Word8s so that they can decode them into real Chars).
One more reason to fix the I/O functions to handle encodings and have a separate/underlying binary I/O API.
The problem is that we also need to fix them to handle *no encoding*. Also, binary data and text aren't disjoint. Everything is binary; some of it is *also* text.
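That suggests the layering: read bytes unconditionally, and make decoding a separate, failable step. A sketch using the bytestring and text libraries (the file name is made up):

    import qualified Data.ByteString as B
    import qualified Data.Text.Encoding as TE

    main :: IO ()
    main = do
      bytes <- B.readFile "input.dat"  -- reading bytes always succeeds
      case TE.decodeUtf8' bytes of     -- decoding is explicit and can fail
        Left err  -> putStrLn ("not UTF-8 text: " ++ show err)
        Right txt -> print txt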
Oh, and because bytes are being stored in Chars, the type system won't help if you neglect to decode a string, or if you decode it twice.
Yes, that's the problem with the current approach, i.e. that there's no easy way to get a list of Word8s out of a handle.
Or out of getDirectoryContents, getArgs, getEnv etc. Or to pass a list of Word8s to a handle, or to openFile, getEnv etc.
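If bytes and text were distinct types, the compiler would catch both mistakes. A sketch with made-up names (Bytes, Text' and decodeLatin1 are all hypothetical):

    import Data.Word (Word8)

    newtype Bytes = Bytes [Word8]  -- raw, undecoded data
    newtype Text' = Text' String   -- genuinely decoded characters

    -- Decoding consumes Bytes and produces Text', so decoding twice, or
    -- treating undecoded bytes as text, simply doesn't typecheck.
    decodeLatin1 :: Bytes -> Text'
    decodeLatin1 (Bytes ws) = Text' (map (toEnum . fromIntegral) ws)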
The current ISO-8859-1 assumption is also wrong. A program written in Haskell which sorts strings would already break for non-ASCII letters that are actually ISO-8859-2, unless the encoding is specified otherwise.
1. In that situation, you can't avoid the encoding issues. It doesn't matter what the default is, because you're going to have to set the encoding anyhow.
Why do you always want me to set the encoding? That should be the job of the RTS.
Because you might know the encoding, and the RTS doesn't. The locale is a fallback mechanism, for the situation where you *need* an encoding but one hasn't been specified by other means.
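In code, the fallback shape would be something like this (pickEncoding is a made-up name; getLocaleEncoding is from GHC.IO.Encoding):

    import System.IO (TextEncoding)
    import GHC.IO.Encoding (getLocaleEncoding)

    -- Prefer an encoding that is actually known (protocol header, file
    -- metadata, a caller's explicit choice); the locale comes last.
    pickEncoding :: Maybe TextEncoding -> IO TextEncoding
    pickEncoding (Just enc) = return enc
    pickEncoding Nothing    = getLocaleEncoding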
2. If you assume ISO-8859-1, you can always convert back to Word8
If I want a list of Word8s, then I should be able to get them without extracting them from a string.
The point is that, currently, you can't. Nothing in the core Haskell98 API actually uses Word8, it all uses Char/String.
then re-decode as UTF-8. If you assume UTF-8, anything which is neither UTF-8 nor ASCII will fail far more severely than just getting the collation order wrong.
If I use Strings to handle binary data, then I should expect things to break. If I want to get text and it's not in the expected encoding, then the user has messed up.
Or maybe the expectation is incorrect.
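The convert-back-to-Word8 escape hatch from point 2, as a sketch (stringToBytes is a made-up name):

    import Data.Char (ord)
    import Data.Word (Word8)

    -- Undo a latin1-style read: every Char produced that way is < 256,
    -- so the original bytes are recovered exactly and can be re-decoded
    -- under whatever encoding turns out to be the right one.
    stringToBytes :: String -> [Word8]
    stringToBytes = map (fromIntegral . ord)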
Well, my view is essentially that files should be treated as containing bytes unless you explicitly choose to decode them, at which point you have to specify the encoding.
Why do you always want me to _manually_ specify an encoding?
Because we don't have an "oracle" which will magically determine the encoding for you.
If I want bytes, I'll use the binary I/O API (currently being discussed; see the beginning of this thread); if I want Strings, i.e. text, I'll use the current I/O API (which is pretty text-oriented anyway; see hPutStrLn, hGetLine, ...).
If you want text, well, tough; what comes out of most system calls and core library functions (not just read()) is bytes. There isn't any magic wand which will turn them into characters without knowing the encoding.
completely new wide-character API for those who wish to use it.
Which would make it horrendously difficult to do even basic I18N.
Why?
That gets the failed attempt at I18N out of everyone's way with a minimum of effort and with maximum backwards compatibility for existing code.
If existing code expects Strings to be just a list of bytes, it's _broken_.
I know. That's what I'm saying. The problem is that the broken "code" is the Haskell98 API.
Strings are lists of Unicode characters; [Word8] is a list of bytes.
And what comes out of (and goes into) most core library functions is the latter.
--
Glynn Clements