
Glynn Clements
3. The default encoding is settable from Haskell, defaults to ISO-8859-1.
Agreed.
So every Haskell program that does more than just pass raw bytes from stdin to stdout should decode the appropriate environment variables and set the encoding by itself? IMO that's too much redundancy; the RTS should actually do that.
There are limits to the extent to which this can be achieved. E.g. what happens if you set the encoding to UTF-8, then call getDirectoryContents for a directory which contains filenames which aren't valid UTF-8 strings?
Then you've _seriously_ messed up. Your terminal would produce garbage, Nautilus would break, ...
5. The default encoding is settable from Haskell, defaults to the locale encoding.
I feel that the default encoding should be one whose decoder cannot fail, e.g. ISO-8859-1. You should have to explicitly request the use of the locale's encoding (analogous to calling setlocale(LC_CTYPE, "") at the start of a C program; there's a good reason why C doesn't do this without being explicitly told to).
So that any Haskell program that doesn't call setlocale and outputs anything other than US-ASCII will produce garbage on a UTF-8 system?
Actually, the more I think about it, the more I think that "simple, stupid programs" probably shouldn't be using Unicode at all.
Care to give any examples? Everything that has been mentioned so far would break with a UTF-8 locale:
- ls (sorting would break)
- env (sorting too)
I.e. Char, String, string literals, and the I/O functions in Prelude, IO etc should all be using bytes,
I don't want the same mess as in C, where strings and raw data are the very same thing. Haskell has a nice type system and nicely defined types for binary data ([Word8]) and for strings (String), so why not use them?
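For what it's worth, the distinction the type system already offers can be sketched like this (`decodeLatin1` is a hypothetical helper for illustration, not an existing library function):

```haskell
import Data.Word (Word8)
import Data.Char (chr)

-- Raw, undecoded bytes: no pretence that they are text.
type Bytes = [Word8]

-- ISO-8859-1 decoding is total: every byte is a valid code point,
-- so this conversion can never fail.
decodeLatin1 :: Bytes -> String
decodeLatin1 = map (chr . fromIntegral)
```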
with a distinct wide-character API available for people who want to make the (substantial) effort involved in writing (genuinely) internationalised programs.
If you introduce an entirely new "i18n-only" API, then it'll surely become difficult. :-)
Anything that isn't ISO-8859-1 just doesn't work for the most part, and anyone who wants to provide real I18N first has to work around the pseudo-I18N that's already there (e.g. convert Chars back into Word8s so that they can decode them into real Chars).
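The workaround alluded to here looks roughly like this (a sketch; the bytes recovered this way would then be handed to a real UTF-8 decoder):

```haskell
import Data.Word (Word8)
import Data.Char (ord)

-- Undo the pseudo-decoding: each Char in the String actually
-- carries one raw byte, so truncating back to Word8 is lossless
-- as long as the original decoding was ISO-8859-1.
charsToBytes :: String -> [Word8]
charsToBytes = map (fromIntegral . ord)
```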
One more reason to fix the I/O functions to handle encodings and have a separate/underlying binary I/O API.
Oh, and because bytes are being stored in Chars, the type system won't help if you neglect to decode a string, or if you decode it twice.
Yes, that's the problem with the current approach, i.e. that there's no easy way to get a list of Word8s out of a handle.
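Today the closest thing is going through a foreign buffer; a sketch of how awkward that is (using System.IO.hGetBuf, which does exist, plus a hypothetical wrapper name):

```haskell
import System.IO (Handle, hGetBuf)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Array (peekArray)
import Data.Word (Word8)

-- Read up to n raw bytes from a handle, bypassing Char decoding.
hGetBytes :: Handle -> Int -> IO [Word8]
hGetBytes h n = allocaBytes n $ \buf -> do
  got <- hGetBuf h buf n        -- number of bytes actually read
  peekArray got buf
```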
The advantage of assuming ISO-8859-* is that the decoder can't fail; every possible stream of bytes is valid.
Assuming. But UTF-8 is not ISO-8859-*. When someday I change most of my files and filenames from ISO-8859-2 to UTF-8, and change the locale, the assumption will be wrong. I can't change that now, because too many programs would break.
The current ISO-8859-1 assumption is also wrong. A program written in Haskell which sorts strings would break for non-ASCII letters even now, when they are ISO-8859-2, unless told otherwise.
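To make the contrast concrete: unlike an ISO-8859-* decoder, a UTF-8 decoder has a failure mode. A minimal sketch covering only 1- and 2-byte sequences (`decodeUtf8` is illustrative, not a library function):

```haskell
import Data.Word (Word8)
import Data.Bits ((.&.), (.|.), shiftL)
import Data.Char (chr)

-- Nothing models the case ISO-8859-* decoding can never hit:
-- a byte sequence that simply isn't valid in the encoding.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just ""
decodeUtf8 (b:bs)
  | b < 0x80 = (chr (fromIntegral b) :) <$> decodeUtf8 bs
  | b .&. 0xE0 == 0xC0 = case bs of
      (c:rest) | c .&. 0xC0 == 0x80 ->
        let cp = (fromIntegral (b .&. 0x1F) `shiftL` 6)
                   .|. fromIntegral (c .&. 0x3F)
        in (chr cp :) <$> decodeUtf8 rest
      _ -> Nothing              -- truncated or malformed sequence
  | otherwise = Nothing         -- 3-/4-byte forms omitted here
```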
1. In that situation, you can't avoid the encoding issues. It doesn't matter what the default is, because you're going to have to set the encoding anyhow.
Why do you always want me to set the encoding? That should be the job of the RTS. It's OK to use a different API to get Strings instead of Word8s out of a handle, but _manually_ having to set the encoding? IIRC, Haskell is meant to be portable, and locale handling is pretty platform-dependent.
2. If you assume ISO-8859-1, you can always convert back to Word8
If I want a list of Word8s, then I should be able to get them without extracting them from a string.
then re-decode as UTF-8. If you assume UTF-8, anything which is neither UTF-8 nor ASCII will fail far more severely than just getting the collation order wrong.
If I use Strings to handle binary data, then I should expect things to break. If I want to get text and it's not in the expected encoding, then the user has messed up.
Well, my view is essentially that files should be treated as containing bytes unless you explicitly choose to decode them, at which point you have to specify the encoding.
Why do you always want me to _manually_ specify an encoding? If I want bytes, I'll use the binary I/O API (currently being discussed; see the beginning of this thread); if I want Strings (i.e. text), I'll use the current I/O API (which is pretty text-oriented anyway, see hPutStrLn, hGetLine, ...).
completely new wide-character API for those who wish to use it.
Which would make it horrendously difficult to do even basic I18N.
That gets the failed attempt at I18N out of everyone's way with a minimum of effort and with maximum backwards compatibility for existing code.
If existing code expects Strings to be just a list of bytes, it's _broken_. A String is a list of Unicode characters; [Word8] is a list of bytes.
My main concern is that someone will get sick of waiting and make the wrong "fix", i.e. keep the existing API but default to the locale's encoding,
That would be my choice and is in line with the Haskell spec. Binary I/O should have a completely different API.
so that every simple program then has to explicitly set it back to ISO-8859-1 to get reasonable worst-case behaviour.
Which would be just as bad as your "fix", which would require many programs to set the locale back to the environment setting just to get sorting, accents, etc. right.

Gabriel