
Gabriel Ebner wrote:
3. The default encoding is settable from Haskell, defaults to ISO-8859-1.
Agreed.
So every Haskell program that did more than just passing raw bytes from stdin to stdout should decode the appropriate environment variables, and set the encoding by itself?
This statement is too restrictive. Passing bytes isn't limited to stdin->stdout, and there's no reason why setting the encoding needs to be any more involved than e.g. "setLocaleEncoding". If you change it to:
So every haskell program that did more than just passing raw bytes ... should ... set the encoding by itself?
then the answer is yes.
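To make "set the encoding by itself" concrete, here's a minimal sketch, assuming an encoding API along the lines of GHC's System.IO and GHC.IO.Encoding (hSetEncoding, utf8, setLocaleEncoding); treat the specifics as illustrative:

    import System.IO
    import GHC.IO.Encoding (setLocaleEncoding)

    main :: IO ()
    main = do
      setLocaleEncoding utf8    -- default for Handles opened from here on
      hSetEncoding stdout utf8  -- stdout already exists, so set it directly
      putStrLn "déjà vu"        -- leaves the program as UTF-8 bytes

That's a line or two at startup, not a per-call burden.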
IMO that's too much redundancy; the RTS should actually do that.
The RTS doesn't know the encoding. Assuming that the data will use the locale's encoding will be wrong too often.
There are limits to the extent to which this can be achieved. E.g. what happens if you set the encoding to UTF-8, then call getDirectoryContents for a directory which contains filenames which aren't valid UTF-8 strings?
Then you _seriously_ messed up. Your terminal would produce garbage, Nautilus would break, ...
Like so many other people, you're making an argument based upon fiction (specifically, that you have a closed world where everything always uses the same encoding) then deeming anyone who is unable to maintain the fiction to be "wrong".
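The failure mode in question, sketched against GHC's Handle encodings (the file name and contents here are made up):

    {-# LANGUAGE ScopedTypeVariables #-}
    import Control.Exception (IOException, try)
    import qualified Data.ByteString as B
    import System.IO

    main :: IO ()
    main = do
      -- 0xE9 is e-acute in ISO-8859-1, but not valid UTF-8 on its own.
      B.writeFile "latin1.txt" (B.pack [0xE9])
      h <- openFile "latin1.txt" ReadMode
      hSetEncoding h utf8
      r <- try (hGetContents h >>= \s -> length s `seq` return s)
      case r of
        Left (e :: IOException) -> putStrLn ("decode failed: " ++ show e)
        Right s                 -> putStrLn ("decoded fine: " ++ show s)

The point isn't that such data is well-formed; it's that it exists, and a decoder which refuses it takes the whole operation down with it.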
5. The default encoding is settable from Haskell, defaults to the locale encoding.
I feel that the default encoding should be one whose decoder cannot fail, e.g. ISO-8859-1. You should have to explicitly request the use of the locale's encoding (analogous to calling setlocale(LC_CTYPE, "") at the start of a C program; there's a good reason why C doesn't do this without being explicitly told to).
So that any Haskell program that doesn't call setlocale and outputs anything other than US-ASCII will produce garbage on a UTF-8 system?
No. If a program just passes bytes around, everything will work so long as the inputs use the encoding which the outputs are assumed to use. And if the inputs aren't in the "correct" encoding, then you have to deal with encodings manually regardless of the default behaviour.
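A byte pipe in that sense is trivial, assuming a byte-transparent encoding such as GHC's char8 (a sketch, not a prescription):

    import System.IO

    main :: IO ()
    main = do
      hSetEncoding stdin  char8  -- char8 maps byte n to Char n and back
      hSetEncoding stdout char8
      getContents >>= putStr     -- faithful pass-through, whatever the data

No decoding happens, so no decoding can fail, and the output bytes equal the input bytes.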
Actually, the more I think about it, the more I think that "simple, stupid programs" probably shouldn't be using Unicode at all.
Care to give any examples? Everything that has been mentioned until now would break with a UTF-8 locale:
- ls (sorting would break),
- env (sorting too).
Sorting according to codepoints inevitably involves decoding. However, getting the order wrong is usually considered less problematic than failing outright.
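A tiny illustration of "wrong order, no failure", in plain Haskell:

    import Data.List (sort)

    main :: IO ()
    main = mapM_ putStrLn (sort ["zebra", "Äpfel"])
    -- prints "zebra" before "Äpfel": ord 'Ä' = 196 > ord 'z' = 122,
    -- which is the wrong collation for a German reader, but nothing crashes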
I.e. Char, String, string literals, and the I/O functions in Prelude, IO etc should all be using bytes,
I don't want the same mess as in C, where strings and raw data are the very same thing.
Tough. You already have it, and will do for the foreseeable future. Many existing APIs (including the core Unix API), protocols and file formats are defined in terms of byte strings with no encoding specified or implied.
Haskell has a nice type system and nicely defined types for binary data ([Word8]) and for strings (String), so why not use them?
I'd like to. But many of the functions which provide or accept binary data (e.g. FilePath) insist on representing it using Strings.
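A byte-level path API is perfectly expressible; the unix package's System.Posix.ByteString layer shows what that looks like (a sketch; the path is made up, and RawFilePath is just ByteString):

    import qualified Data.ByteString.Char8 as B
    import System.Posix.ByteString (RawFilePath)
    import System.Posix.Files.ByteString (fileExist)

    main :: IO ()
    main = do
      let p :: RawFilePath
          p = B.pack "/tmp/some\xFFname"  -- bytes, not valid UTF-8
      fileExist p >>= print               -- no decoding step anywhere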
with a distinct wide-character API available for people who want to make the (substantial) effort involved in writing (genuinely) internationalised programs.
If you introduce an entirely new "i18n-only" API, then it'll surely become difficult. :-)
I18N is inherently difficult. Lots of textual data exists in lots of different encodings, and the encoding is frequently unspecified. It would be easier if we had a closed world where only one encoding was ever used. But we don't, and pretending that we do doesn't make it so.
Anything that isn't ISO-8859-1 just doesn't work for the most part, and anyone who wants to provide real I18N first has to work around the pseudo-I18N that's already there (e.g. convert Chars back into Word8s so that they can decode them into real Chars).
One more reason to fix the I/O functions to handle encodings and have a separate/underlying binary I/O API.
The problem is that we also need to fix them to handle *no encoding*. Also, binary data and text aren't disjoint. Everything is binary; some of it is *also* text.
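That suggests the layering: read bytes unconditionally, and make decoding a separate, failable step. A sketch using the bytestring and text libraries (the file name is made up):

    import qualified Data.ByteString as B
    import qualified Data.Text.Encoding as TE

    main :: IO ()
    main = do
      bytes <- B.readFile "input.dat"  -- reading bytes always succeeds
      case TE.decodeUtf8' bytes of     -- decoding is explicit and can fail
        Left err  -> putStrLn ("not UTF-8 text: " ++ show err)
        Right txt -> print txt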
Oh, and because bytes are being stored in Chars, the type system won't help if you neglect to decode a string, or if you decode it twice.
Yes, that's the problem with the current approach, i.e. that there's no easy way to get a list of Word8s out of a handle.
Or out of getDirectoryContents, getArgs, getEnv etc. Or to pass a list of Word8s to a handle, or to openFile, getEnv etc.
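If bytes and text were distinct types, the compiler would catch both mistakes. A sketch with made-up names (Bytes, Text' and decodeLatin1 are all hypothetical):

    import Data.Word (Word8)

    newtype Bytes = Bytes [Word8]  -- raw, undecoded data
    newtype Text' = Text' String   -- genuinely decoded characters

    -- Decoding consumes Bytes and produces Text', so decoding twice, or
    -- treating undecoded bytes as text, simply doesn't typecheck.
    decodeLatin1 :: Bytes -> Text'
    decodeLatin1 (Bytes ws) = Text' (map (toEnum . fromIntegral) ws)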
The current ISO-8859-1 assumption is also wrong. A program written in Haskell which sorts strings would already break for non-ASCII letters that are actually ISO-8859-2, unless the encoding is specified otherwise.
1. In that situation, you can't avoid the encoding issues. It doesn't matter what the default is, because you're going to have to set the encoding anyhow.
Why do you always want me to set the encoding? That should be the job of the RTS.
Because you might know the encoding, and the RTS doesn't. The locale is a fallback mechanism, for the situation where you *need* an encoding but one hasn't been specified by other means.
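In code, the fallback shape would be something like this (pickEncoding is a made-up name; getLocaleEncoding is from GHC.IO.Encoding):

    import System.IO (TextEncoding)
    import GHC.IO.Encoding (getLocaleEncoding)

    -- Prefer an encoding that is actually known (protocol header, file
    -- metadata, a caller's explicit choice); the locale comes last.
    pickEncoding :: Maybe TextEncoding -> IO TextEncoding
    pickEncoding (Just enc) = return enc
    pickEncoding Nothing    = getLocaleEncoding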
2. If you assume ISO-8859-1, you can always convert back to Word8
If I want a list of Word8s, then I should be able to get them without extracting them from a string.
The point is that, currently, you can't. Nothing in the core Haskell98 API actually uses Word8, it all uses Char/String.
then re-decode as UTF-8. If you assume UTF-8, anything which is neither UTF-8 nor ASCII will fail far more severely than just getting the collation order wrong.
If I use Strings to handle binary data, then I should expect things to break. If I want to get text and it's not in the expected encoding, then the user has messed up.
Or maybe the expectation is incorrect.
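The convert-back-to-Word8 escape hatch from point 2, as a sketch (stringToBytes is a made-up name):

    import Data.Char (ord)
    import Data.Word (Word8)

    -- Undo a latin1-style read: every Char produced that way is < 256,
    -- so the original bytes are recovered exactly and can be re-decoded
    -- under whatever encoding turns out to be the right one.
    stringToBytes :: String -> [Word8]
    stringToBytes = map (fromIntegral . ord)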
Well, my view is essentially that files should be treated as containing bytes unless you explicitly choose to decode them, at which point you have to specify the encoding.
Why do you always want me to _manually_ specify an encoding?
Because we don't have an "oracle" which will magically determine the encoding for you.
If I want bytes, I'll use the binary I/O API (currently being discussed; see the beginning of this thread); if I want Strings, i.e. text, I'll use the current I/O API (which is pretty text-oriented anyway; see hPutStrLn, hGetLine, ...).
If you want text, well, tough; what comes out of most system calls and core library functions (not just read()) is bytes. There isn't any magic wand which will turn them into characters without knowing the encoding.
completely new wide-character API for those who wish to use it.
Which would make it horrendously difficult to do even basic I18N.
Why?
That gets the failed attempt at I18N out of everyone's way with a minimum of effort and with maximum backwards compatibility for existing code.
If existing code expects Strings to be just a list of bytes, it's _broken_.
I know. That's what I'm saying. The problem is that the broken "code" is the Haskell98 API.
Strings are lists of Unicode characters; [Word8] is a list of bytes.
And what comes out of (and goes into) most core library functions is the latter.
--
Glynn Clements