
Marcin 'Qrczak' Kowalczyk wrote:
If you provide "wrapper" functions which take String arguments, either they should have an encoding argument or the encoding should be a mutable per-terminal setting.
There is already a mutable setting. It's called "locale".
It isn't a per-terminal setting.
It is possible for curses to be used with a terminal which doesn't use the locale's encoding.
No, it will break under the new wide character curses API,
Or expose the fact that the WC API is broken, depending upon your POV.
and it will confuse programs which use the old narrow character API.
It has no effect on the *byte* API. Characters don't come into it.
The user (or the administrator) is responsible for matching the locale encoding with the terminal encoding.
Which is rather hard to do if you have multiple encodings.
Also, it's quite common to use non-standard encodings with terminals (e.g. codepage 437, which has graphic characters beyond the ACS_* set which terminfo understands).
Curses doesn't support that.
Sure it does. You pass the appropriate bytes to waddstr() etc. and they get sent to the terminal as-is. Curses doesn't have ACS_* macros for those characters, but that doesn't mean you can't use them.
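For instance, something like this (just a sketch; it assumes the terminal really is displaying CP437 and that the narrow, byte-oriented curses API is in use; link with -lcurses):

    #include <curses.h>

    int main(void)
    {
        initscr();

        /* 0xC9/0xCD/0xBB and 0xC8/0xCD/0xBC are CP437 double-line
           box-drawing characters, which the standard ACS_* macros
           don't cover; to mvaddstr()/waddstr() they are just bytes,
           passed through to the terminal unchanged. */
        mvaddstr(0, 0, "\xC9\xCD\xCD\xBB");
        mvaddstr(1, 0, "\xC8\xCD\xCD\xBC");

        refresh();
        getch();
        endwin();
        return 0;
    }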
The locale encoding is the right encoding to use for conversion of the result of strerror, gai_strerror, the msg member of the gzip compressor state, etc. When an I/O error occurs and the error code is translated to a Haskell exception and then shown to the user, why would the application need to specify the encoding, and how?
Because the application may be using multiple locales/encodings.
But strerror always returns messages in the locale encoding.
Sorry, I misread that paragraph. I replied to "why would ..." without thinking about the context. When you know that a string is in the locale's encoding, you need to use it for the conversion. In that case you need to do the conversion (or at least record the actual encoding) immediately, in case the locale gets switched.
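Something along these lines (only a sketch, assuming iconv() and nl_langinfo(CODESET) are available; error handling omitted): grab the message and the locale's codeset immediately, and from then on the string's encoding is known regardless of later setlocale() calls.

    #include <errno.h>
    #include <iconv.h>
    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    /* Convert a string known to be in the current locale's encoding
       (e.g. the result of strerror()) to UTF-8 *now*, while we still
       know what that encoding is. */
    static void to_utf8(const char *in, char *out, size_t outsize)
    {
        iconv_t cd = iconv_open("UTF-8", nl_langinfo(CODESET));
        char *inp = (char *)in, *outp = out;
        size_t inleft = strlen(in), outleft = outsize - 1;

        if (cd != (iconv_t)-1) {
            iconv(cd, &inp, &inleft, &outp, &outleft);
            iconv_close(cd);
        }
        *outp = '\0';
    }

    int main(void)
    {
        char msg[256];

        setlocale(LC_ALL, "");
        to_utf8(strerror(ENOENT), msg, sizeof msg);

        /* Later locale switches no longer matter: msg is already in a
           known encoding. */
        puts(msg);
        return 0;
    }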
Just like Gtk+2 always accepts texts in UTF-8.
Unfortunately. The text probably originated in an encoding other than UTF-8, and will probably end up getting displayed using a font which is indexed using the original encoding (rather than e.g. UCS-2/4). Converting to Unicode then back again just introduces the potential for errors. [Particularly for CJK where, due to Han unification, Chinese characters may mutate into Japanese characters, or vice-versa. Fortunately, that doesn't seem to have started any wars. Yet.]
For compatibility the default locale is "C", but new programs which are prepared for I18N should do setlocale(LC_CTYPE, "") and setlocale(LC_MESSAGES, "").
In practice, you end up continuously calling setlocale(LC_CTYPE, "") and setlocale(LC_CTYPE, "C"), depending upon whether the text is meant to be human-readable (locale-dependent) or a machine-readable format (locale-independent, i.e. "C" locale).
[The most common example is printf("%f"). You need to use the C locale (decimal point) for machine-readable text but the user's locale (locale-specific decimal separator) for human-readable text.
This is a different thing, and it is what IMHO C did wrong.
It's a different example of the same problem. I agree that C did it wrong; I'm objecting to the implication that Haskell should make the same mistakes.
This isn't directly related to encodings per se, but a good example of why parameters are preferable to state.]
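Concretely, the printf("%f") juggling looks something like this (a sketch only; for %f the category which actually matters is LC_NUMERIC, while the encoding issues hinge on LC_CTYPE, but the pattern is the same):

    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 1234.5;

        /* Human-readable output: the user's locale, which may well
           print "1234,5". */
        setlocale(LC_NUMERIC, "");
        printf("total: %.1f\n", x);

        /* Machine-readable output (VRML, DXF, ...): must be the "C"
           locale so the decimal separator is always ".". */
        setlocale(LC_NUMERIC, "C");
        printf("%.1f\n", x);

        return 0;
    }

And because the setting being juggled is a single piece of global state, every other thread and library in the process sees the switches too.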
The LC_* environment variables are the parameters for the encoding.
But they are only really "parameters" at the exec() level. Once the program starts, the locale settings become global mutable state. I would have thought that, more than anyone else, the readership of this list would understand what's bad about that concept.
There is no other convention to pass the encoding to be used for textual output to stdout for example.
That's up to the application. Environment variables are a convenience; there's no reason why you can't have a command-line switch to select the encoding. For more complex applications, you often have user-selectable options and/or encodings specified in the data which you handle.

Another problem with having a single locale: if a program isn't working, and you need to communicate with its developers, you will often have to run the program in an English locale just so that you will get error messages which the developers understand.
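As a sketch of what "a command-line switch to select the encoding" can look like (the -e option and the fallback to nl_langinfo(CODESET) are assumptions for illustration, not any particular program's interface):

    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *encoding = NULL;
        int opt;

        /* Hypothetical -e <encoding> switch; the locale is consulted
           only if the user didn't say anything more specific. */
        while ((opt = getopt(argc, argv, "e:")) != -1)
            if (opt == 'e')
                encoding = optarg;

        if (encoding == NULL) {
            setlocale(LC_CTYPE, "");
            encoding = nl_langinfo(CODESET);
        }

        printf("using encoding: %s\n", encoding);
        /* ... pass `encoding` explicitly to whatever does the actual
           conversion, rather than leaving it in global state ... */
        return 0;
    }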
C libraries which use the locale do so as a last resort.
No, they do it by default.
By default, libc uses the C locale; setlocale() merely includes a convenience option to use the LC_* variables. Other libraries may or may not use the locale settings, and plenty of code will misbehave if the locale is wrong (e.g. writing "%f" output with fprintf() without explicitly setting the C locale first will do the wrong thing if you're trying to generate VRML/DXF/whatever files).

Beyond that, libc uses the locale mechanism because it was the simplest way to retrofit minimal I18N onto K&R C. It also means that most code can easily duck the issues (i.e. you don't have to pass a locale parameter to isupper() etc). OTOH, if you don't want to duck the issues, global locale settings are a nuisance.
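A trivial demonstration of the default (a sketch; run it with e.g. LC_ALL=de_DE to see the difference, assuming that locale is installed):

    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        /* Before any setlocale() call, the program is in the "C"
           locale, regardless of the LC_* environment variables. */
        printf("default decimal point: \"%s\"\n",
               localeconv()->decimal_point);

        /* setlocale(category, "") is the convenience option which
           reads LANG/LC_*. */
        setlocale(LC_ALL, "");
        printf("locale decimal point:  \"%s\"\n",
               localeconv()->decimal_point);
        return 0;
    }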
The only reason that the C locale mechanism isn't a major nuisance is that you can largely ignore it altogether.
Then how would a Haskell program know what encoding to use for stdout messages?
It doesn't necessarily need to. If you are using message catalogues, you just read bytes from the catalogue and write them to stdout. The issue then boils down to using the correct encoding for the catalogues; the code doesn't need to know.
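For example, with the catgets() message-catalogue interface (a sketch; the catalogue name "myprog" and the set/message numbers are invented): the bytes come out of the catalogue and go straight to stdout, and the program never inspects the encoding.

    #include <locale.h>
    #include <nl_types.h>
    #include <stdio.h>

    int main(void)
    {
        nl_catd cat;
        const char *msg;

        setlocale(LC_MESSAGES, "");          /* selects which catalogue */
        cat = catopen("myprog", NL_CAT_LOCALE);

        /* catgets() falls back to the supplied default string if the
           catalogue or the message is missing. */
        msg = catgets(cat, 1, 1, "cannot open file\n");
        fputs(msg, stdout);

        if (cat != (nl_catd)-1)
            catclose(cat);
        return 0;
    }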
How would it know how to interpret filenames for graphical display?
An option menu on the file selector is one option; heuristics are another. Both tend to produce better results in non-trivial cases than either of Gtk-2's choices: i.e. filenames must either be UTF-8 or match the locale (depending upon the G_BROKEN_FILENAMES setting); otherwise the filename simply doesn't exist. At least Gtk-1 would attempt to display the filename; you would get the odd question mark but at least you could select the file. Ultimately, the returned char* just gets passed to open(), so the encoding only really matters for display.
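A sketch of the kind of heuristic I mean (simplified: it doesn't reject overlong UTF-8 sequences, and the Latin-1 fallback is just one possible choice, not what any toolkit actually does): the original bytes are what would be handed to open(); only the displayed form is ever guessed at.

    #include <stdio.h>

    /* Simplified UTF-8 validity check. */
    static int is_valid_utf8(const unsigned char *s)
    {
        while (*s) {
            int len;
            if (*s < 0x80) { s++; continue; }
            if ((*s & 0xE0) == 0xC0) len = 1;
            else if ((*s & 0xF0) == 0xE0) len = 2;
            else if ((*s & 0xF8) == 0xF0) len = 3;
            else return 0;
            s++;
            while (len--) {
                if ((*s & 0xC0) != 0x80) return 0;
                s++;
            }
        }
        return 1;
    }

    int main(int argc, char **argv)
    {
        int i;

        for (i = 1; i < argc; i++) {
            /* The decision only affects *display*; the original bytes
               in argv[i] are what would be passed to open(). */
            if (is_valid_utf8((const unsigned char *)argv[i]))
                printf("%s: display as UTF-8\n", argv[i]);
            else
                printf("%s: fall back (e.g. Latin-1, or the locale)\n",
                       argv[i]);
        }
        return 0;
    }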
Code which requires real I18N can use other mechanisms, and code which doesn't require any I18N can just pass byte strings around and leave encoding issues to code which actually has enough context to handle them correctly.
Haskell can't just pass byte strings around without turning the Unicode support into a joke (which it is now).
If you try to pretend that I18N comes down to shoe-horning everything into Unicode, you will turn the language into a joke.
Haskell's Unicode support is a joke because the API designers tried to avoid the issues related to encoding with wishful thinking (i.e. you open a file and you magically get Unicode characters out of it).
The "current locale" mechanism is just a way of avoiding the issues as much as possible when you can't get away with avoiding them altogether.
Unicode has been described (accurately, IMHO) as "Esperanto for computers". Both use the same approach to try to solve essentially the same problem. And both will be about as successful in the long run.
--
Glynn Clements