
Glynn Clements
If you provide "wrapper" functions which take String arguments, either they should have an encoding argument or the encoding should be a mutable per-terminal setting.
There is already a mutable setting. It's called "locale".
It isn't a per-terminal setting.
A separate setting would force users to configure an encoding just for the purposes of Haskell programs, as if the configuration wasn't already too fragmented. It's unwise to propose a new standard when an existing standard works well enough.
It is possible for curses to be used with a terminal which doesn't use the locale's encoding.
No, it will break under the new wide character curses API,
Or expose the fact that the WC API is broken, depending upon your POV.
It's the only curses API which allows writing full-screen programs in UTF-8 mode.
Also, it's quite common to use non-standard encodings with terminals (e.g. codepage 437, which has graphic characters beyond the ACS_* set which terminfo understands).
curses doesn't support that.
Sure it does. You pass the appropriate bytes to waddstr() etc and they get sent to the terminal as-is.
It doesn't support that: it will switch the terminal mode to the "user" encoding (which is usually ISO-8859-x) at the first opportunity, e.g. after an ACS_* macro is used, or maybe even at initialization. curses supports two families of encodings: the current locale encoding and ACS. The locale encoding may be UTF-8 (which works only with the wide character API).
For compatibility the default locale is "C", but new programs which are prepared for I18N should do setlocale(LC_CTYPE, "") and setlocale(LC_MESSAGES, "").
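As a minimal sketch of that startup step in C (the helper name init_i18n is made up for illustration, and ignoring setlocale() failures is an assumption of this sketch, not a recommendation), a program could adopt only the two categories mentioned and then ask which encoding the locale implies:

```c
#include <locale.h>
#include <langinfo.h>

/* Hypothetical helper: adopt the user's locale for character handling
   and message language only, then report the locale's encoding name. */
const char *init_i18n(void)
{
    /* "" means "take the value from the LC_* / LANG environment".
       If the environment names an unavailable locale, setlocale()
       returns NULL and the category simply stays at "C". */
    setlocale(LC_CTYPE, "");

    /* LC_MESSAGES is a POSIX category, not an ISO C one. */
    setlocale(LC_MESSAGES, "");

    /* nl_langinfo(CODESET) names the encoding of the current LC_CTYPE
       locale; the exact spelling of the name varies between systems. */
    return nl_langinfo(CODESET);
}
```

Called once near the start of main(), this leaves every other category (LC_NUMERIC, LC_TIME, ...) at its startup value of "C".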
In practice, you end up continuously calling setlocale(LC_CTYPE, "") and setlocale(LC_CTYPE, "C"), depending upon whether the text is meant to be human-readable (locale-dependent) or a machine-readable format (locale-independent, i.e. "C" locale).
I wrote LC_CTYPE, not LC_ALL. LC_CTYPE doesn't affect %f formatting; it only affects the encoding of texts emitted by gettext (including strerror) and the meaning of isalpha, toupper etc.
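A small sketch of the point about %f, assuming a program that has not called setlocale() for any other category (the helper name format_with_ctype_only is made up for illustration): the radix character is governed by LC_NUMERIC, which stays at "C" when only LC_CTYPE is set.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: format a double after adopting only LC_CTYPE. */
int format_with_ctype_only(char *buf, size_t size, double value)
{
    /* Take character classification and encoding from the environment.
       On failure the category stays at "C", which is fine for this sketch. */
    setlocale(LC_CTYPE, "");

    /* LC_NUMERIC was never touched, so it still has its startup value "C"
       and the radix character printed by %f is '.'. */
    return snprintf(buf, size, "%.1f", value);
}
```

With this, format_with_ctype_only(buf, sizeof buf, 1.5) writes "1.5" into buf whatever the user's locale is, so machine-readable output does not need a switch back to the "C" locale.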
The LC_* environment variables are the parameters for the encoding.
But they are only really "parameters" at the exec() level.
This is usually the right place to specify it. It's rare that they are even set separately for the given program - usually they are per-system or per-user.
Once the program starts, the locale settings become global mutable state. I would have thought that, more than anyone else, the readership of this list would understand what's bad about that concept.
You can treat it as immutable. Just don't call setlocale with different arguments again.
Another problem with having a single locale: if a program isn't working, and you need to communicate with its developers, you will often have to run the program in an English locale just so that you will get error messages which the developers understand.
You don't need to change LC_CTYPE for that. Just set LC_MESSAGES.
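Normally this is done from outside, by setting the LC_MESSAGES environment variable for one run (bearing in mind that LC_ALL, if set, overrides it). The sketch below does the equivalent inside the process, only to show that the two categories are independent; the helper name untranslated_error_text is made up for illustration.

```c
#include <errno.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: return the untranslated text for an errno value
   while leaving the character encoding under the user's control. */
const char *untranslated_error_text(int errnum)
{
    /* Encoding and character classification still come from the
       environment (or stay at "C" if that locale is unavailable). */
    setlocale(LC_CTYPE, "");

    /* Only the message language is forced to the untranslated "C" form. */
    setlocale(LC_MESSAGES, "C");

    return strerror(errnum);
}
```

For example, untranslated_error_text(ENOENT) gives the message the developers would recognise, without touching LC_CTYPE.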
Then how would a Haskell program know what encoding to use for stdout messages?
It doesn't necessarily need to. If you are using message catalogues, you just read bytes from the catalogue and write them to stdout.
gettext uses the locale to choose the encoding. Messages are internally stored as UTF-8 but emitted in the locale encoding. You are using the semantics I'm advocating without knowing that...
How would it know how to interpret filenames for graphical display?
An option menu on the file selector is one option; heuristics are another.
Heuristics won't distinguish various ISO-8859-x from each other. An option menu on the file selector is user-unfriendly because users don't want to configure it for each program separately. They want to set it in one place and expect it to work everywhere. Currently there are two such places: the locale, and G_FILENAME_ENCODING (or older G_BROKEN_FILENAMES) for glib. It's unwise to introduce yet another convention, and it would be a horrible idea to make it per-program.
At least Gtk-1 would attempt to display the filename; you would get the odd question mark but at least you could select the file;
Gtk+2 also attempts to display the filename. It can be opened even though the filename has inconvertible characters escaped.
The "current locale" mechanism is just a way of avoiding the issues as much as possible when you can't get away with avoiding them altogether.
It's a way to communicate the encoding of the terminal, filenames, strerror, gettext etc.
Unicode has been described (accurately, IMHO) as "Esperanto for computers". Both use the same approach to try to solve essentially the same problem. And both will be about as successful in the long run.
Unicode has no viable competition. Esperanto had English.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/