invalid character encoding

I've got some gzip (and Ian Lynagh's Inflate) code that breaks under the new hugs with:

<handle>: IO.getContents: protocol error (invalid character encoding)

What is going on, and how can I fix it?

Thanks,
John

On Mon, Mar 14, 2005 at 07:38:09PM -0600, John Goerzen wrote:
I've got some gzip (and Ian Lynagh's Inflate) code that breaks under the new hugs with:
<handle>: IO.getContents: protocol error (invalid character encoding)
What is going on, and how can I fix it?
A Haskell 98 Handle is a character stream, and doesn't support binary I/O. This would have bitten you sooner or later on systems that do CRLF conversion, but Hugs is now much stricter, because character streams now use the encoding determined by the current locale (for the C locale, that means ASCII only). You can select binary I/O using the openBinaryFile and hSetBinaryMode functions from System.IO. After that, the Chars you get from that Handle are actually bytes.
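A minimal sketch of the fix described here, assuming the compressed data comes from a file (the filename and the surrounding program are illustrative, not John's actual code):

  import System.IO

  main :: IO ()
  main = do
    h <- openBinaryFile "input.gz" ReadMode  -- or: hSetBinaryMode stdin True
    s <- hGetContents h                      -- each Char now carries one byte
    print (length s)                         -- force the read; no decoding occurs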

On Tue, Mar 15, 2005 at 10:44:28AM +0000, Ross Paterson wrote:
On Mon, Mar 14, 2005 at 07:38:09PM -0600, John Goerzen wrote:
I've got some gzip (and Ian Lynagh's Inflate) code that breaks under the new hugs with:
<handle>: IO.getContents: protocol error (invalid character encoding)
What is going on, and how can I fix it?
A Haskell 98 Handle is a character stream, and doesn't support binary I/O. This would have bitten you sooner or later on systems that do CRLF
Yes, probably so..
conversion, but Hugs is now much stricter, because character streams now use the encoding determined by the current locale (for the C locale, that means ASCII only).
Hmm, this seems to be completely undocumented. So yes, I'll try using openBinaryFile, but the docs I have seen still talk only about CRLF and ^Z. Anyway, I'm interested in this new feature (I assume GHC 6.4 has it as well?) Would it, for instance, automatically convert from Latin-1 to UTF-16 on read, and the inverse on write? Or to/from UTF-8? Thanks, -- John

On Tue, Mar 15, 2005 at 08:12:48AM -0600, John Goerzen wrote:
[...] but Hugs is now much stricter, because character streams now use the encoding determined by the current locale (for the C locale, that means ASCII only).
Hmm, this seems to be completely undocumented.
It's mentioned in the release history in the User's Guide, which refers to section 3.3 for (some) more details.

On Tue, Mar 15, 2005 at 10:44:28AM +0000, Ross Paterson wrote:
On Mon, Mar 14, 2005 at 07:38:09PM -0600, John Goerzen wrote:
I've got some gzip (and Ian Lynagh's Inflate) code that breaks under the new hugs with:
<handle>: IO.getContents: protocol error (invalid character encoding)
What is going on, and how can I fix it?
A Haskell 98 Handle is a character stream, and doesn't support binary I/O. This would have bitten you sooner or later on systems that do CRLF conversion, but Hugs is now much stricter, because character streams now use the encoding determined by the current locale (for the C locale, that means ASCII only).
Do you have a list of functions which behave differently in the new release to how they did in the previous release? (I'm not interested in changes that will affect only whether something compiles, not how it behaves given it compiles both before and after). Simons, Malcolm, are there any such functions in the new ghc/nhc98?

Also, are you all agreed that the hugs interpretation of the report is correct, and thus ghc at least is buggy in this respect? (I'm afraid I haven't been able to test nhc98 yet).

Finally, the hugs behaviour seems a little odd to me. The below shows 4 cases where iconv complains when asked to convert utf8 to utf8, but hugs only gives an error in one of them. In the others it just truncates the input. Is this really correct? It also seems to behave the same for me regardless of whether I export LC_CTYPE to en_GB.UTF-8 or C.

Thanks
Ian

printf "\x00\x7F" > inp1
printf "\x00\x80" > inp2
printf "\x00\xC4" > inp3
printf "\xFF\xFF" > inp4
printf "\xb1\x41\x00\x03\x65\x6d\x70\x74\x79\x00\x03\x00\x00\x00\x00\x00" > inp5
echo 'main = do xs <- getContents; print xs' > run.hs
for i in `seq 1 5`; do runhugs run.hs < inp$i; done
for i in `seq 1 5`; do runghc6 run.hs < inp$i; done
for i in `seq 1 5`; do echo $i; iconv -f utf8 -t utf8 < inp$i; done

which gives me the following output:

$ for i in `seq 1 5`; do runhugs run.hs < inp$i; done
"\NUL\DEL"
"\NUL"
"\NUL"
""
"
Program error: <stdin>: IO.getContents: protocol error (invalid character encoding)
$ for i in `seq 1 5`; do runghc6 run.hs < inp$i; done
"\NUL\DEL"
"\NUL\128"
"\NUL\196"
"\255\255"
"\177A\NUL\ETXempty\NUL\ETX\NUL\NUL\NUL\NUL\NUL"
$ for i in `seq 1 5`; do echo $i; iconv -f utf8 -t utf8 < inp$i; done
1
2
iconv: illegal input sequence at position 1
3
iconv: incomplete character or shift sequence at end of buffer
4
iconv: illegal input sequence at position 0
5
iconv: illegal input sequence at position 0
$

On Wed, Mar 16, 2005 at 03:54:19AM +0000, Ian Lynagh wrote:
Do you have a list of functions which behave differently in the new release to how they did in the previous release? (I'm not interested in changes that will affect only whether something compiles, not how it behaves given it compiles both before and after).
I got lost in the negatives here. It affects all Haskell 98 primitives that do character I/O, or that exchange C strings with the C library. It doesn't affect functions added by the hierarchical libraries, i.e. those functions are safe only with the ASCII subset. (There is a vague plan to make Foreign.C.String conform to the FFI spec, which mandates locale-based encoding, and thus would change all those, but it's still up in the air.)
Finally, the hugs behaviour seems a little odd to me. The below shows 4 cases where iconv complains when asked to convert utf8 to utf8, but hugs only gives an error in one of them. In the others it just truncates the input. Is this really correct? It also seems to behave the same for me regardless of whether I export LC_CTYPE to en_GB.UTF-8 or C.
It's a bug: an unrecognized encoding at the end of the input was being ignored instead of triggering the exception. Now fixed in CVS (rev. 1.14 of src/char.c if anyone's backporting). It was an accident of this example that the behaviour in all locales was the same.

On Wed, 2005-03-16 at 11:55 +0000, Ross Paterson wrote:
On Wed, Mar 16, 2005 at 03:54:19AM +0000, Ian Lynagh wrote:
Do you have a list of functions which behave differently in the new release to how they did in the previous release? (I'm not interested in changes that will affect only whether something compiles, not how it behaves given it compiles both before and after).
I got lost in the negatives here. It affects all Haskell 98 primitives that do character I/O, or that exchange C strings with the C library.
It doesn't affect functions added by the hierarchical libraries, i.e. those functions are safe only with the ASCII subset. (There is a vague plan to make Foreign.C.String conform to the FFI spec, which mandates locale-based encoding, and thus would change all those, but it's still up in the air.)
Hmm. I'm not convinced that automatically converting to the current locale is the ideal behaviour (it'd certainly break all my programs!). Certainly a function for converting into the encoding of the current locale would be useful for many users, but it's important to be able to know the encoding with certainty. For example, some libraries (e.g. Gtk+) take all strings in UTF-8 irrespective of the current locale (it does locale-dependent conversions on I/O etc., but the internal representation is always UTF-8). We do the conversion to UTF-8 on the Haskell side and so produce a byte string which we marshal using the FFI CString functions. If the implementations get fixed to conform to the FFI spec, I suppose we could roll our own version of withCString that marshals [Word8] -> char*. Duncan
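As a rough sketch of the [Word8] -> char* marshalling Duncan mentions (the function name is made up; it simply copies the bytes into a temporary NUL-terminated buffer):

  import Data.Word (Word8)
  import Foreign.C.String (CString)
  import Foreign.Marshal.Array (withArray0)
  import Foreign.Ptr (castPtr)

  -- Marshal an already-encoded byte string, with no locale conversion.
  withWord8CString :: [Word8] -> (CString -> IO a) -> IO a
  withWord8CString bytes act = withArray0 0 bytes (act . castPtr)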

On Wed, 2005-03-16 at 13:09 +0000, Duncan Coutts wrote:
On Wed, 2005-03-16 at 11:55 +0000, Ross Paterson wrote:
It doesn't affect functions added by the hierarchical libraries, i.e. those functions are safe only with the ASCII subset. (There is a vague plan to make Foreign.C.String conform to the FFI spec, which mandates locale-based encoding, and thus would change all those, but it's still up in the air.)
Hmm. I'm not convinced that automatically converting to the current locale is the ideal behaviour (it'd certainly break all my programs!). Certainly a function for converting into the encoding of the current locale would be useful for many users, but it's important to be able to know the encoding with certainty. For example, some libraries (e.g. Gtk+) take all strings in UTF-8 irrespective of the current locale (it does locale-dependent conversions on I/O etc., but the internal representation is always UTF-8). We do the conversion to UTF-8 on the Haskell side and so produce a byte string which we marshal using the FFI CString functions.
Silly me! There are C marshaling functions that are specified to do just this, but I never noticed them before! withCAString and similar functions treat Haskell Strings as byte strings. Duncan
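A small illustration of that behaviour: withCAString and peekCAString marshal each Char as one byte, with no locale conversion, so a string already encoded to UTF-8 on the Haskell side passes through unchanged (the literal below is "café" pre-encoded as UTF-8 bytes, purely as an example):

  import Foreign.C.String (withCAString, peekCAString)

  main :: IO ()
  main = do
    let utf8Bytes = "caf\xC3\xA9"                 -- one Char per UTF-8 byte
    back <- withCAString utf8Bytes peekCAString   -- out to char* and back
    print (back == utf8Bytes)                     -- True: the bytes round-trip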

Duncan Coutts
It doesn't affect functions added by the hierarchical libraries, i.e. those functions are safe only with the ASCII subset. (There is a vague plan to make Foreign.C.String conform to the FFI spec, which mandates locale-based encoding, and thus would change all those, but it's still up in the air.)
Hmm. I'm not convinced that automatically converting to the current locale is the ideal behaviour (it'd certainly break all my programs!). Certainly a function for converting into the encoding of the current locale would be useful for many users, but it's important to be able to know the encoding with certainty.
It should only be the default, not the only option. It should be possible to specify the encoding explicitly. -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

Marcin 'Qrczak' Kowalczyk wrote:
It doesn't affect functions added by the hierarchical libraries, i.e. those functions are safe only with the ASCII subset. (There is a vague plan to make Foreign.C.String conform to the FFI spec, which mandates locale-based encoding, and thus would change all those, but it's still up in the air.)
Hmm. I'm not convinced that automatically converting to the current locale is the ideal behaviour (it'd certainly break all my programs!). Certainly a function for converting into the encoding of the current locale would be useful for many users, but it's important to be able to know the encoding with certainty.
It should only be the default, not the only option.
I'm not sure that it should be available at all.
It should be possible to specify the encoding explicitly.
Conversely, it shouldn't be possible to avoid specifying the encoding
explicitly.
Personally, I wouldn't provide an all-in-one "convert String to
CString using locale's encoding" function, just in case anyone was
tempted to actually use it.
The decision as to the encoding belongs in application code; not in
(most) libraries, and definitely not in the language.
[Libraries dealing with file formats or communication protocols which
mandate a specific encoding are an exception. But they will be using a
fixed encoding, not the locale's encoding.]
If application code chooses to use the locale's encoding, it can
retrieve it then pass it as the encoding argument to any applicable
functions.
If application code doesn't want to use the locale's encoding, it
shouldn't be shoe-horned into doing so because a library developer
decided to duck the encoding issues by grabbing whatever encoding was
readily to hand (i.e. the locale's encoding).
--
Glynn Clements

Glynn Clements
It should be possible to specify the encoding explicitly.
Conversely, it shouldn't be possible to avoid specifying the encoding explicitly.
What encoding should a binding to readline or curses use?

Curses in C comes in two flavors: the traditional byte version and a wide character version. The second version is easy if we can assume that wchar_t is Unicode, but it's not always available and until recently in ncurses it was buggy. Let's assume we are using the byte version. How to encode strings?

A terminal uses an ASCII-compatible encoding. The wide character version of curses converts characters to the locale encoding, and the byte version passes bytes unchanged. This means that if a Haskell binding to the wide character version does the obvious thing and passes Unicode directly, then an equivalent behavior can be obtained from the byte version (only limited to 256-character encodings) by using the locale encoding.

The locale encoding is the right encoding to use for conversion of the result of strerror, gai_strerror, msg member of gzip compressor state etc. When an I/O error occurs and the error code is translated to a Haskell exception and then shown to the user, why would the application need to specify the encoding and how?
If application code doesn't want to use the locale's encoding, it shouldn't be shoe-horned into doing so because a library developer decided to duck the encoding issues by grabbing whatever encoding was readily to hand (i.e. the locale's encoding).
If a C library is written with the assumption that texts are in the locale encoding, a Haskell binding to such a library should respect that assumption. Only some libraries allow working with different, explicitly specified encodings. Many libraries don't, especially if the texts are not the core of the library functionality but error messages. -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

Marcin 'Qrczak' Kowalczyk wrote:
Glynn Clements writes:
It should be possible to specify the encoding explicitly.
Conversely, it shouldn't be possible to avoid specifying the encoding explicitly.
What encoding should a binding to readline or curses use?
Curses in C comes in two flavors: the traditional byte version and a wide character version. The second version is easy if we can assume that wchar_t is Unicode, but it's not always available and until recently in ncurses it was buggy. Let's assume we are using the byte version. How to encode strings?
The (non-wchar) curses API functions take byte strings (char*), so the Haskell bindings should take CString or [Word8] arguments. If you provide "wrapper" functions which take String arguments, either they should have an encoding argument or the encoding should be a mutable per-terminal setting.
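As a rough sketch of the shape such a binding could take (all names here are hypothetical, not an existing curses binding): the low-level call accepts bytes, and the String wrapper takes the encoder as an explicit parameter instead of consulting the locale.

  import Data.Char (ord)
  import Data.Word (Word8)

  -- Stand-in for a real binding to waddstr(); a real one would call into C.
  waddBytes :: [Word8] -> IO ()
  waddBytes = mapM_ print

  -- Wrapper with an explicit encoder rather than an implicit locale.
  waddStringWith :: (String -> [Word8]) -> String -> IO ()
  waddStringWith encode = waddBytes . encode

  -- One possible encoder; only valid for code points below 256.
  latin1Encode :: String -> [Word8]
  latin1Encode = map (fromIntegral . ord)

  main :: IO ()
  main = waddStringWith latin1Encode "hello"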
A terminal uses an ASCII-compatible encoding. The wide character version of curses converts characters to the locale encoding, and the byte version passes bytes unchanged. This means that if a Haskell binding to the wide character version does the obvious thing and passes Unicode directly, then an equivalent behavior can be obtained from the byte version (only limited to 256-character encodings) by using the locale encoding.
I don't know enough about the wchar version of curses to comment on that. I do know that, to work reliably, the normal (byte) version of curses needs to pass "printable" bytes through unmodified. It is possible for curses to be used with a terminal which doesn't use the locale's encoding. Specifically, a single process may use curses with multiple terminals with differing encodings, e.g. an airport public information system displaying information in multiple languages. Also, it's quite common to use non-standard encodings with terminals (e.g. codepage 437, which has graphic characters beyond the ACS_* set which terminfo understands).
The locale encoding is the right encoding to use for conversion of the result of strerror, gai_strerror, msg member of gzip compressor state etc. When an I/O error occurs and the error code is translated to a Haskell exception and then shown to the user, why would the application need to specify the encoding and how?
Because the application may be using multiple locales/encodings. Having had to do this in C (i.e. repeatedly calling setlocale() to select the correct encoding), I would much prefer to have been able to pass the locale as a parameter. [The most common example is printf("%f"). You need to use the C locale (decimal point) for machine-readable text but the user's locale (locale-specific decimal separator) for human-readable text. This isn't directly related to encodings per se, but a good example of why parameters are preferable to state.]
If application code doesn't want to use the locale's encoding, it shouldn't be shoe-horned into doing so because a library developer decided to duck the encoding issues by grabbing whatever encoding was readily to hand (i.e. the locale's encoding).
If a C library is written with the assumption that texts are in the locale encoding, a Haskell binding to such a library should respect that assumption.
C libraries which use the locale do so as a last resort. K&R C completely ignored I18N issues. ANSI C added the locale mechanism as a hack to provide minimal I18N support while maintaining backward compatibility in a minimally-intrusive manner. The only reason that the C locale mechanism isn't a major nuisance is that you can largely ignore it altogether. Code which requires real I18N can use other mechanisms, and code which doesn't require any I18N can just pass byte strings around and leave encoding issues to code which actually has enough context to handle them correctly.
Only some libraries allow working with different, explicitly specified encodings. Many libraries don't, especially if the texts are not the core of the library functionality but error messages.
And most such libraries just treat text as byte strings. They don't
care about their interpretation, or even whether or not they are valid
in the locale's encoding.
--
Glynn Clements

Glynn Clements
The (non-wchar) curses API functions take byte strings (char*), so the Haskell bindings should take CString or [Word8] arguments.
Programmers will not want to use such an interface. When they want to display a string, it will be in the Haskell String type. And it prevents having a single Haskell interface which uses either the narrow or wide version of the curses interface, depending on what is available.
If you provide "wrapper" functions which take String arguments, either they should have an encoding argument or the encoding should be a mutable per-terminal setting.
There is already a mutable setting. It's called "locale".
I don't know enough about the wchar version of curses to comment on that.
It uses wcsrtombs or equivalents to display characters. And the reverse to interpret keystrokes.
It is possible for curses to be used with a terminal which doesn't use the locale's encoding.
No, it will break under the new wide character curses API, and it will confuse programs which use the old narrow character API. The user (or the administrator) is responsible for matching the locale encoding with the terminal encoding.
Also, it's quite common to use non-standard encodings with terminals (e.g. codepage 437, which has graphic characters beyond the ACS_* set which terminfo understands).
curses don't support that.
The locale encoding is the right encoding to use for conversion of the result of strerror, gai_strerror, msg member of gzip compressor state etc. When an I/O error occurs and the error code is translated to a Haskell exception and then shown to the user, why would the application need to specify the encoding and how?
Because the application may be using multiple locales/encodings.
But strerror always returns messages in the locale encoding. Just like Gtk+2 always accepts texts in UTF-8. For compatibility the default locale is "C", but new programs which are prepared for I18N should do setlocale(LC_CTYPE, "") and setlocale(LC_MESSAGES, ""). There are places where the encoding is settable independently, or stored explicitly. For them Haskell should have withCString / peekCString / etc. with an explicit encoding. And there are places which use the locale encoding instead of having a separate switch.
[The most common example is printf("%f"). You need to use the C locale (decimal point) for machine-readable text but the user's locale (locale-specific decimal separator) for human-readable text.
This is a different thing, and it is what IMHO C did wrong.
This isn't directly related to encodings per se, but a good example of why parameters are preferable to state.]
The LC_* environment variables are the parameters for the encoding. There is no other convention to pass the encoding to be used for textual output to stdout for example.
C libraries which use the locale do so as a last resort.
No, they do it by default.
The only reason that the C locale mechanism isn't a major nuisance is that you can largely ignore it altogether.
Then how would a Haskell program know what encoding to use for stdout messages? How would it know how to interpret filenames for graphical display? Do you want to invent a separate mechanism for communicating that, so that an administrator has to set up a dozen environment variables and teach each program separately about the encoding it should assume by default? We had this mess 10 years ago, and parts of it are still alive today - you must sometimes configure xterm or Emacs separately, but it's becoming more common that programs know to use the system-supplied setting and don't have to be configured separately.
Code which requires real I18N can use other mechanisms, and code which doesn't require any I18N can just pass byte strings around and leave encoding issues to code which actually has enough context to handle them correctly.
Haskell can't just pass byte strings around without turning the Unicode support into a joke (which it is now). -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

I cannot help feeling that all this multi-language support is a mess. All strings should be coded in a universal encoding (like UTF-8) so that the code for a character is the same independent of locale. It seems stupid that the locale affects the character encodings... the code for an 'a' should be the same all over the world... as should the code for a particular Japanese character. In other words the locale should have no effect on character encodings; it should select between multi-lingual error messages which are supplied as distinct strings for each region. While we may have to inter-operate with 'C' code, we could have a Haskell library that does things properly. Keean.

Marcin 'Qrczak' Kowalczyk wrote:
Glynn Clements writes:
The (non-wchar) curses API functions take byte strings (char*), so the Haskell bindings should take CString or [Word8] arguments.
Programmers will not want to use such an interface. When they want to display a string, it will be in the Haskell String type.
And it prevents having a single Haskell interface which uses either the narrow or wide version of curses interface, depending on what is available.
If you provide "wrapper" functions which take String arguments, either they should have an encoding argument or the encoding should be a mutable per-terminal setting.
There is already a mutable setting. It's called "locale".
I don't know enough about the wchar version of curses to comment on that.
It uses wcsrtombs or equivalents to display characters. And the reverse to interpret keystrokes.
It is possible for curses to be used with a terminal which doesn't use the locale's encoding.
No, it will break under the new wide character curses API, and it will confuse programs which use the old narrow character API.
The user (or the administrator) is responsible for matching the locale encoding with the terminal encoding.
Also, it's quite common to use non-standard encodings with terminals (e.g. codepage 437, which has graphic characters beyond the ACS_* set which terminfo understands).
curses don't support that.
The locale encoding is the right encoding to use for conversion of the result of strerror, gai_strerror, msg member of gzip compressor state etc. When an I/O error occurs and the error code is translated to a Haskell exception and then shown to the user, why would the application need to specify the encoding and how?
Because the application may be using multiple locales/encodings.
But strerror always returns messages in the locale encoding. Just like Gtk+2 always accepts texts in UTF-8.
For compatibility the default locale is "C", but new programs which are prepared for I18N should do setlocale(LC_CTYPE, "") and setlocale(LC_MESSAGES, "").
There are places where the encoding is settable independently, or stored explicitly. For them Haskell should have withCString / peekCString / etc. with an explicit encoding. And there are places which use the locale encoding instead of having a separate switch.
[The most common example is printf("%f"). You need to use the C locale (decimal point) for machine-readable text but the user's locale (locale-specific decimal separator) for human-readable text.
This is a different thing, and it is what IMHO C did wrong.
This isn't directly related to encodings per se, but a good example of why parameters are preferable to state.]
The LC_* environment variables are the parameters for the encoding. There is no other convention to pass the encoding to be used for textual output to stdout for example.
C libraries which use the locale do so as a last resort.
No, they do it by default.
The only reason that the C locale mechanism isn't a major nuisance is that you can largely ignore it altogether.
Then how would a Haskell program know what encoding to use for stdout messages? How would it know how to interpret filenames for graphical display?
Do you want to invent a separate mechanism for communicating that, so that an administrator has to set up a dozen environment variables and teach each program separately about the encoding it should assume by default? We had this mess 10 years ago, and parts of it are still alive today - you must sometimes configure xterm or Emacs separately, but it's becoming more common that programs know to use the system-supplied setting and don't have to be configured separately.
Code which requires real I18N can use other mechanisms, and code which doesn't require any I18N can just pass byte strings around and leave encoding issues to code which actually has enough context to handle them correctly.
Haskell can't just pass byte strings around without turning the Unicode support into a joke (which it is now).

Marcin 'Qrczak' Kowalczyk wrote:
If you provide "wrapper" functions which take String arguments, either they should have an encoding argument or the encoding should be a mutable per-terminal setting.
There is already a mutable setting. It's called "locale".
It isn't a per-terminal setting.
It is possible for curses to be used with a terminal which doesn't use the locale's encoding.
No, it will break under the new wide character curses API,
Or expose the fact that the WC API is broken, depending upon your POV.
and it will confuse programs which use the old narrow character API.
It has no effect on the *byte* API. Characters don't come into it.
The user (or the administrator) is responsible for matching the locale encoding with the terminal encoding.
Which is rather hard to do if you have multiple encodings.
Also, it's quite common to use non-standard encodings with terminals (e.g. codepage 437, which has graphic characters beyond the ACS_* set which terminfo understands).
curses don't support that.
Sure it does. You pass the appropriate bytes to waddstr() etc and they get sent to the terminal as-is. Curses doesn't have ACS_* macros for those characters, but it doesn't mean that you can't use them.
The locale encoding is the right encoding to use for conversion of the result of strerror, gai_strerror, msg member of gzip compressor state etc. When an I/O error occurs and the error code is translated to a Haskell exception and then shown to the user, why would the application need to specify the encoding and how?
Because the application may be using multiple locales/encodings.
But strerror always returns messages in the locale encoding.
Sorry, I misread that paragraph. I replied to "why would ..." without thinking about the context. When you know that a string is in the locale's encoding, you need to use it for the conversion. In that case you need to do the conversion (or at least record the actual encoding) immediately, in case the locale gets switched.
Just like Gtk+2 always accepts texts in UTF-8.
Unfortunately. The text probably originated in an encoding other than UTF-8, and will probably end up getting displayed using a font which is indexed using the original encoding (rather than e.g. UCS-2/4). Converting to Unicode then back again just introduces the potential for errors. [Particularly for CJK where, due to Han unification, Chinese characters may mutate into Japanese characters, or vice-versa. Fortunately, that doesn't seem to have started any wars. Yet.]
For compatibility the default locale is "C", but new programs which are prepared for I18N should do setlocale(LC_CTYPE, "") and setlocale(LC_MESSAGES, "").
In practice, you end up continuously calling setlocale(LC_CTYPE, "") and setlocale(LC_CTYPE, "C"), depending upon whether the text is meant to be human-readable (locale-dependent) or a machine-readable format (locale-independent, i.e. "C" locale).
[The most common example is printf("%f"). You need to use the C locale (decimal point) for machine-readable text but the user's locale (locale-specific decimal separator) for human-readable text.
This is a different thing, and it is what IMHO C did wrong.
It's a different example of the same problem. I agree that C did it wrong; I'm objecting to the implication that Haskell should make the same mistakes.
This isn't directly related to encodings per se, but a good example of why parameters are preferable to state.]
The LC_* environment variables are the parameters for the encoding.
But they are only really "parameters" at the exec() level. Once the program starts, the locale settings become global mutable state. I would have thought that, more than anyone else, the readership of this list would understand what's bad about that concept.
There is no other convention to pass the encoding to be used for textual output to stdout for example.
That's up to the application. Environment variables are a convenience; there's no reason why you can't have a command-line switch to select the encoding. For more complex applications, you often have user-selectable options and/or encodings specified in the data which you handle. Another problem with having a single locale: if a program isn't working, and you need to communicate with its developers, you will often have to run the program in an English locale just so that you will get error messages which the developers understand.
C libraries which use the locale do so as a last resort.
No, they do it by default.
By default, libc uses the C locale. setlocale() includes a convenience option to use the LC_* variables. Other libraries may or may not use the locale settings, and plenty of code will misbehave if the locale is wrong (e.g. using fprintf("%f") without explicitly setting the C locale first will do the wrong thing if you're trying to generate VRML/DXF/whatever files). Beyond that, libc uses the locale mechanism because it was the simplest way to retrofit minimal I18N onto K&R C. It also means that most code can easily duck the issues (i.e. so you don't have to pass a locale parameter to isupper() etc). OTOH, if you don't want to duck the issue, global locale settings are a nuisance.
The only reason that the C locale mechanism isn't a major nuisance is that you can largely ignore it altogether.
Then how would a Haskell program know what encoding to use for stdout messages?
It doesn't necessarily need to. If you are using message catalogues, you just read bytes from the catalogue and write them to stdout. The issue then boils down to using the correct encoding for the catalogues; the code doesn't need to know.
How would it know how to interpret filenames for graphical display?
An option menu on the file selector is one option; heuristics are another. Both tend to produce better results in non-trivial cases than either of Gtk-2's choices: i.e. filenames must be either UTF-8 or must match the locale (depending upon the G_BROKEN_FILENAMES setting), otherwise the filename simply doesn't exist. At least Gtk-1 would attempt to display the filename; you would get the odd question mark but at least you could select the file; ultimately, the returned char* just gets passed to open(), so the encoding only really matters for display.
Code which requires real I18N can use other mechanisms, and code which doesn't require any I18N can just pass byte strings around and leave encoding issues to code which actually has enough context to handle them correctly.
Haskell can't just pass byte strings around without turning the Unicode support into a joke (which it is now).
If you try to pretend that I18N comes down to shoe-horning everything
into Unicode, you will turn the language into a joke.
Haskell's Unicode support is a joke because the API designers tried to
avoid the issues related to encoding with wishful thinking (i.e. you
open a file and you magically get Unicode characters out of it).
The "current locale" mechanism is just a way of avoiding the issues as
much as possible when you can't get away with avoiding them
altogether.
Unicode has been described (accurately, IMHO) as "Esperanto for
computers". Both use the same approach to try to solve essentially the
same problem. And both will be about as successful in the long run.
--
Glynn Clements

Glynn Clements
If you provide "wrapper" functions which take String arguments, either they should have an encoding argument or the encoding should be a mutable per-terminal setting.
There is already a mutable setting. It's called "locale".
It isn't a per-terminal setting.
A separate setting would force users to configure an encoding just for the purposes of Haskell programs, as if the configuration wasn't already too fragmented. It's unwise to propose a new standard when an existing standard works well enough.
It is possible for curses to be used with a terminal which doesn't use the locale's encoding.
No, it will break under the new wide character curses API,
Or expose the fact that the WC API is broken, depending upon your POV.
It's the only curses API which allows writing full-screen programs in UTF-8 mode.
Also, it's quite common to use non-standard encodings with terminals (e.g. codepage 437, which has graphic characters beyond the ACS_* set which terminfo understands).
curses don't support that.
Sure it does. You pass the appropriate bytes to waddstr() etc and they get sent to the terminal as-is.
It doesn't support that, and it will switch the terminal mode to the "user" encoding (which is usually ISO-8859-x) at the first opportunity, e.g. after an ACS_* macro was used, or maybe even at initialization. curses support two families of encodings: the current locale encoding and ACS. The locale encoding may be UTF-8 (works only with wide character API).
For compatibility the default locale is "C", but new programs which are prepared for I18N should do setlocale(LC_CTYPE, "") and setlocale(LC_MESSAGES, "").
In practice, you end up continuously calling setlocale(LC_CTYPE, "") and setlocale(LC_CTYPE, "C"), depending upon whether the text is meant to be human-readable (locale-dependent) or a machine-readable format (locale-independent, i.e. "C" locale).
I wrote LC_CTYPE, not LC_ALL. LC_CTYPE doesn't affect %f formatting; it only affects the encoding of texts emitted by gettext (including strerror) and the meaning of isalpha, toupper, etc.
The LC_* environment variables are the parameters for the encoding.
But they are only really "parameters" at the exec() level.
This is usually the right place to specify it. It's rare that they are even set separately for the given program - usually they are per-system or per-user.
Once the program starts, the locale settings become global mutable state. I would have thought that, more than anyone else, the readership of this list would understand what's bad about that concept.
You can treat it as immutable. Just don't call setlocale with different arguments again.
Another problem with having a single locale: if a program isn't working, and you need to communicate with its developers, you will often have to run the program in an English locale just so that you will get error messages which the developers understand.
You don't need to change LC_CTYPE for that. Just set LC_MESSAGES.
Then how would a Haskell program know what encoding to use for stdout messages?
It doesn't necessarily need to. If you are using message catalogues, you just read bytes from the catalogue and write them to stdout.
gettext uses the locale to choose the encoding. Messages are internally stored as UTF-8 but emitted in the locale encoding. You are using the semantics I'm advocating without knowing that...
How would it know how to interpret filenames for graphical display?
An option menu on the file selector is one option; heuristics are another.
Heuristics won't distinguish various ISO-8859-x from each other. An option menu on the file selector is user-unfriendly because users don't want to configure it for each program separately. They want to set it in one place and expect it to work everywhere. Currently there are two such places: the locale, and G_FILENAME_ENCODING (or older G_BROKEN_FILENAMES) for glib. It's unwise to introduce yet another convention, and it would be a horrible idea to make it per-program.
At least Gtk-1 would attempt to display the filename; you would get the odd question mark but at least you could select the file;
Gtk+2 also attempts to display the filename. It can be opened even though the filename has inconvertible characters escaped.
The "current locale" mechanism is just a way of avoiding the issues as much as possible when you can't get away with avoiding them altogether.
It's a way to communicate the encoding of the terminal, filenames, strerror, gettext etc.
Unicode has been described (accurately, IMHO) as "Esperanto for computers". Both use the same approach to try to solve essentially the same problem. And both will be about as successful in the long run.
Unicode has no viable competition. Esperanto had English. -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

Marcin 'Qrczak' Kowalczyk wrote:
If you provide "wrapper" functions which take String arguments, either they should have an encoding argument or the encoding should be a mutable per-terminal setting.
There is already a mutable setting. It's called "locale".
It isn't a per-terminal setting.
A separate setting would force users to configure an encoding just for the purposes of Haskell programs, as if the configuration wasn't already too fragmented.
encoding <- localeEncoding
Curses.setupTerm encoding handle

Not a big deal.
It's unwise to propose a new standard when an existing standard works well enough.
Existing standard? The standard curses API deals with bytes; encodings don't come into it. AFAIK, the wide-character curses API isn't yet a standard.
It is possible for curses to be used with a terminal which doesn't use the locale's encoding.
No, it will break under the new wide character curses API,
Or expose the fact that the WC API is broken, depending upon your POV.
It's the only curses API which allows writing full-screen programs in UTF-8 mode.
All the more reason to fix it. And where does UTF-8 come into it? I would have expected it to use wide characters throughout.
Also, it's quite common to use non-standard encodings with terminals (e.g. codepage 437, which has graphic characters beyond the ACS_* set which terminfo understands).
curses don't support that.
Sure it does. You pass the appropriate bytes to waddstr() etc and they get sent to the terminal as-is.
It doesn't support that, and it will switch the terminal mode to the "user" encoding (which is usually ISO-8859-x) at the first opportunity, e.g. after an ACS_* macro was used, or maybe even at initialization.
curses support two families of encodings: the current locale encoding and ACS. The locale encoding may be UTF-8 (works only with wide character API).
I'm talking about standard (XSI) curses, which will just pass printable (non-control) bytes straight to the terminal. If your terminal uses CP437 (or some other non-standard encoding), you can just pass the appropriate bytes to waddstr() etc and the corresponding characters will appear on the terminal. ACS_* codes are a completely separate issue; they allow you to use line graphics in addition to a full 8-bit character set (e.g. ISO-8859-1). If you only need ASCII text, you can use the other 128 codes for graphics characters and never use the ACS_* macros or the "acsc" capability.
For compatibility the default locale is "C", but new programs which are prepared for I18N should do setlocale(LC_CTYPE, "") and setlocale(LC_MESSAGES, "").
In practice, you end up continuously calling setlocale(LC_CTYPE, "") and setlocale(LC_CTYPE, "C"), depending upon whether the text is meant to be human-readable (locale-dependent) or a machine-readable format (locale-independent, i.e. "C" locale).
I wrote LC_CTYPE, not LC_ALL. LC_CTYPE doesn't affect %f formatting; it only affects the encoding of texts emitted by gettext (including strerror) and the meaning of isalpha, toupper, etc.
Sorry, I'm confusing two cases here. With LC_CTYPE, the main reason for continuous switching is when using wcstombs(). printf() uses LC_NUMERIC, which is switched between the "C" locale and the user's locale.
Once the program starts, the locale settings become global mutable state. I would have thought that, more than anyone else, the readership of this list would understand what's bad about that concept.
You can treat it as immutable. Just don't call setlocale with different arguments again.
Which limits you to a single locale. If you are using the locale's encoding, that limits you to a single encoding.
Another problem with having a single locale: if a program isn't working, and you need to communicate with its developers, you will often have to run the program in an English locale just so that you will get error messages which the developers understand.
You don't need to change LC_CTYPE for that. Just set LC_MESSAGES.
I'm starting to think that you're misunderstanding on purpose. Again. The point is that a single program often generates multiple streams of text, possibly for different "audiences" (e.g. humans and machines). Different streams may require different conventions (encodings, numeric formats, collating orders), but may use the same functions. Those functions need to obtain the conventions from somewhere, and that means either parameters or state. Having dealt with state (libc's locale mechanism), I would rather have parameters.
Then how would a Haskell program know what encoding to use for stdout messages?
It doesn't necessarily need to. If you are using message catalogues, you just read bytes from the catalogue and write them to stdout.
gettext uses the locale to choose the encoding. Messages are internally stored as UTF-8 but emitted in the locale encoding.
It didn't use to be that way, but I can see why they would have changed it (a single catalogue for encoding variants of a given locale).
How would it know how to interpret filenames for graphical display?
An option menu on the file selector is one option; heuristics are another.
Heuristics won't distinguish various ISO-8859-x from each other.
You treat the locale's encoding as a heuristic. If it looks like ISO-8859-x, and the locale's encoding is ISO-8859-x, you use that. If it looks like Shift-JIS, you don't complain and give up just because the locale is UTF-8.
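A sketch of the simplest such test, assuming the filename is available as raw bytes: check whether it is structurally valid UTF-8 before falling back to other guesses (this is only a heuristic; overlong forms and surrogates are not rejected):

  import Data.Word (Word8)

  looksLikeUTF8 :: [Word8] -> Bool
  looksLikeUTF8 [] = True
  looksLikeUTF8 (b:bs)
    | b < 0x80              = looksLikeUTF8 bs   -- ASCII byte
    | b >= 0xC0 && b < 0xE0 = cont 1 bs          -- 2-byte sequence
    | b >= 0xE0 && b < 0xF0 = cont 2 bs          -- 3-byte sequence
    | b >= 0xF0 && b < 0xF5 = cont 3 bs          -- 4-byte sequence
    | otherwise             = False
    where
      cont :: Int -> [Word8] -> Bool
      cont 0 rest = looksLikeUTF8 rest
      cont n (c:rest) | c >= 0x80 && c < 0xC0 = cont (n - 1) rest
      cont _ _ = False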
An option menu on the file selector is user-unfriendly because users don't want to configure it for each program separately. They want to set it in one place and expect it to work everywhere.
Nothing will work everywhere. An option menu allows the user to force the encoding for individual cases when whatever other mechanism(s) you use get it wrong. I've needed to use Mozilla's "View -> Character Encoding" menu enough times when the browser's guess turned out to be wrong (and blindly honouring the charset specified by HTTP's Content-Type: or HTML's META tags would be a disaster).
At least Gtk-1 would attempt to display the filename; you would get the odd question mark but at least you could select the file;
Gtk+2 also attempts to display the filename. It can be opened even though the filename has inconvertible characters escaped.
This isn't my experience; I just get messages like:

Gtk-Message: The filename "\377.ppm" couldn't be converted to UTF-8. (try setting the environment variable G_FILENAME_ENCODING): Invalid byte sequence in conversion input

and the filename is omitted altogether.
The "current locale" mechanism is just a way of avoiding the issues as much as possible when you can't get away with avoiding them altogether.
It's a way to communicate the encoding of the terminal, filenames, strerror, gettext etc.
It's *a* way, but it's not a very good way. It sucks when you can't apply a single convention to everything.
Unicode has been described (accurately, IMHO) as "Esperanto for computers". Both use the same approach to try to solve essentially the same problem. And both will be about as successful in the long run.
Unicode has no viable competition.
There are two viable alternatives. Byte strings with associated
encodings and ISO-2022. In CJK environments, ISO-2022 is still far
more widespread than UTF-8, and will likely remain so for the
foreseeable future. And byte strings with associated encodings are
probably still the most common of all.
--
Glynn Clements

Glynn Clements
A separate setting would force users to configure an encoding just for the purposes of Haskell programs, as if the configuration wasn't already too fragmented.
encoding <- localeEncoding
Curses.setupTerm encoding handle
In a properly configured system, curses is always supposed to be used like this. That is, it can as well use the locale encoding directly, without complicating the API. I don't want to force bindings to be implemented like this, but to allow it, because it's a good default.
It's unwise to propose a new standard when an existing standard works well enough.
Existing standard? The standard curses API deals with bytes; encodings don't come into it. AFAIK, the wide-character curses API isn't yet a standard.
It's described in the Single Unix Spec along with the narrow character version (but in an earlier version; the newest version doesn't describe curses at all). But I meant a standard for communicating the encoding of the terminal to programs. If programs are supposed to check the locale to determine that, it can be done automatically by bindings to readline & curses.
Or expose the fact that the WC API is broken, depending upon your POV.
It's the only curses API which allows writing full-screen programs in UTF-8 mode.
All the more reason to fix it.
And where does UTF-8 come into it? I would have expected it to use wide characters throughout.
The wide character API works with any encoding. The narrow character API works only with encodings where one byte corresponds to one character. (In the wide character API wchar_t doesn't have to correspond to one character cell; combining characters are attached to base characters, and some characters are double-wide.)
I'm talking about standard (XSI) curses, which will just pass printable (non-control) bytes straight to the terminal. If your terminal uses CP437 (or some other non-standard encoding), you can just pass the appropriate bytes to waddstr() etc and the corresponding characters will appear on the terminal.
Which terminal uses CP437? Linux console doesn't, except temporarily after switching the mapping to builtin CP437 (but this state is not used by curses) or after loading CP437 as the user map (nobody does this, and it won't work properly with all characters from the range 0x80-0x9F anyway).
You can treat it as immutable. Just don't call setlocale with different arguments again.
Which limits you to a single locale. If you are using the locale's encoding, that limits you to a single encoding.
There is no support for changing the encoding of a terminal on the fly by programs running inside it.
The point is that a single program often generates multiple streams of text, possibly for different "audiences" (e.g. humans and machines). Different streams may require different conventions (encodings, numeric formats, collating orders), but may use the same functions.
A single program has a single stdout and a single filesystem. The contexts which use the locale encoding don't need multiple encodings. Multiple encodings are needed e.g. for exchanging data with other machines over the network, for reading the contents of text files after the user has specified an encoding explicitly, etc. In these cases an API with an explicitly provided encoding should be used.
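A sketch of what such an API could look like, with the decoder passed explicitly rather than taken from global locale state (hGetContentsWith and latin1Decode are made-up names, not an existing library):

  import System.IO

  type Decoder = String -> String   -- from bytes (one Char per byte) to text

  latin1Decode :: Decoder
  latin1Decode = id                 -- Latin-1 bytes are already the code points

  hGetContentsWith :: Decoder -> Handle -> IO String
  hGetContentsWith dec h = do
    hSetBinaryMode h True           -- read raw bytes, no implicit conversion
    fmap dec (hGetContents h)

  main :: IO ()
  main = hGetContentsWith latin1Decode stdin >>= putStr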
Gtk+2 also attempts to display the filename. It can be opened even though the filename has inconvertible characters escaped.
This isn't my experience; I just get messages like:
Gtk-Message: The filename "\377.ppm" couldn't be converted to UTF-8. (try setting the environment variable G_FILENAME_ENCODING): Invalid byte sequence in conversion input
and the filename is omitted altogether.
Works for me, e.g. in gedit-2.8.2. The filename is displayed with escapes like \377 and can be opened.
The "current locale" mechanism is just a way of avoiding the issues as much as possible when you can't get away with avoiding them altogether.
It's a way to communicate the encoding of the terminal, filenames, strerror, gettext etc.
It's *a* way, but it's not a very good way. It sucks when you can't apply a single convention to everything.
It's not so bad as to justify inventing our own conventions and forcing users to configure the encoding of Haskell programs separately.
Unicode has no viable competition.
There are two viable alternatives. Byte strings with associated encodings and ISO-2022.
ISO-2022 is an insanely complicated brain-damaged mess. I know it's being used in some parts of the world, but the sooner it dies, the better. Byte strings with associated encodings coexist with Unicode and are slowly being replaced by it, by using UTF-8 as the encoding more often. -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

On Sat, Mar 19, 2005 at 12:55:54PM +0100, Marcin 'Qrczak' Kowalczyk wrote:
Glynn Clements writes:
The point is that a single program often generates multiple streams of text, possibly for different "audiences" (e.g. humans and machines). Different streams may require different conventions (encodings, numeric formats, collating orders), but may use the same functions.
A single program has a single stdout and a single filesystem. The contexts which use the locale encoding don't need multiple encodings.
That's not true, there could be many filesystems, each of which uses a different encoding for the filenames. In the case of removable media, this scenario isn't even unlikely. -- David Roundy http://www.darcs.net

David Roundy wrote:
That's not true, there could be many filesystems, each of which uses a different encoding for the filenames. In the case of removable media, this scenario isn't even unlikely.
I agree - I can quite easily see the situation occurring where a student (say from Japan) brings in a zip-disk or USB key formatted with a Japanese filename encoding that I need to read on my computer (with a UK locale). Also, can different windows have different encodings? I might have a web browser (written in Haskell?) running and have windows with several different encodings open at the same time, whilst saving things on filesystems with differing encodings. Keean.

On Sat, 19 Mar 2005, David Roundy wrote:
That's not true, there could be many filesystems, each of which uses a different encoding for the filenames. In the case of removable media, this scenario isn't even unlikely.
The nearest desktop machine to me right now has in its directory structure filesystems that use different encodings. So, yes, it's probably not all that rare. Mark. -- Haskell vacancies in Columbus, Ohio, USA: see http://www.aetion.com/jobs.html

Marcin 'Qrczak' Kowalczyk wrote:
I'm talking about standard (XSI) curses, which will just pass printable (non-control) bytes straight to the terminal. If your terminal uses CP437 (or some other non-standard encoding), you can just pass the appropriate bytes to waddstr() etc and the corresponding characters will appear on the terminal.
Which terminal uses CP437?
Most software terminal emulators can use any encoding. Traditional comms packages tend to support this (including their own "VGA" font if necessary) because of its widespread use on BBSes which were targeted at MS-DOS systems. There exist hardware terminals (I can't name specific models, but I have seen them in use) which support this, specifically for use with MS-DOS systems.
Linux console doesn't, except temporarily after switching the mapping to builtin CP437 (but this state is not used by curses) or after loading CP437 as the user map (nobody does this, and it won't work properly with all characters from the range 0x80-0x9F anyway).
I *still* encounter programs written for the linux console which assume that the built-in CP437 font is being used (if you use an ISO-8859-1 font, you get dialogs with accented characters where you would expect line-drawing characters).
You can treat it as immutable. Just don't call setlocale with different arguments again.
Which limits you to a single locale. If you are using the locale's encoding, that limits you to a single encoding.
There is no support for changing the encoding of a terminal on the fly by programs running inside it.
If you support multiple terminals with different encodings, and the library uses the global locale settings to determine the encoding, you need to switch locale every time you write to a different terminal.
The point is that a single program often generates multiple streams of text, possibly for different "audiences" (e.g. humans and machines). Different streams may require different conventions (encodings, numeric formats, collating orders), but may use the same functions.
A single program has a single stdout and a single filesystem. The contexts which use the locale encoding don't need multiple encodings.
Multiple encodings are needed e.g. for exchanging data with other machines over the network, for reading the contents of text files after the user has specified an encoding explicitly, etc. In these cases an API with an explicitly provided encoding should be used.
An API which is used for reading and writing text files or sockets is just as applicable to stdin/stdout.
The "current locale" mechanism is just a way of avoiding the issues as much as possible when you can't get away with avoiding them altogether.
It's a way to communicate the encoding of the terminal, filenames, strerror, gettext etc.
It's *a* way, but it's not a very good way. It sucks when you can't apply a single convention to everything.
It's not so bad as to justify inventing our own conventions and forcing users to configure the encoding of Haskell programs separately.
I'm not suggesting inventing conventions. I'm suggesting leaving such issues to the application programmer who, unlike the library programmer, probably has enough context to be able to reliably determine the correct encoding in any specific instance.
Unicode has no viable competition.
There are two viable alternatives. Byte strings with associated encodings and ISO-2022.
ISO-2022 is an insanely complicated brain-damaged mess. I know it's being used in some parts of the world, but the sooner it dies, the better.
ISO-2022 has advantages and disadvantages relative to UTF-8. I don't
want to go on about the specifics here because they aren't
particularly relevant. What's relevant is that it isn't likely to
disappear any time soon.
A large part of the world already has a universal encoding which works
well enough; they don't *need* UTF-8, and aren't going to rebuild
their IT infrastructure from scratch for the sake of it.
--
Glynn Clements

On Sat, Mar 19, 2005 at 03:04:04PM +0000, Glynn Clements wrote:
I'm not suggesting inventing conventions. I'm suggesting leaving such issues to the application programmer who, unlike the library programmer, probably has enough context to be able to reliably determine the correct encoding in any specific instance.
But the whole point of Foreign.C.String is to interface to existing C code. And one of the most common conventions of said interfaces is to represent strings in the current locale, which is why locale-honoring conversion routines are useful. I don't think anyone is arguing that this is the be-all and end-all of charset conversion, far from it. A general conversion library and parameterized conversion routines are also needed for many of the reasons you said, and will probably appear at some point. I have my own iconv interface which I used for my initial implementation of with/peekCString etc., and I am sure other people have written their own; eventually one will be standardized. A general conversion facility has been on the wishlist for a long time. However, at the moment, the FFI is tackling a much simpler goal of interfacing with existing C code, and non-parameterized locale-honoring conversion routines are extremely useful for that. Even if we had a nice generalized conversion routine, a simple locale-honoring front end would be a very useful interface because it is so commonly needed when interfacing to C code. However, I am sure everyone would be happy if a nice cabalized general charset conversion library appeared... I have the start of one here, which should work on any POSIXy system, even if wchar_t is not Unicode (no Windows support though): http://repetae.net/john/recent/out/HsLocale.html John -- John Meacham - ⑆repetae.net⑆john⑈

John Meacham wrote:
I'm not suggesting inventing conventions. I'm suggesting leaving such issues to the application programmer who, unlike the library programmer, probably has enough context to be able to reliably determine the correct encoding in any specific instance.
But the whole point of Foreign.C.String is to interface to existing C code, and one of the most common conventions of said interfaces is to represent strings in the current locale, which is why locale-honoring conversion routines are useful.
My point is that most C functions which accept or return char*s will
work regardless of whether those char*s can be decoded according to
the current locale. E.g.
    while (d = readdir(dir), d)
    {
        stat(d->d_name, &st);
        ...
    }
will stat() every filename in the directory regardless of whether or
not the filenames are valid in the locale's encoding.
The Haskell equivalent using FilePath (i.e. String),
getDirectoryContents etc currently only works because the char* <->
String conversions are hardcoded to ISO-8859-1, which is infallible
and reversible. If it used e.g. UTF-8, it would fail on any filename
which wasn't valid UTF-8 even though it never actually needs to know
the string of characters which the filename represents.
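[For comparison, a rough Haskell analogue of that loop (a sketch only, using getDirectoryContents and System.Posix.Files from the unix package) never needs to interpret the names either; it just hands each FilePath straight back to the OS:

--------------------
import System.Directory (getDirectoryContents)
import System.Posix.Files (getFileStatus, fileSize)

main :: IO ()
main = do
  names <- getDirectoryContents "."
  mapM_ statOne names
  where
    statOne name = do
      -- 'name' is only handed straight back to stat(); its encoding never
      -- matters, as long as the char* <-> String conversion round-trips.
      st <- getFileStatus name
      print (fileSize st)
--------------------
]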
The same applies to reading filenames from argv[] and passing them to
open() etc. This is one of the most common idioms in Unix programming,
and it doesn't care about encodings at all. Again, it would cease to
work reliably in Haskell if the automatic char* <-> String conversions
in getArgs etc started using the locale.
I'm not arguing about *how* char* <-> String conversions should be
performed so much as arguing about *whether* these conversions should
be performed. The conversion issues are only problems because the
conversions are being done at all.
--
Glynn Clements

On Wed, Mar 16, 2005 at 05:13:25PM +0000, Glynn Clements wrote:
Marcin 'Qrczak' Kowalczyk wrote:
It doesn't affect functions added by the hierarchical libraries, i.e. those functions are safe only with the ASCII subset. (There is a vague plan to make Foreign.C.String conform to the FFI spec, which mandates locale-based encoding, and thus would change all those, but it's still up in the air.)
Hmm. I'm not convinced that automatically converting to the current locale is the ideal behaviour (it'd certainly break all my programs!). Certainly a function for converting into the encoding of the current locale would be useful for many users, but it's important to be able to know the encoding with certainty.
It should only be the default, not the only option.
I'm not sure that it should be available at all.
It should be possible to specify the encoding explicitly.
Conversely, it shouldn't be possible to avoid specifying the encoding explicitly.
Personally, I wouldn't provide an all-in-one "convert String to CString using locale's encoding" function, just in case anyone was tempted to actually use it.
But this is exactly what is needed for most C library bindings, which is why I had to write my own and proposed it to the FFI. Most C libraries expect char * to be in the standard encoding of the current locale. When a binding explicitly uses another encoding, then great, we can use different marshaling functions.

In any case, we need tools to be able to conform to the common cases of ASCII-only (withCAString) and current locale (withCString). withUTF8String would be a nice addition, but it is much less important for it to come standard, as it can easily be written by end users, unlike the locale-specific versions, which are necessarily system dependent.

John
--
John Meacham - ⑆repetae.net⑆john⑈
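[For what it's worth, here is a rough sketch of the kind of withUTF8String an end user could write on top of the existing FFI marshalling functions. This is not proposed API, just an illustration; the hand-rolled encoder below does no validity checks (surrogates, etc.):

--------------------
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)
import Foreign.C.String (CString)
import Foreign.Marshal.Array (withArray0)
import Foreign.Ptr (castPtr)

-- Encode one Char as UTF-8 bytes (sketch only, no validity checks).
encodeCharUTF8 :: Char -> [Word8]
encodeCharUTF8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. top 6, cont 0]
  | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    top s  = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Marshal a String as a NUL-terminated UTF-8 C string.
withUTF8String :: String -> (CString -> IO a) -> IO a
withUTF8String s act =
  withArray0 0 (concatMap encodeCharUTF8 s) (act . castPtr)
--------------------

The locale-honoring versions, by contrast, need nl_langinfo/iconv or the mbstowcs family, which is why they have to come with the system.]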

John Meacham
In any case, we need tools to be able to conform to the common cases of ASCII-only (withCAString) and current locale (withCString).
withUTF8String would be a nice addition, but it is much less important for it to come standard, as it can easily be written by end users, unlike the locale-specific versions, which are necessarily system dependent.
IMHO the encoding should be a parameter of an extended variant of withCString (and peekCString etc.). We need a framework for implementing encoders/decoders first.

A problem with designing the framework is that it should support both pure Haskell conversions and C functions like iconv which work on arrays. We must also provide a way to signal errors.

A bonus is a way to handle errors coming from another recoder without causing it to fail completely. That way one could add a fallback for unrepresentable characters, e.g. HTML entities or approximations with stripped accents.

--
   __("<  Marcin Kowalczyk
   \__/   qrczak@knm.org.pl
    ^^    http://qrnik.knm.org.pl/~qrczak/
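[To make the shape of such an API concrete, something along these lines might work. All of the names here are hypothetical, nothing like this exists in the libraries; the stub bodies are placeholders:

--------------------
import Foreign.C.String (CString)

-- Hypothetical encoding descriptor; could be backed by pure Haskell
-- coders or by iconv working on arrays.
data Encoding = Latin1 | UTF8 | LocaleEncoding

-- What to do when a byte sequence can't be decoded (or a Char can't be
-- encoded): fail, substitute, or fall back to user code, e.g. HTML
-- entities or accent stripping.
data CodingFailure
  = ErrorOnFailure
  | ReplaceOnFailure Char
  | FallbackOnFailure (Char -> String)

withCStringEnc :: Encoding -> CodingFailure
               -> String -> (CString -> IO a) -> IO a
withCStringEnc _ _ _ _ = error "sketch only"

peekCStringEnc :: Encoding -> CodingFailure -> CString -> IO String
peekCStringEnc _ _ _ = error "sketch only"
--------------------
]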

John Meacham wrote:
It doesn't affect functions added by the hierarchical libraries, i.e. those functions are safe only with the ASCII subset. (There is a vague plan to make Foreign.C.String conform to the FFI spec, which mandates locale-based encoding, and thus would change all those, but it's still up in the air.)
Hmm. I'm not convinced that automatically converting to the current locale is the ideal behaviour (it'd certainly break all my programs!). Certainly a function for converting into the encoding of the current locale would be useful for many users, but it's important to be able to know the encoding with certainty.
It should only be the default, not the only option.
I'm not sure that it should be available at all.
It should be possible to specify the encoding explicitly.
Conversely, it shouldn't be possible to avoid specifying the encoding explicitly.
Personally, I wouldn't provide an all-in-one "convert String to CString using locale's encoding" function, just in case anyone was tempted to actually use it.
But this is exactly what is needed for most C library bindings.
I very much doubt that "most" is accurate.
C functions which take a "char*" fall into three main cases:
1. Unspecified encoding, i.e. it's a string of bytes, not characters.
2. Locale's encoding, as determined by nl_langinfo(CODESET);
essentially, whatever was set with setlocale(LC_CTYPE), defaulting to
C/POSIX if setlocale() hasn't been called.
3. Fixed encoding, e.g. UTF-8, ISO-2022, US-ASCII (or EBCDIC on IBM
mainframes).
Historically, library functions have tended to fall into category 1
unless they *need* to know the interpretation of a given byte or
sequence of bytes (e.g.

Glynn Clements
E.g. Gtk-2.x uses UTF-8 almost exclusively, although you can force the use of the locale's encoding for filenames (if you have filenames in multiple encodings, you lose; filenames using the "wrong" encoding simply don't appear in file selectors).
Actually they do appear, even though you can't type their names from the keyboard. The name shown in the GUI used to be escaped in different ways by different programs or even different places in one program (question marks, %hex escapes, \oct escapes), but recently they added some functions to glib to make the behavior uniform.

--
   __("<  Marcin Kowalczyk
   \__/   qrczak@knm.org.pl
    ^^    http://qrnik.knm.org.pl/~qrczak/

Marcin 'Qrczak' Kowalczyk wrote:
E.g. Gtk-2.x uses UTF-8 almost exclusively, although you can force the use of the locale's encoding for filenames (if you have filenames in multiple encodings, you lose; filenames using the "wrong" encoding simply don't appear in file selectors).
Actually they do appear, even though you can't type their names from the keyboard. The name shown in the GUI used to be escaped in different ways by different programs or even different places in one program (question marks, %hex escapes, \oct escapes), but recently they added some functions to glib to make the behavior uniform.
In the last version of Gtk-2.x which I tried, "invalid" filenames are
just omitted from the list. Gtk-1.x displayed them (I think with
question marks, but it may have been a box).
I've just tried with a more recent version (2.6.2); the default
behaviour is similar, although you can now get around the issue by
using G_FILENAME_ENCODING=ISO-8859-1. Of course, if your locale is
a long way from ISO-8859-1, that isn't a particularly good solution.
The best test case would be a system used predominantly by Japanese,
where (apparently) it's common to have a mixture of both EUC-JP and
Shift-JIS filenames (occasionally wrapped in ISO-2022, but usually
raw).
--
Glynn Clements

On Wed, Mar 16, 2005 at 11:55:18AM +0000, Ross Paterson wrote:
On Wed, Mar 16, 2005 at 03:54:19AM +0000, Ian Lynagh wrote:
Do you have a list of functions which behave differently in the new release to how they did in the previous release? (I'm not interested in changes that will affect only whether something compiles, not how it behaves given it compiles both before and after).
I got lost in the negatives here. It affects all Haskell 98 primitives that do character I/O, or that exchange C strings with the C library.
In the below, it looks like there is a bug in getDirectoryContents. Also, the error from w.hs is going to stdout, not stderr. Most importantly, though: is there any way to remove this file without doing something like an FFI import of unlink? Is there anything LC_CTYPE can be set to that will act like C/POSIX but accept 8-bit bytes as chars too?

(in the POSIX locale)

$ echo 'import Directory; main = getDirectoryContents "." >>= print' > q.hs
$ runhugs q.hs
[".","..","q.hs"]
$ touch 1`printf "\xA2"`
$ runhugs q.hs
runhugs: Error occurred
ERROR - Garbage collection fails to reclaim sufficient space
$ echo 'import Directory; main = removeFile "1\xA2"' > w.hs
$ runhugs w.hs
Program error: 1?: Directory.removeFile: does not exist (file does not exist)
$ strace -o strace.out runhugs w.hs > /dev/null
$ grep unlink strace.out | head -c 14 | hexdump -C
00000000  75 6e 6c 69 6e 6b 28 22  31 3f 22 29 20 20        |unlink("1?")  |
0000000e
$ strace -o strace2.out rm 1*
$ grep unlink strace2.out | head -c 14 | hexdump -C
00000000  75 6e 6c 69 6e 6b 28 22  31 a2 22 29 20 20        |unlink("1.")  |
0000000e
$

Now consider this e.hs:

--------------------
import IO

main = do hWaitForInput stdin 10000
          putStrLn "Input is ready"
          r <- hReady stdin
          print r
          c <- hGetChar stdin
          print c
          putStrLn "Done!"
--------------------

$ { printf "\xC2\xC2\xC2\xC2\xC2\xC2\xC2"; sleep 30; } | runhugs e.hs
Input is ready
True

Program error: <stdin>: IO.hGetChar: protocol error (invalid character encoding)
$

It takes 30 seconds for this error to be printed. This shows two issues: First of all, I think you should be giving an error as soon as you have a prefix that is the start of no character. Second, hReady now only guarantees hGetChar won't block on a binary mode handle, but I guess there is not much we can do except document that (short of some hideous hacks).

Thanks
Ian

On Sat, Mar 19, 2005 at 07:14:25PM +0000, Ian Lynagh wrote:
In the below, it looks like there is a bug in getDirectoryContents.
Yes, now fixed in CVS.
Also, the error from w.hs is going to stdout, not stderr.
It's a nuisance, but no one has got around to changing it.
Most importantly, though: is there any way to remove this file without doing something like an FFI import of unlink?
Is there anything LC_CTYPE can be set to that will act like C/POSIX but accept 8-bit bytes as chars too?
en_GB.iso88591 (or indeed any .iso88591 locale) will match the old behaviour (and the GHC behaviour). Indeed it's possible to have filenames (under POSIX, anyway) that H98 programs can't touch (under Hugs). That pretty much follows from the Haskell definition FilePath = String. The other thread under this subject has touched on the need for an (additional) API using an abstract FilePath type.
Now consider this e.hs:
--------------------
import IO

main = do hWaitForInput stdin 10000
          putStrLn "Input is ready"
          r <- hReady stdin
          print r
          c <- hGetChar stdin
          print c
          putStrLn "Done!"
--------------------

$ { printf "\xC2\xC2\xC2\xC2\xC2\xC2\xC2"; sleep 30; } | runhugs e.hs
Input is ready
True

Program error: <stdin>: IO.hGetChar: protocol error (invalid character encoding)
$
It takes 30 seconds for this error to be printed. This shows two issues: First of all, I think you should be giving an error as soon as you have a prefix that is the start of no character. Second, hReady now only guarantees hGetChar won't block on a binary mode handle, but I guess there is not much we can do except document that (short of some hideous hacks).
Yes, I don't see how to avoid this when using mbtowc() to do the conversion: it makes no distinction between a bad byte sequence and an incomplete one.

On Sun, Mar 20, 2005 at 01:33:44AM +0000, ross@soi.city.ac.uk wrote:
On Sat, Mar 19, 2005 at 07:14:25PM +0000, Ian Lynagh wrote:
Most importantly, though: is there any way to remove this file without doing something like an FFI import of unlink?
Is there anything LC_CTYPE can be set to that will act like C/POSIX but accept 8-bit bytes as chars too?
en_GB.iso88591 (or indeed any .iso88591 locale) will match the old behaviour (and the GHC behaviour).
This works for me with en_GB.iso88591 (or en_GB), but not en_US.iso88591 (or en_US). My /etc/locale.gen contains:

en_GB ISO-8859-1
en_GB.ISO-8859-15 ISO-8859-15
en_GB.UTF-8 UTF-8

So is there anything that /always/ works?
Indeed it's possible to have filenames (under POSIX, anyway) that H98 programs can't touch (under Hugs). That pretty much follows from the Haskell definition FilePath = String. The other thread under this subject has touched on the need for an (additional) API using an abstract FilePath type.
Hmm. I can't say I'm convinced by all this without having something like that API.
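[To make the suggestion less abstract, here is a sketch of what such a FilePath API might look like on POSIX. The names are purely hypothetical and the bodies are placeholders; the point is that decoding is the only place an encoding enters the picture, and it is allowed to fail:

--------------------
import Data.Word (Word8)

-- On POSIX a filename is just a sequence of non-NUL bytes.
newtype PosixPath = PosixPath [Word8]

-- Hypothetical: decode for display using a named encoding; may fail.
decodeFileName :: String -> PosixPath -> Maybe String
decodeFileName _encoding (PosixPath _bytes) = error "sketch only"

-- Hypothetical: the file primitives would take the abstract type, so a
-- program like the removeFile example above never needs to decode.
removeFileP :: PosixPath -> IO ()
removeFileP _ = error "sketch only"
--------------------
]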
Yes, I don't see how to avoid this when using mbtowc() to do the conversion: it makes no distinction between a bad byte sequence and an incomplete one.
Perhaps you could use mbrtowc instead? My manpage says:

    If the n bytes starting at s do not contain a complete multibyte
    character, mbrtowc returns (size_t)(-2). This can happen even if
    n >= MB_CUR_MAX, if the multibyte string contains redundant shift
    sequences.

    If the multibyte string starting at s contains an invalid multibyte
    sequence before the next complete character, mbrtowc returns
    (size_t)(-1) and sets errno to EILSEQ. In this case, the effects on
    *ps are undefined.

For both functions my manpage says:

    CONFORMING TO
        ISO/ANSI C, UNIX98

Thanks
Ian

One thing I don't like about this automatic conversion is that it is hidden magic - and could catch people out. Let's say I don't want to use it... How can I do the following (ie what are the new API calls):

Open a file with a name that is invalid in the current locale (say a zip disc from a computer with a different locale setting).

Open a file with contents in an unknown encoding.

What are the new binary API calls for file IO?

What type is returned from 'getChar' on a binary file? Should it even be called getChar? What about getWord8 (getWord16, getWord32 etc...)?

Does the encoding translation occur just on the filename or the contents as well? What if I have an encoded filename with binary contents and vice-versa?

Keean.

(I guess I now have to rewrite a lot of file IO code!)

On Sun, Mar 20, 2005 at 12:59:52PM +0000, Keean Schupke wrote:
How can I do the following (ie what are the new API calls):
Open a file with a name that is invalid in the current locale (say a zip disc from a computer with a different locale setting).
A new API is needed for this.
Open a file with contents in an unknown encoding.
What are the new binary API calls for file IO?
see System.IO
What type is returned from 'getChar' on a binary file. Should it even be called getChar? what about getWord8 (getWord16, getWord32 etc...)
Char, of course. And yes, it's not ideal. There's also a byte array interface.
(I guess I now have to rewrite a lot of file IO code!)
If it was doing binary I/O on H98 Handles, it already needed rewriting. There's nothing to be done for filenames until a new API emerges.
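[A small example of what that currently looks like in practice (a sketch only; the file name is made up): each Char read from a binary Handle is really a byte in the range 0-255, which you can turn into Word8s yourself.

--------------------
import Data.Char (ord)
import Data.Word (Word8)
import System.IO

main :: IO ()
main = do
  h <- openBinaryFile "input.dat" ReadMode  -- or hSetBinaryMode stdin True
  s <- hGetContents h
  let bytes = map (fromIntegral . ord) s :: [Word8]
  print (take 16 bytes)
--------------------
]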

On Sun, Mar 20, 2005 at 04:34:12AM +0000, Ian Lynagh wrote:
On Sun, Mar 20, 2005 at 01:33:44AM +0000, ross@soi.city.ac.uk wrote:
On Sat, Mar 19, 2005 at 07:14:25PM +0000, Ian Lynagh wrote:
Is there anything LC_CTYPE can be set to that will act like C/POSIX but accept 8-bit bytes as chars too?
en_GB.iso88591 (or indeed any .iso88591 locale) will match the old behaviour (and the GHC behaviour).
This works for me with en_GB.iso88591 (or en_GB), but not en_US.iso88591 (or en_US). My /etc/locale.gen contains:
en_GB ISO-8859-1
en_GB.ISO-8859-15 ISO-8859-15
en_GB.UTF-8 UTF-8
So is there anything that /always/ works?
Since systems may have no locale other than C/POSIX, no.
Yes, I don't see how to avoid this when using mbtowc() to do the conversion: it makes no distinction between a bad byte sequence and an incomplete one.
Perhaps you could use mbrtowc instead?
Indeed. Thanks for pointing it out.

On Tue, Mar 15, 2005 at 10:44:28AM +0000, Ross Paterson wrote:
You can select binary I/O using the openBinaryFile and hSetBinaryMode functions from System.IO. After that, the Chars you get from that Handle are actually bytes.
What about the ones sent to it? Are all the following results intentional? Am I doing something stupid?

[in brief: hugs' (hPutStr h) now behaves differently to (mapM_ (hPutChar h)), and ghc writes the empty string for both when told to write "\128"]

Running the following with new ghc 6.4 and hugs 20050308 or 20050317:

echo 'import System.IO; import System.Environment; main = do [o] <- getArgs; ho <- openBinaryFile o WriteMode; hPutStr ho "\128"' > run1.hs
echo 'import System.IO; import System.Environment; main = do [o] <- getArgs; ho <- openBinaryFile o WriteMode; mapM_ (hPutChar ho) "\128"' > run2.hs
runhugs run1.hs hugs1
runhugs run2.hs hugs2
runghc run1.hs ghc1
runghc run2.hs ghc2
ls -l hugs1 hugs2 ghc1 ghc2
for f in hugs1 hugs2 ghc1 ghc2; do echo $f; hexdump -C $f; done

gives:

-rw-r--r--  1 igloo igloo 0 Mar 17 06:15 ghc1
-rw-r--r--  1 igloo igloo 0 Mar 17 06:15 ghc2
-rw-r--r--  1 igloo igloo 1 Mar 17 06:15 hugs1
-rw-r--r--  1 igloo igloo 1 Mar 17 06:15 hugs2
hugs1
00000000  3f  |?|
00000001
hugs2
00000000  80  |.|
00000001
ghc1
ghc2

With ghc 6.2.2 and hugs "November 2003" I get:

-rw-r--r--  1 igloo igloo 1 Mar 17 06:16 ghc1
-rw-r--r--  1 igloo igloo 1 Mar 17 06:16 ghc2
-rw-r--r--  1 igloo igloo 1 Mar 17 06:16 hugs1
-rw-r--r--  1 igloo igloo 1 Mar 17 06:16 hugs2
hugs1
00000000  80  |.|
00000001
hugs2
00000000  80  |.|
00000001
ghc1
00000000  80  |.|
00000001
ghc2
00000000  80  |.|
00000001

Incidentally, "make check" in CVS hugs said:

cd tests && sh testScript | egrep -v '^--( |-----)'
./../src/hugs +q -w -pHugs: static/mod154.hs < /dev/null
expected stdout not matched by reality
*** static/Loaded.output    Fri Jul 19 22:41:51 2002
--- /tmp/runtest11949.3     Thu Mar 17 05:46:05 2005
***************
*** 1,2 ****
! Type :? for help
  Hugs:[Leaving Hugs]
--- 1,3 ----
! ERROR "static/mod154.hs" - Conflicting exports of entity "sort"
! *** Could refer to Data.List.sort or M.sort
  Hugs:[Leaving Hugs]

Thanks
Ian

On Thu, Mar 17, 2005 at 06:22:25AM +0000, Ian Lynagh wrote:
[in brief: hugs' (hPutStr h) now behaves differently to (mapM_ (hPutChar h)), and ghc writes the empty string for both when told to write "\128"]
Ah, Malcolm's commit messages have just reminded me of the finaliser changes requiring hFlushes in new ghc, so it's just the hugs output that confuses me now.

Thanks
Ian
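[In other words, the GHC 6.4 part of the puzzle goes away if the test programs flush or close the handle themselves rather than relying on a finaliser at exit. A rough sketch of run1.hs adjusted accordingly:

--------------------
import System.Environment (getArgs)
import System.IO

main :: IO ()
main = do
  [o] <- getArgs
  ho  <- openBinaryFile o WriteMode
  hPutStr ho "\128"
  hClose ho   -- flush explicitly; don't rely on a finaliser at exit
--------------------
]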

On Thu, Mar 17, 2005 at 06:22:25AM +0000, Ian Lynagh wrote:
On Tue, Mar 15, 2005 at 10:44:28AM +0000, Ross Paterson wrote:
You can select binary I/O using the openBinaryFile and hSetBinaryMode functions from System.IO. After that, the Chars you get from that Handle are actually bytes.
What about the ones sent to it? Are all the following results intentional? Am I doing something stupid?
No, I was. Output primitives other than hPutChar were ignoring binary mode (and Hugs has more of these things as primitives than GHC does). Now fixed in CVS (rev. 1.95 of src/char.c).

On Thu, Mar 17, 2005 at 06:22:25AM +0000, Ian Lynagh wrote:
Incidentally, "make check" in CVS hugs said:
cd tests && sh testScript | egrep -v '^--( |-----)'
./../src/hugs +q -w -pHugs: static/mod154.hs < /dev/null
expected stdout not matched by reality
*** static/Loaded.output    Fri Jul 19 22:41:51 2002
--- /tmp/runtest11949.3     Thu Mar 17 05:46:05 2005
***************
*** 1,2 ****
! Type :? for help
  Hugs:[Leaving Hugs]
--- 1,3 ----
! ERROR "static/mod154.hs" - Conflicting exports of entity "sort"
! *** Could refer to Data.List.sort or M.sort
  Hugs:[Leaving Hugs]
This is a documented bug (though the notes in tests ought to mention this too).
participants (11)
- David Roundy
- Duncan Coutts
- Glynn Clements
- Ian Lynagh
- John Goerzen
- John Meacham
- Keean Schupke
- Marcin 'Qrczak' Kowalczyk
- Mark Carroll
- Ross Paterson
- ross@soi.city.ac.uk