Output character encoding for ghc on OpenBSD

Matthias Kilian

18 Apr 2010

2:01 p.m.

Hi,

as some of you may know, I'm working on an update of OpenBSD's ghc port to 6.12.2, currently chasing down the last remaining testsuite failures. Yesterday, I ran into a problem which I have a fix for, but only a really ugly fix, and I need some opinions on what users would prefer.

The problem is that Haskell uses Unicode characters internally (ghc itself uses UTF-32 internally, where the endianness depends on the architecture it's running on), and that any Haskell program (including ghc and ghci) has to convert between the internal representation and the actual locale settings of the system it's running on. Unfortunately, OpenBSD is really bad when it comes to locale support; the only supported locales are the C and the POSIX locales, so even if you set LC_ALL or LC_CTYPE to something like, for example, de_DE.iso88591, this has no effect on OpenBSD.

Anyway, the short story is that I have to either hard-code the character set to something like utf-8, or ghc will start to behave really strangely (for example, ghci would terminate immediately if you just *type* a non-ASCII character).

So what would you prefer?

- Use utf-8 and only utf-8 (i.e. hardcoded)?
- Use something like iso-8859-15 (hardcoded)?
- Make it configurable via some non-standard environment variable (GHC_CODESET, for example). If so, what should be the default if the environment variable isn't set? Back to 7 bit (ASCII)? utf-8? Some of the latin variants?

Your suggestions are appreciated. Thanks in advance.

Ciao,
Kili
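The remark about UTF-32 and endianness can be made concrete with a small sketch. It is not taken from ghc; the function names are made up for illustration. It only shows how one code point (here U+00A9, the copyright sign) is laid out as four bytes in each byte order, and how a program can probe which order the host uses.

```c
#include <stdint.h>

/* Write code point cp as 4 bytes, least significant byte first (UTF-32LE). */
void utf32le_bytes(uint32_t cp, unsigned char out[4])
{
    out[0] = (unsigned char)(cp & 0xff);
    out[1] = (unsigned char)((cp >> 8) & 0xff);
    out[2] = (unsigned char)((cp >> 16) & 0xff);
    out[3] = (unsigned char)((cp >> 24) & 0xff);
}

/* Write code point cp as 4 bytes, most significant byte first (UTF-32BE). */
void utf32be_bytes(uint32_t cp, unsigned char out[4])
{
    out[0] = (unsigned char)((cp >> 24) & 0xff);
    out[1] = (unsigned char)((cp >> 16) & 0xff);
    out[2] = (unsigned char)((cp >> 8) & 0xff);
    out[3] = (unsigned char)(cp & 0xff);
}

/* Return 1 if the host stores the low byte of an integer first, else 0. */
int host_is_little_endian(void)
{
    uint32_t probe = 1;
    return *(unsigned char *)&probe == 1;
}
```

For U+00A9 the little-endian layout is a9 00 00 00 and the big-endian layout is 00 00 00 a9, so a converter has to be told which of the two it is reading.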


Judah Jacobson

18 Apr

5:53 p.m.

On Sun, Apr 18, 2010 at 7:01 AM, Matthias Kilian wrote:


> Anyway, the short story is that I have to either hard-code the character set to something like utf-8, or ghc will start to behave really strangely (for example, ghci would terminate immediately if you just *type* a non-ASCII character).

That sounds like it might be something to do with the haskeline package, which ghci uses for user interaction. Haskeline makes its own FFI calls to translate raw input bytes into Unicode Chars.

Can you elaborate further on what exactly the issue is with OpenBSD's locale support? In particular, there are several components used by Haskeline:

- call setlocale(LC_CTYPE)
- call nl_langinfo(CODESET)
- pass the resulting string (which should be, e.g., $LANG) to iconv_open
- call iconv on user input (which may be malformed)

Is the problem that setting $LC_ALL or $LANG has no effect on the string returned by nl_langinfo, so the translation fails? If so, haskeline is supposed to output "?"s in that case, so there might be a bug in the package.

Finally, when you say you have to "hard-code the character set", are you talking about ghc, haskeline, the base library, or somewhere else?

Best,
-Judah
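The four steps listed above can be sketched in C roughly as follows. This is not Haskeline's code (Haskeline makes these calls through the FFI); the function names `current_codeset` and `decode_input` are hypothetical, the output is fixed to UTF-32LE for simplicity, and the sketch assumes a stateless input codeset. It also shows the "?" substitution for malformed input that the message describes.

```c
#include <errno.h>
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <stdint.h>

/* Steps 1 and 2: adopt the environment's locale, then ask for its codeset.
 * If setlocale() fails, the process simply stays in the "C" locale. */
const char *current_codeset(void)
{
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}

/* Steps 3 and 4: decode `in` (in `codeset`) into code points, one at a time.
 * Malformed bytes become '?'. Returns the number of code points written,
 * or (size_t)-1 if iconv_open() rejects the codeset name. */
size_t decode_input(const char *codeset, const char *in, size_t inlen,
                    uint32_t *out, size_t outcap)
{
    iconv_t cd = iconv_open("UTF-32LE", codeset);
    char *inp = (char *)in;      /* iconv() wants a non-const pointer */
    size_t n = 0;

    if (cd == (iconv_t)-1)
        return (size_t)-1;

    while (inlen > 0 && n < outcap) {
        unsigned char buf[4];    /* room for exactly one UTF-32 unit */
        char *outp = (char *)buf;
        size_t outleft = sizeof(buf);
        size_t rc = iconv(cd, &inp, &inlen, &outp, &outleft);
        int err = errno;

        if (outleft == 0) {
            /* One character was converted; assemble it from LE bytes. */
            out[n++] = (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
                       ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
        } else if (rc == (size_t)-1 && err == EILSEQ) {
            out[n++] = '?';      /* invalid byte: substitute and skip it */
            inp++;
            inlen--;
        } else if (rc == (size_t)-1 && err == EINVAL) {
            out[n++] = '?';      /* truncated sequence at end of input */
            break;
        } else {
            break;               /* unexpected condition: stop */
        }
    }
    iconv_close(cd);
    return n;
}
```

A caller would pass `current_codeset()` as the first argument of `decode_input`. On a system where `nl_langinfo(CODESET)` always reports ASCII, every non-ASCII input byte would take the `EILSEQ` branch.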

Matthias Kilian

6:22 p.m.

Hi,

On Sun, Apr 18, 2010 at 10:53:22AM -0700, Judah Jacobson wrote:

>> Anyway, the short story is that I have to either hard-code the character set to something like utf-8, or ghc will start to behave really strangely (for example, ghci would terminate immediately if you just *type* a non-ASCII character).
>
> That sounds like it might be something to do with the haskeline package, which ghci uses for user interaction. Haskeline makes its own FFI calls to translate raw input bytes into Unicode Chars.

Oh, this may indeed be a second problem. However, the encoding problem itself also manifests in the `openTempFile001' test of the testsuite. For example, with an unpatched ghc-6.12, the test fails with the following output:

    =====> openTempFile001(normal) 1048 of 2375 [0, 38, 0]
    cd ./lib/IO && '/usr/obj/ports/ghc-6.12.2/ghc-6.12.2/inplace/bin/ghc-stage2' -fforce-recomp -dcore-lint -dcmm-lint -no-user-package-conf -dno-debug-output -o openTempFile001 openTempFile001.hs >openTempFile001.comp.stderr 2>&1
    cd ./lib/IO && ./openTempFile001 openTempFile001.run.stdout 2>openTempFile001.run.stderr
    Wrong exit code (expected 0 , actual 1 )
    Stdout:
    Stderr:
    openTempFile001: ./test22236.txt: hClose: invalid argument (Illegal byte sequence)
    *** unexpected failure for openTempFile001(normal)

> Can you elaborate further on what exactly the issue is with OpenBSD's locale support? In particular, there are several components used by Haskeline:
> - call setlocale(LC_CTYPE)

Problem number 1: setlocale(LC_CTYPE) fails (i.e. returns NULL) for any locale except `C' or `POSIX'. Did I mention that OpenBSD is really bad with locales? ;-)

> - call nl_langinfo(CODESET)

Always returns `646' (ASCII). Duh.

> - pass the resulting string (which should be, e.g., $LANG) to iconv_open

iconv_open appears to need the *codeset* name, not a complete locale. Note that OpenBSD uses GNU libiconv-1.13, which AFAIK differs from the one included in glibc. Even worse, I have to pass something like "UTF-8", whereas "UTF8" doesn't work.
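The difference between a locale name and a codeset name can be illustrated with a short sketch. It assumes the common `language[_territory][.codeset][@modifier]` layout of locale names; the function name `codeset_of_locale` is made up for this example. Note that it does no normalization, so a result like "UTF8" would still need mapping to "UTF-8" before iconv_open accepts it on the setup described above.

```c
#include <stddef.h>
#include <string.h>

/* Copy the codeset part of a locale name such as "de_DE.ISO8859-1@euro"
 * into buf. Returns buf, or NULL if the name carries no codeset or the
 * codeset does not fit into buflen bytes (including the terminator). */
char *codeset_of_locale(const char *locale, char *buf, size_t buflen)
{
    const char *dot, *end;
    size_t len;

    if (locale == NULL || (dot = strchr(locale, '.')) == NULL)
        return NULL;             /* e.g. "C" or "POSIX": no codeset part */
    dot++;                       /* first character after the '.' */
    end = strchr(dot, '@');      /* optional modifier ends the codeset */
    len = end ? (size_t)(end - dot) : strlen(dot);
    if (len == 0 || len >= buflen)
        return NULL;
    memcpy(buf, dot, len);
    buf[len] = '\0';
    return buf;
}
```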

> - call iconv on user input (which may be malformed)

I wrote a little C program that does the following (some error checks omitted here):

    char *inp, *outp;
    size_t insz, outsz;
    unsigned char in[] = {0xa9, 0, 0, 0};   /* U+00A9 in UTF-32LE */
    char out[512];
    iconv_t ic;

    inp = (char *)in;
    outp = out;
    insz = sizeof(in);
    outsz = sizeof(out) - 1;
    setlocale(LC_CTYPE, "");
    ic = iconv_open("", "UTF-32LE");        /* "" = codeset of the current locale */
    if (iconv(ic, &inp, &insz, &outp, &outsz) == (size_t)-1) {
        ... bail out (perror() etc.) ...
    }
    iconv_close(ic);
    *outp = 0;
    puts(out);

And it just doesn't work, regardless of what I set LC_CTYPE to. The only way to get it to print the copyright symbol is to explicitly use "UTF-8" (or "ISO-8859-1" or something else that knows about that symbol) as the first argument to iconv_open().
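For comparison, here is a self-contained variant of the working case, with the target codeset passed explicitly instead of taken from the locale. The helper name `encode_utf32le` is hypothetical, and the sketch assumes the iconv implementation knows the names "UTF-32LE", "UTF-8" and "ISO-8859-1" (GNU libiconv and glibc both do).

```c
#include <iconv.h>
#include <stddef.h>

/* Convert one UTF-32LE code unit (4 bytes) to `to_codeset`.
 * Returns the number of bytes written to out, or (size_t)-1 if the
 * codeset name is rejected or the character cannot be converted. */
size_t encode_utf32le(const char *to_codeset, const unsigned char in[4],
                      char *out, size_t outcap)
{
    iconv_t cd = iconv_open(to_codeset, "UTF-32LE");
    char *inp = (char *)in;      /* iconv() wants a non-const pointer */
    char *outp = out;
    size_t inleft = 4, outleft = outcap, rc;

    if (cd == (iconv_t)-1)
        return (size_t)-1;
    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    return rc == (size_t)-1 ? (size_t)-1 : outcap - outleft;
}
```

With the input bytes a9 00 00 00, a "UTF-8" target yields the two bytes c2 a9 and an "ISO-8859-1" target yields the single byte a9.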

> Is the problem that setting $LC_ALL or $LANG has no effect on the string returned by nl_langinfo, so the translation fails?

Yes, see above.

> If so, haskeline is supposed to output "?"s in that case, so there might be a bug in the package.

It fails (or rather: ghci fails, since I haven't done any separate haskeline tests yet) with the same error as the test mentioned above, with the difference that it fails on hPutChar instead of hClose, for obvious reasons.

> Finally, when you say you have to "hard-code the character set", are you talking about ghc, haskeline, the base library, or somewhere else?

I'm talking about libraries/base/GHC/IO/Encoding/Iconv.hs

See? There just is no non-hackerish way to fix this (except of course improving locale support on OpenBSD, but that's beyond my scope currently).

Ciao,
Kili

Simon Marlow

19 Apr

1:57 p.m.

On 18/04/2010 19:22, Matthias Kilian wrote:


> Oh, this may indeed be a second problem. However, the encoding problem itself also manifests in the `openTempFile001' test of the testsuite. For example, with an unpatched ghc-6.12, the test fails with the following output:
>
>     =====> openTempFile001(normal) 1048 of 2375 [0, 38, 0]
>     cd ./lib/IO && '/usr/obj/ports/ghc-6.12.2/ghc-6.12.2/inplace/bin/ghc-stage2' -fforce-recomp -dcore-lint -dcmm-lint -no-user-package-conf -dno-debug-output -o openTempFile001 openTempFile001.hs >openTempFile001.comp.stderr 2>&1
>     cd ./lib/IO && ./openTempFile001 openTempFile001.run.stdout 2>openTempFile001.run.stderr
>     Wrong exit code (expected 0 , actual 1 )
>     Stdout:
>     Stderr:
>     openTempFile001: ./test22236.txt: hClose: invalid argument (Illegal byte sequence)
>     *** unexpected failure for openTempFile001(normal)

A few of the tests in the test suite assume a UTF-8 locale, so you're probably falling foul of that. We could fix the tests - but we do want to test that the locale encoding is being respected in some way, so just adding hSetEncoding to those tests would be wrong. Or you could just make those tests an expected failure on OpenBSD for the time being.

For the IO library, I expect you should default the encoding to Latin-1 on OpenBSD.

Cheers,
Simon
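The message does not say why Latin-1 is suggested. One property that may matter, offered here only as a reading of the suggestion, is that ISO-8859-1 maps every byte value directly to the code point with the same number, so decoding cannot fail with "Illegal byte sequence". Encoding can still fail for code points above U+00FF. A minimal sketch with made-up function names:

```c
#include <stddef.h>
#include <stdint.h>

/* Decode Latin-1 bytes to code points. Every byte is valid input:
 * byte value == code point, covering U+0000..U+00FF. Returns len. */
size_t latin1_decode(const unsigned char *in, size_t len, uint32_t *out)
{
    size_t i;

    for (i = 0; i < len; i++)
        out[i] = in[i];
    return len;
}

/* Encode one code point as Latin-1. Returns 1 and stores the byte on
 * success, or 0 (leaving *out untouched) if cp is above U+00FF. */
int latin1_encode(uint32_t cp, unsigned char *out)
{
    if (cp > 0xff)
        return 0;
    *out = (unsigned char)cp;
    return 1;
}
```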

Matthias Kilian

2:06 p.m.

On Mon, Apr 19, 2010 at 02:57:00PM +0100, Simon Marlow wrote:

> A few of the tests in the test suite assume a UTF-8 locale, so you're probably falling foul of that. We could fix the tests - but we do want to test that the locale encoding is being respected in some way, so just adding hSetEncoding to those tests would be wrong.

Nah, don't touch the tests because of this.

> For the IO library, I expect you should default the encoding to Latin-1 on OpenBSD.

I have a (rather horrible) patch that tries to make sense of LC_ALL or LC_CTYPE if set. If neither is set, I'm currently defaulting to 646//TRANSLIT (which is ASCII with translation of some non-ASCII characters to ASCII art, like `(c)' for \xa9). But Latin-1 may be a more usable default. Thanks for the suggestion.

(No, I'm not going to send this patch to cvs-ghc, it's really too horrid.)

Ciao,
Kili
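The patch itself is not shown in the thread. Purely as an illustration of the lookup order it describes, a sketch in C might consult LC_ALL first and LC_CTYPE second, and leave the choice of default to the caller. The function name is hypothetical, and treating an empty variable as unset is an assumption of this sketch.

```c
#include <stdlib.h>

/* Return the value of LC_ALL if it is set and non-empty, otherwise the
 * value of LC_CTYPE under the same condition, otherwise NULL. The caller
 * decides what to use on NULL, for example "646//TRANSLIT" or a Latin-1
 * codeset name. */
const char *ctype_locale_from_env(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE" };
    size_t i;

    for (i = 0; i < sizeof(vars) / sizeof(vars[0]); i++) {
        const char *value = getenv(vars[i]);

        if (value != NULL && *value != '\0')
            return value;
    }
    return NULL;
}
```

The returned string is a full locale name, so its codeset part still has to be extracted and mapped to a name that iconv_open accepts.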
