
Hi,

On Sun, Apr 18, 2010 at 10:53:22AM -0700, Judah Jacobson wrote:
> > Anyway, the short story is that I have to either hard-code the character set to something like utf-8, or ghc will start to behave really strangely (for example, ghci would terminate immediately if you just *type* a non-ASCII character).
> That sounds like it might be something to do with the haskeline package, which ghci uses for user interaction. Haskeline makes its own FFI calls to translate raw input bytes into Unicode Chars.
Oh, this may indeed be a second problem. However, the encoding problem itself also manifests in the `openTempFile001' test of the testsuite. For example, with an unpatched ghc-6.12, the test fails with the following output:

    =====> openTempFile001(normal) 1048 of 2375 [0, 38, 0]
    cd ./lib/IO && '/usr/obj/ports/ghc-6.12.2/ghc-6.12.2/inplace/bin/ghc-stage2' -fforce-recomp -dcore-lint -dcmm-lint -no-user-package-conf -dno-debug-output -o openTempFile001 openTempFile001.hs >openTempFile001.comp.stderr 2>&1
    cd ./lib/IO && ./openTempFile001 openTempFile001.run.stdout 2>openTempFile001.run.stderr
    Wrong exit code (expected 0 , actual 1 )
    Stdout:
    Stderr:
    openTempFile001: ./test22236.txt: hClose: invalid argument (Illegal byte sequence)
    *** unexpected failure for openTempFile001(normal)
> Can you elaborate further on what exactly the issue is with OpenBSD's locale support? In particular, there are several components used by Haskeline:
> - call setlocale(LC_CTYPE)
Problem number 1: setlocale(LC_CTYPE) fails (i.e. returns NULL) for any locale except `C' or `POSIX'. Did I mention that OpenBSD is really bad with locales? ;-)
> - call nl_langinfo(CODESET)
Always returns `646' (ASCII). Duh.
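For anyone who wants to check the two calls above on their own system, here is a minimal sketch. The helper name probe_codeset is made up for illustration and is not part of ghc or haskeline; it only wraps the standard setlocale() and nl_langinfo() calls.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper (not from ghc or haskeline): try to select
 * `locale` for LC_CTYPE and report the codeset libc associates with it.
 * Returns NULL when setlocale() rejects the locale. */
const char *probe_codeset(const char *locale)
{
    if (setlocale(LC_CTYPE, locale) == NULL)
        return NULL;                 /* locale not supported by this libc */
    return nl_langinfo(CODESET);     /* e.g. `646' on the system described */
}
```

Calling probe_codeset("") picks up LC_ALL, LC_CTYPE and LANG from the environment, so running it under different settings shows whether they have any effect. The returned string belongs to libc and may be overwritten by later calls.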
> - pass the resulting string (which should be, e.g., $LANG) to iconv_open
iconv_open appears to need the *codeset* name, not a complete locale. Note that OpenBSD uses GNU libiconv-1.13, which AFAIK differs from the one included in glibc. Even worse, I have to pass something like "UTF-8", whereas "UTF8" doesn't work.
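Which spellings of a codeset name are accepted depends on the iconv implementation, so it can be worth testing candidates directly. A minimal sketch, with codeset_usable as a hypothetical helper name:

```c
#include <iconv.h>

/* Hypothetical helper: returns 1 if iconv_open() accepts `name` as the
 * target codeset for UTF-32LE input, 0 otherwise. */
int codeset_usable(const char *name)
{
    iconv_t ic = iconv_open(name, "UTF-32LE");

    if (ic == (iconv_t)-1)
        return 0;        /* unknown name or unsupported conversion */
    iconv_close(ic);
    return 1;
}
```

Comparing codeset_usable("UTF-8") with codeset_usable("UTF8") would show the difference described above. Other iconv implementations may accept both spellings, so the result is only meaningful for the library actually linked in.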
> - call iconv on user input (which may be malformed)
I wrote a little C program that does the following (some error checks omitted here):

    iconv_t ic;
    char *inp, *outp;
    size_t insz, outsz;
    unsigned char in[] = {0xa9, 0, 0, 0};   /* U+00A9 in UTF-32LE */
    char out[512];

    inp = (char *)in;
    outp = out;
    insz = sizeof(in);
    outsz = sizeof(out) - 1;
    setlocale(LC_CTYPE, "");
    /* "" as target = the locale's character encoding */
    ic = iconv_open("", "UTF-32LE");
    if (iconv(ic, &inp, &insz, &outp, &outsz) == (size_t)-1) {
        ... bail out (perror() etc.) ...
    }
    iconv_close(ic);
    *outp = 0;
    puts(out);

And it just doesn't work, regardless of what I set LC_CTYPE to. The only way to get it to print the copyright symbol is to explicitly use "UTF-8" (or "ISO-8859-1" or something else that knows about that symbol) as the first argument to iconv_open().
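For comparison, the working case with an explicit target codeset can be written as a small function. This is only a sketch: convert_utf32le is a hypothetical name, and the error handling is reduced to a single return code.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert `insz` bytes of UTF-32LE input to the
 * codeset `tocode`, writing a NUL-terminated result into `out`.
 * Returns 0 on success, -1 on any failure. */
int convert_utf32le(const char *tocode, const unsigned char *in, size_t insz,
                    char *out, size_t outsz)
{
    char *inp = (char *)in;   /* iconv() takes char **, input is not modified */
    char *outp = out;
    size_t outleft, rc;
    iconv_t ic;

    if (outsz == 0)
        return -1;
    outleft = outsz - 1;      /* keep one byte for the terminating NUL */

    ic = iconv_open(tocode, "UTF-32LE");
    if (ic == (iconv_t)-1)
        return -1;
    rc = iconv(ic, &inp, &insz, &outp, &outleft);
    iconv_close(ic);
    if (rc == (size_t)-1)
        return -1;

    *outp = '\0';
    return 0;
}
```

With the input bytes {0xa9, 0, 0, 0} and "UTF-8" as tocode, the output is the two bytes 0xc2 0xa9, i.e. the copyright symbol in UTF-8.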
> Is the problem that setting $LC_ALL or $LANG has no effect on the string returned by nl_langinfo, so the translation fails?
Yes, see above.
> If so, haskeline is supposed to output "?"s in that case, so there might be a bug in the package.
It fails (or rather: ghci fails, since I didn't yet do any separate haskeline tests) with the same error as the test mentioned above, with the difference that it fails on hPutChar instead of hClose for obvious reasons.
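To make the "output ?s" idea concrete, here is a rough C sketch of such a fallback. It is not haskeline's actual code; convert_lossy is a hypothetical name, and it assumes the iconv implementation reports unconvertible characters with EILSEQ and leaves the input pointer at the offending character (some implementations may substitute a replacement on their own instead).

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert UTF-32LE input to `tocode`, writing '?'
 * for every character iconv() rejects with EILSEQ.
 * Returns 0 on success, -1 on any other failure. */
int convert_lossy(const char *tocode, const unsigned char *in, size_t insz,
                  char *out, size_t outsz)
{
    char *inp = (char *)in;
    char *outp = out;
    size_t outleft;
    iconv_t ic;

    if (outsz == 0)
        return -1;
    outleft = outsz - 1;              /* reserve room for the NUL */

    ic = iconv_open(tocode, "UTF-32LE");
    if (ic == (iconv_t)-1)
        return -1;

    while (insz > 0) {
        if (iconv(ic, &inp, &insz, &outp, &outleft) != (size_t)-1)
            break;                    /* everything was converted */
        if (errno != EILSEQ || insz < 4 || outleft == 0) {
            iconv_close(ic);
            return -1;                /* truncated input or output full */
        }
        *outp++ = '?';                /* stand-in for the rejected character */
        outleft--;
        inp += 4;                     /* skip one UTF-32 code unit */
        insz -= 4;
    }

    iconv_close(ic);
    *outp = '\0';
    return 0;
}
```

A fallback like this trades correctness for robustness: the program keeps running, but the user sees "?" instead of the character that was typed.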
> Finally, when you say you have to "hard-code the character set", are you talking about ghc, haskeline, the base library, or somewhere else?
I'm talking about libraries/base/GHC/IO/Encoding/Iconv.hs

See? There just is no non-hackerish way to fix this (except of course improving locale support on OpenBSD, but that's beyond my scope currently).

Ciao,
Kili