
On Mon, 2008-02-25 at 21:49 +0000, Ross Paterson wrote:
On Mon, Feb 25, 2008 at 09:07:08PM +0000, Duncan Coutts wrote:
It's no use pretending that readFile returns Unicode, it just doesn't (except on Hugs which does it properly). GHC is not going to catch up on this any time soon.
On the contrary, it's the only way to stay sane. readFile does return Unicode, it just doesn't read UTF. Putting compensating bugs in the libraries is only going to make it harder for GHC to change.
True. In fact it'll never help, because it's not specified what encoding it should use, but we want to use one specific encoding. For printing to stdout we would want to use some future improved standard text handle, but for reading .cabal files we're specifying that they are UTF-8, irrespective of the current locale.
If we open the files in binary mode we don't get the CR/LF line conversion on Windows, so we'd have to do that ourselves. Perhaps that's the way to go.
I think we've been ignoring CRs in .cabal files ever since we had to deal with tar files built on Windows and unpacked on Unix.
So if we use files opened in binary mode and account for line end differences then this is portable and doesn't make it harder for GHC to switch text handles to use a more sensible encoding. I'll push patches to do this.
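To make the idea concrete, here is a rough sketch of what reading a file this way could look like using only base. The names readUtf8File, decodeUtf8Bytes and normaliseLineEndings are made up for illustration and are not the actual patch. It assumes that a handle opened with openBinaryFile yields one Char per byte (0..255), and the decoder is deliberately simplified: malformed sequences become U+FFFD, but overlong encodings and surrogates are not rejected.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)
import System.IO (IOMode (ReadMode), hGetContents, openBinaryFile)

-- Hypothetical helper: read a file as raw bytes, decode it as UTF-8
-- and normalise line endings, independently of the current locale.
readUtf8File :: FilePath -> IO String
readUtf8File path = do
  h <- openBinaryFile path ReadMode
  fmap (normaliseLineEndings . decodeUtf8Bytes) (hGetContents h)

-- Treat CR LF and a lone CR as a single LF.
normaliseLineEndings :: String -> String
normaliseLineEndings [] = []
normaliseLineEndings ('\r' : '\n' : s) = '\n' : normaliseLineEndings s
normaliseLineEndings ('\r' : s) = '\n' : normaliseLineEndings s
normaliseLineEndings (c : s) = c : normaliseLineEndings s

-- Decode a String in which every Char holds one byte (0..255).
decodeUtf8Bytes :: String -> String
decodeUtf8Bytes [] = []
decodeUtf8Bytes (c : cs)
  | b < 0x80 = c : decodeUtf8Bytes cs
  | b >= 0xC0 && b < 0xE0 = continuation 1 (b .&. 0x1F) cs
  | b >= 0xE0 && b < 0xF0 = continuation 2 (b .&. 0x0F) cs
  | b >= 0xF0 && b < 0xF8 = continuation 3 (b .&. 0x07) cs
  | otherwise = replacement : decodeUtf8Bytes cs
  where
    b = ord c

-- Consume n continuation bytes (10xxxxxx), accumulating six bits from each.
-- On a bad or missing continuation byte, emit U+FFFD and resume decoding
-- at the offending byte.
continuation :: Int -> Int -> String -> String
continuation 0 acc rest = safeChr acc : decodeUtf8Bytes rest
continuation n acc (c : rest)
  | ord c .&. 0xC0 == 0x80 =
      continuation (n - 1) ((acc `shiftL` 6) .|. (ord c .&. 0x3F)) rest
continuation _ _ rest = replacement : decodeUtf8Bytes rest

-- chr fails above U+10FFFF, so map out-of-range values to U+FFFD.
safeChr :: Int -> Char
safeChr n
  | n > 0x10FFFF = replacement
  | otherwise = chr n

replacement :: Char
replacement = '\xFFFD'
```

Since nothing here depends on how text handles encode or decode, it should keep working unchanged if GHC later switches text handles to a more sensible encoding.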
As for stdout/stderr we're just stuffed. We cannot reopen them in binary mode, and Hugs and GHC have different and incompatible behaviour. We either end up double encoding with Hugs or not decoding with GHC. There is no single method that works with both. We'd have to switch on the system in use.
My suggestion is to just write Chars to these Handles, even though text handles in GHC currently only work in an ISO-8859-1 locale.
Well, it's not a matter of which locale we're in, it's whether we restrict ourselves to printing only ISO-8859-1 chars, and we know we need more than that.
That's what the other libraries in your program will be doing with those handles, and they're not wrong: the other way lies madness.
It doesn't actually change the fact that our error messages will print garbage when they include snippets of a .cabal file that contained non-ISO-8859-1 chars.
Is switching the standard text handles to UTF really an impossibly remote prospect?
I'm not sure really. Perhaps we can raise it on haskell-cafe and/or libraries. I think the resistance at GHC HQ comes not from the difficulty but from the fear of breaking things and upsetting people. If there were an obvious consensus, that fear might be allayed.

Duncan