Re: patch applied (cabal): First pass at parsing .cabal files as UTF8

26 Feb 2008

On Mon, 2008-02-25 at 21:49 +0000, Ross Paterson wrote:
...
> On Mon, Feb 25, 2008 at 09:07:08PM +0000, Duncan Coutts wrote:
...
> > It's no use pretending that readFile returns Unicode, it just doesn't
> > (except on Hugs which does it properly). GHC is not going to catch up on
> > this any time soon.
> On the contrary, it's the only way to stay sane.  readFile does return
> Unicode, it just doesn't read UTF.  Putting compensating bugs in the
> libraries is only going to make it harder for GHC to change.

True.

In fact it'll never help, because it's not specified what encoding it
should use, whereas we want to use one specific encoding. For printing to
stdout we would want to use some future improved standard text handle,
but for reading .cabal files we're specifying that they are UTF-8,
irrespective of the current locale.
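
A minimal sketch of "read it as UTF-8 whatever the locale says". It
assumes the System.IO encoding API (hSetEncoding, utf8) found in later
versions of base, which is not something this thread had available, and
readFileUtf8 is a hypothetical name rather than a function in Cabal:

```haskell
import System.IO

-- | Read a file as UTF-8, ignoring the current locale.
-- Hypothetical helper for illustration; not taken from the Cabal sources.
readFileUtf8 :: FilePath -> IO String
readFileUtf8 path = do
  -- Binary mode: no locale-dependent decoding and no CR/LF translation.
  h <- openBinaryFile path ReadMode
  -- Ask for UTF-8 decoding explicitly instead of inheriting the locale.
  hSetEncoding h utf8
  -- Lazy read; the handle is closed once the contents are fully consumed.
  hGetContents h
```

Because the handle was opened in binary mode, any CR characters are still
present in the result and have to be dealt with separately.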
...
...
> > If we open the files in binary mode we don't get the cr/lf line
> > conversion on Windows and we'd have to do that ourselves. Perhaps that's
> > the way to go.
> I think we've been ignoring CRs in .cabal files ever since we had to
> deal with tar files built on Windows and unpacked on Unix.

So if we use files opened in binary mode and account for line end
differences then this is portable and doesn't make it harder for GHC to
switch text handles to use a more sensible encoding. I'll push patches
to do this.
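
Accounting for line end differences by hand could look roughly like the
following. This is only one possible policy (fold CR LF into LF and leave
a lone CR alone), and normaliseLineEndings is a hypothetical name, not
the code in the patches:

```haskell
-- | Treat CR LF as a single line end; a lone CR is left as it is.
-- Hypothetical helper for illustration only.
normaliseLineEndings :: String -> String
normaliseLineEndings ('\r':'\n':rest) = '\n' : normaliseLineEndings rest
normaliseLineEndings (c:rest)         = c : normaliseLineEndings rest
normaliseLineEndings []               = []
```

With this, a file written on Windows and one written on Unix give the
same result from lines, whichever platform reads them.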
...
...
> > As for stdout/stderr we're just stuffed. We cannot reopen them in binary
> > mode and hugs and ghc have different and incompatible behaviour. We
> > either end up double encoding with hugs or not decoding with ghc. There
> > is no single method that works with both. We'd have to switch on the
> > system in use.
> My suggestion is to just write Chars to these Handles, even though text
> handles in GHC currently only work in an ISO-8859-1 locale.

Well, it's not a matter of which locale we're in; it only works if we
restrict ourselves to printing ISO-8859-1 chars, and we know we need
more than that.
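
To make the trade-off concrete, here is a sketch of the manual encoding
step that the "switch on the system in use" approach would need on a
system whose text handles write one byte per Char. encodeUtf8 is a
hypothetical helper, and applying it on a system whose handles already
encode their output is exactly the double encoding described above:

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)

-- | Encode each Char as its UTF-8 byte sequence, one Char per byte.
-- Hypothetical helper; surrogate code points get no special treatment.
encodeUtf8 :: String -> String
encodeUtf8 = concatMap (map chr . encodeChar . ord)
  where
    encodeChar c
      | c < 0x80    = [c]
      | c < 0x800   = [0xC0 .|. (c `shiftR` 6), cont c]
      | c < 0x10000 = [0xE0 .|. (c `shiftR` 12), cont (c `shiftR` 6), cont c]
      | otherwise   = [ 0xF0 .|. (c `shiftR` 18), cont (c `shiftR` 12)
                      , cont (c `shiftR` 6), cont c ]
    -- Continuation byte: bits 10xxxxxx carrying six bits of the code point.
    cont x = 0x80 .|. (x .&. 0x3F)
```

Every Char in the result is below 256, so a handle that simply truncates
Chars to bytes would emit well-formed UTF-8.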
...
> That's what the other libraries in your program will be doing with
> those handles, and they're not wrong: the other way lies madness.

It doesn't actually change the fact that our error messages will print
garbage when they include snippets of a .cabal file that contained
non-ISO-8859-1 chars.
...
> Is switching the standard text handles to UTF really an impossibly
> remote prospect?

I'm not sure really. Perhaps we can raise it on haskell-cafe and/or
libraries. I think the resistance at GHC HQ comes not from the difficulty
but from the fear of breaking things and upsetting people. If there were
an obvious consensus, that fear might be allayed.

Duncan