patch applied (cabal): First pass at parsing .cabal files as UTF8

Sat Feb 23 10:40:25 PST 2008 Duncan Coutts

On Sat, Feb 23, 2008 at 10:49:59AM -0800, Duncan Coutts wrote:
Sat Feb 23 10:40:25 PST 2008 Duncan Coutts
* First pass at parsing .cabal files as UTF8
Also print output and error messages etc in UTF8.
On the input side, wouldn't it be better to do this on the boundary, so that String always means a list of Chars, not octets? That is, have a readUTFFile that opens the file in binary mode and applies fromUTF to its contents, yielding a real String.

On Sun, 2008-02-24 at 14:44 +0000, Ross Paterson wrote:
On Sat, Feb 23, 2008 at 10:49:59AM -0800, Duncan Coutts wrote:
Sat Feb 23 10:40:25 PST 2008 Duncan Coutts
* First pass at parsing .cabal files as UTF8
Also print output and error messages etc in UTF8.
On the input side, wouldn't it be better to do this on the boundary, so that String always means a list of Chars, not octets? That is, have a readUTFFile that opens the file in binary mode and applies fromUTF to its contents, yielding a real String.
You're right of course. I've added readTextFile and writeTextFile to the Utils module and checked all other uses of readFile and writeFile. I've also switched the rawSystemStdout to assume UTF8 output format. So what about hackage? It now has to assume the Strings in the package description etc are proper Haskell Unicode Strings and convert to UTF8 output. Distribution.Simple.Utils exports toUTF8 for this purpose. Duncan
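For concreteness, a minimal sketch of what such helpers could look like, assuming the toUTF8 mentioned above and a matching fromUTF8 from Distribution.Simple.Utils (both String -> String, with octets carried as Chars in the 0-255 range); the names and details here are illustrative, not the actual patch:

    import System.IO (IOMode(..), openBinaryFile, hGetContents, hPutStr, hClose)
    import Distribution.Simple.Utils (fromUTF8, toUTF8)  -- assumed exports, circa 2008 Cabal

    -- Read a UTF-8 encoded file into a real (decoded) String.
    readTextFile :: FilePath -> IO String
    readTextFile path = do
      h <- openBinaryFile path ReadMode
      fmap fromUTF8 (hGetContents h)

    -- Encode a String as UTF-8 octets and write them in binary mode.
    writeTextFile :: FilePath -> String -> IO ()
    writeTextFile path str = do
      h <- openBinaryFile path WriteMode
      hPutStr h (toUTF8 str)
      hClose h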

On Sun, Feb 24, 2008 at 05:46:35PM +0000, Duncan Coutts wrote:
So what about hackage? It now has to assume the Strings in the package description etc are proper Haskell Unicode Strings and convert to UTF8 output. Distribution.Simple.Utils exports toUTF8 for this purpose.
XHTML output should be OK: it assumes a charset of iso8859-1 and turns higher chars into HTML entities. text/plain output (used by cabal upload) will need some work, though. I'm not sure what the deal is with charset negotiation.
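The escaping idea, as a rough sketch (not the xhtml library's actual code; escapeHigher is a hypothetical name): anything above ISO-8859-1 becomes a numeric character reference.

    import Data.Char (ord)

    -- Render a String for an iso8859-1 page: characters above 0xFF
    -- become numeric HTML entities, so no information is lost.
    escapeHigher :: String -> String
    escapeHigher = concatMap esc
      where
        esc c | ord c > 0xFF = "&#" ++ show (ord c) ++ ";"
              | otherwise    = [c]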

On Sun, 2008-02-24 at 18:28 +0000, Ross Paterson wrote:
On Sun, Feb 24, 2008 at 05:46:35PM +0000, Duncan Coutts wrote:
So what about hackage? It now has to assume the Strings in the package description etc are proper Haskell Unicode Strings and convert to UTF8 output. Distribution.Simple.Utils exports toUTF8 for this purpose.
XHTML output should be OK: it assumes a charset of iso8859-1 and turns higher chars into HTML entities.
Right, ok.
text/plain output (used by cabal upload) will need some work, though. I'm not sure what the deal is with charset negotiation.
Yeah, me neither. I know it's possible in principle but I have no idea how to do it. It may not be possible through the CGI interface. BTW, I notice the new tags stuff on hackage. I was originally thinking we would use "x-" extra fields in the .cabal file for that kind of thing. They're now parsed and exposed as [(name,value)] in the PackageDescription. The main thing preventing that at the moment is we don't have a pretty printer for package descriptions, at least not one that works. Otherwise, all the stuff about wanting to edit package descriptions after upload, eg to edit the description, add extra links etc, could be done that way (with suitable authentication). And the extra tags we'll want, like:
HAppS/0.8.4/HAppS.cabal:
x-hackage-superceded-by: HAppS-Server
Duncan

On Sun, Feb 24, 2008 at 08:49:04PM +0000, Duncan Coutts wrote:
On Sun, 2008-02-24 at 18:28 +0000, Ross Paterson wrote:
text/plain output (used by cabal upload) will need some work, though. I'm not sure what the deal is with charset negotiation.
Yeah, me neither. I know it's possible in principle but no idea how to do it. It may not be possible through the CGI interface.
I think I just need to check the Accept-Charset header as well as Accept. I could probably cope with us-ascii or iso-8859-1 (replacing higher Chars) or utf-8. Not as vital as the HTML output, though.
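A hypothetical sketch of that selection, ignoring q-values entirely, just to make the idea concrete (chooseCharset is an invented name, not any CGI library's API):

    import Data.Char (toLower)
    import Data.List (isInfixOf)

    -- Pick the best charset we can produce from a raw Accept-Charset
    -- header value.  Real negotiation would also parse q-values.
    chooseCharset :: String -> String
    chooseCharset accept
      | "utf-8"      `isInfixOf` lc = "utf-8"
      | "iso-8859-1" `isInfixOf` lc = "iso-8859-1"
      | otherwise                   = "us-ascii"
      where lc = map toLower accept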
BTW, I notice the new tags stuff on hackage. I was originally thinking we would use "x-" extra fields in the .cabal file for that kind of thing. They're now parsed and exposed as [(name,value)] in the PackageDescription. The main thing preventing that at the moment is we don't have a pretty printer for package descriptions, at least not one that works.
Otherwise, all the stuff about wanting to edit package descriptions after upload, eg to edit the description, add extra links etc could be done that way (with suitable authentication). And the extra tags we'll want like:
HAppS/0.8.4/HAppS.cabal:
x-hackage-superceded-by: HAppS-Server
The tags stuff was just a quick hack, but in general my preference is to keep the additional data outside of the package, so that what one downloads under a particular package-id is always what was uploaded.

On Mon, 2008-02-25 at 00:52 +0000, Ross Paterson wrote:
The tags stuff was just a quick hack, but in general my preference is to keep the additional data outside of the package, so that what one downloads under a particular package-id is always what was uploaded.
I rather like the idea of keeping the .cabal file separate from the tarball exactly to allow us to fix it without altering the tarball. The obvious changes are altering the description to add extra links, etc., and fixing lax dependency constraints. Perhaps this is because I've been working with distros for a while, where we always have a script that allows us to tweak things to make them work. So it depends on to what extent we want to manage hackage like a distribution or as a pristine upstream site. I think we want to do both really, at least for some subset of packages. I don't think we need to go as far as maintaining patch sets like some distros do. In gentoo, the most common patches we apply in the ebuild scripts are actually changes to the .cabal file. Duncan

On Mon, Feb 25, 2008 at 09:35:58AM +0000, Duncan Coutts wrote:
On Mon, 2008-02-25 at 00:52 +0000, Ross Paterson wrote:
The tags stuff was just a quick hack, but in general my preference is to keep the additional data outside of the package, so that what one downloads under a particular package-id is always what was uploaded.
I rather like the idea of keeping the .cabal file separate from the tarball exactly to allow us to fix it without altering the tarball. The obvious changes are altering the description to add extra links, etc., and fixing lax dependency constraints.
Perhaps this is because I've been working with distros for a while, where we always have a script that allows us to tweak things to make them work. So it depends on to what extent we want to manage hackage like a distribution or as a pristine upstream site. I think we want to do both really, at least for some subset of packages. I don't think we need to go as far as maintaining patch sets like some distros do. In gentoo, the most common patches we apply in the ebuild scripts are actually changes to the .cabal file.
I don't follow you there. There's a copy of the .cabal file inside the tarball, which you say you're not changing, but it is the tarball that people will download and build. Should the pristine sources used by secondary distributions include the modified cabal file, and if so, should they include a timestamp in the version number?

On Mon, 2008-02-25 at 11:41 +0000, Ross Paterson wrote:
I don't follow you there. There's a copy of the .cabal file inside the tarball, which you say you're not changing, but it is the tarball that people will download and build.
True, however when planning what to install, cabal-install at least will use the .cabal file from the index, so any fixes to deps etc will be taken into account. We would also encourage people writing tools to convert cabal packages to distro packages to use the .cabal file from the index.
Should the pristine sources used by secondary distributions include the modified cabal file, and if so, should they include a timestamp in the version number?
Not sure. In gentoo they use an extra revision number for the ebuild itself that is distinct from the version of the package. If we want to create derivative package sets we may want a revision number or timestamp. Duncan

On Mon, Feb 25, 2008 at 09:11:50PM +0000, Duncan Coutts wrote:
On Mon, 2008-02-25 at 11:41 +0000, Ross Paterson wrote:
I don't follow you there. There's a copy of the .cabal file inside the tarball, which you say you're not changing, but it is the tarball that people will download and build.
True, however when planning what to install, cabal-install at least will use the .cabal file from the index, so any fixes to deps etc will be taken into account. We would also encourage people writing tools to convert cabal packages to distro packages to use the .cabal file from the index.
For what it's worth, I agree with Ross (if I understand his position correctly, at any rate): I don't think we should be changing the external cabal file, nor expecting distributions to use any external changes. If changes are needed then someone should do a new upload, with a different version number. Thanks Ian

On Mon, 2008-02-25 at 23:04 +0000, Ian Lynagh wrote:
On Mon, Feb 25, 2008 at 09:11:50PM +0000, Duncan Coutts wrote:
On Mon, 2008-02-25 at 11:41 +0000, Ross Paterson wrote:
I don't follow you there. There's a copy of the .cabal file inside the tarball, which you say you're not changing, but it is the tarball that people will download and build.
True, however when planning what to install, cabal-install at least will use the .cabal file from the index, so any fixes to deps etc will be taken into account. We would also encourage people writing tools to convert cabal packages to distro packages to use the .cabal file from the index.
For what it's worth, I agree with Ross (if I understand his position correctly, at any rate): I don't think we should be changing the external cabal file, nor expecting distributions to use any external changes. If changes are needed then someone should do a new upload, with a different version number.
Though you'd have to admit that distros will fix dependencies to be more accurate without waiting for any upstream release. Accurate deps are often something one only discovers after having released and got other people to test on a range of platforms. Duncan

On Sun, Feb 24, 2008 at 05:46:35PM +0000, Duncan Coutts wrote:
I've added readTextFile and writeTextFile to the Utils module and checked all other uses of readFile and writeFile.
I've also switched the rawSystemStdout to assume UTF8 output format.
The read and write functions ought to open their files in binary mode. It's just wrong to read Unicode characters (which is what a plain text Handle promises you) and treat them as bytes. There's a similar problem with using toUTF on stdout and stderr. Haskell 98 is very clear that putChar on those Handles takes Unicode characters, though it does not specify how these are encoded in the environment. GHC has historically assumed an ISO-8859-1 encoding, truncating larger characters, but other implementations could map them to the current locale (as Hugs does). Perhaps a future GHC will map them to UTF. I think you should just hand the characters to putChar and leave their presentation to the implementation, flawed though GHC's currently is.

On Mon, 2008-02-25 at 11:53 +0000, Ross Paterson wrote:
On Sun, Feb 24, 2008 at 05:46:35PM +0000, Duncan Coutts wrote:
I've added readTextFile and writeTextFile to the Utils module and checked all other uses of readFile and writeFile.
I've also switched the rawSystemStdout to assume UTF8 output format.
The read and write functions ought to open their files in binary mode. It's just wrong to read Unicode characters (which is what a plain text Handle promises you) and treat them as bytes. There's a similar problem with using toUTF on stdout and stderr. Haskell 98 is very clear that putChar on those Handles takes Unicode characters, though it does not specify how these are encoded in the environment. GHC has historically assumed an ISO-8859-1 encoding, truncating larger characters, but other implementations could map them to the current locale (as Hugs does). Perhaps a future GHC will map them to UTF. I think you should just hand the characters to putChar and leave their presentation to the implementation, flawed though GHC's currently is.
It is a mess. It's no use pretending that readFile returns Unicode, it just doesn't (except on Hugs which does it properly). GHC is not going to catch up on this any time soon. If we open the files in binary mode we don't get the cr/lf line conversion on Windows and we'd have to do that ourselves. Perhaps that's the way to go. As for stdout/stderr we're just stuffed. We cannot reopen them in binary mode and hugs and ghc have different and incompatible behaviour. We either end up double encoding with hugs or not decoding with ghc. There is no single method that works with both. We'd have to switch on the system in use. Duncan
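Doing the cr/lf conversion ourselves is straightforward; a sketch (hypothetical helper, not Cabal's code):

    -- After a binary-mode read, collapse "\r\n" (Windows) and stray
    -- "\r" to "\n" so later processing sees Unix line endings.
    normaliseLineEndings :: String -> String
    normaliseLineEndings []               = []
    normaliseLineEndings ('\r':'\n':rest) = '\n' : normaliseLineEndings rest
    normaliseLineEndings ('\r':rest)      = '\n' : normaliseLineEndings rest
    normaliseLineEndings (c:rest)         = c    : normaliseLineEndings rest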

duncan.coutts:
On Mon, 2008-02-25 at 11:53 +0000, Ross Paterson wrote:
On Sun, Feb 24, 2008 at 05:46:35PM +0000, Duncan Coutts wrote:
I've added readTextFile and writeTextFile to the Utils module and checked all other uses of readFile and writeFile.
I've also switched the rawSystemStdout to assume UTF8 output format.
The read and write functions ought to open their files in binary mode. It's just wrong to read Unicode characters (which is what a plain text Handle promises you) and treat them as bytes. There's a similar problem with using toUTF on stdout and stderr. Haskell 98 is very clear that putChar on those Handles takes Unicode characters, though it does not specify how these are encoded in the environment. GHC has historically assumed an ISO-8859-1 encoding, truncating larger characters, but other implementations could map them to the current locale (as Hugs does). Perhaps a future GHC will map them to UTF. I think you should just hand the characters to putChar and leave their presentation to the implementation, flawed though GHC's currently is.
It is a mess.
It's no use pretending that readFile returns Unicode, it just doesn't (except on Hugs which does it properly). GHC is not going to catch up on this any time soon.
Why don't we use the existing, portable UTF8 IO package? http://hackage.haskell.org/packages/archive/utf8-string/0.2/doc/html/System-... -- Don
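Usage would be close to a drop-in change; a sketch using utf8-string's System.IO.UTF8 module as linked above (the file names are illustrative):

    import qualified System.IO.UTF8 as UTF8

    main :: IO ()
    main = do
      s <- UTF8.readFile "pkg.cabal"      -- reads bytes, decodes UTF-8 to String
      UTF8.writeFile "pkg-copy.cabal" s   -- encodes the String back to UTF-8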

On Mon, Feb 25, 2008 at 01:26:52PM -0800, Donald Bruce Stewart wrote:
duncan.coutts:
It's no use pretending that readFile returns Unicode, it just doesn't (except on Hugs which does it properly). GHC is not going to catch up on this any time soon.
Why don't we use the existing, portable UTF8 IO package?
I'd much rather we fix the IO library that's already in the corelibs than add a second one to work around it. I wonder if that's suitable for a SoC project? Thanks Ian

igloo:
On Mon, Feb 25, 2008 at 01:26:52PM -0800, Donald Bruce Stewart wrote:
duncan.coutts:
It's no use pretending that readFile returns Unicode, it just doesn't (except on Hugs which does it properly). GHC is not going to catch up on this any time soon.
Why don't we use the existing, portable UTF8 IO package?
I'd much rather we fix the IO library that's already in the corelibs than add a second one to work around it.
I wonder if that's suitable for a SoC project?
Add System.IO.UTF8.{readFile,writeFile} to the base library? -- Don

On Tue, Feb 26, 2008 at 12:09 AM, Don Stewart wrote:
igloo:
On Mon, Feb 25, 2008 at 01:26:52PM -0800, Donald Bruce Stewart wrote:
duncan.coutts:
It's no use pretending that readFile returns Unicode, it just doesn't (except on Hugs which does it properly). GHC is not going to catch up on this any time soon.
Why don't we use the existing, portable UTF8 IO package?
I'd much rather we fix the IO library that's already in the corelibs than add a second one to work around it.
I wonder if that's suitable for a SoC project?
Add System.IO.UTF8.{readFile,writeFile} to the base library?
I'd rather see that we add a more general solution for reading and writing Unicode than add two functions specialized for UTF-8 that we can't remove later when we do have a less ad-hoc solution. -- Johan
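One hypothetical shape for such a general solution, sketched only to show the direction; none of these names exist in any library:

    import System.IO (IOMode(ReadMode), openBinaryFile, hGetContents)

    -- Parameterise text I/O by a codec instead of baking UTF-8 into
    -- the function names.  Octets travel as Chars in the 0-255 range.
    data Encoding = Encoding
      { encode :: String -> String   -- Unicode Chars -> octets
      , decode :: String -> String   -- octets -> Unicode Chars
      }

    readFileWith :: Encoding -> FilePath -> IO String
    readFileWith enc path = do
      h <- openBinaryFile path ReadMode
      fmap (decode enc) (hGetContents h)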

johan.tibell:
On Tue, Feb 26, 2008 at 12:09 AM, Don Stewart wrote:
igloo:
On Mon, Feb 25, 2008 at 01:26:52PM -0800, Donald Bruce Stewart wrote:
duncan.coutts:
It's no use pretending that readFile returns Unicode, it just doesn't (except on Hugs which does it properly). GHC is not going to catch up on this any time soon.
Why don't we use the existing, portable UTF8 IO package?
I'd much rather we fix the IO library that's already in the corelibs than add a second one to work around it.
I wonder if that's suitable for a SoC project?
Add System.IO.UTF8.{readFile,writeFile} to the base library?
I'd rather see that we add a more general solution for reading and writing Unicode than add two functions specialized for UTF-8 that we can't remove later when we do have a less ad-hoc solution.
Whatever we decide on, I'd like to see it as a standalone library first, so we can at least get some experience with it. I'm really more suggesting utf8-string as a model -- we don't need to rip apart base to get this done. -- Don

On Tue, Feb 26, 2008 at 12:28:56AM -0800, Don Stewart wrote:
johan.tibell:
On Tue, Feb 26, 2008 at 12:09 AM, Don Stewart wrote:
Add System.IO.UTF8.{readFile,writeFile} to the base library?
I'd rather see that we add a more general solution for reading and writing Unicode than add two functions specialized for UTF-8 that we can't remove later when we do have a less ad-hoc solution.
Whatever we decide on, I'd like to see it as a standalone library first, so we can at least get some experience with it. I'm really more suggesting utf8-string as a model -- we don't need to rip apart base to get this done.
Those two functions (and appendFile) need to be made portable by using openBinaryFile instead of openFile, and then they could be recommended. The handle versions are only safe for binary handles; it's a pity that these don't have a different type. I think that we do need, in base, a type distinct from Handle that offers a Word8 interface to binary I/O, as a foundation for these experiments. (Let's not call them "Handle" this time.) Experiments with file operations and operations on binary handles could proceed independently, but the standard text handles (stdin, stdout and stderr) can't be peeled off. Fixing them has to be done inside base. Would hard-wiring them as UTF-8 be any worse than hard-wiring them as Latin-1 (which GHC does now)?
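A sketch of the distinct binary-handle type being suggested (all names hypothetical), relying on the fact that on a binary-mode Handle each Char is a single octet:

    import Data.Word (Word8)
    import System.IO (Handle, IOMode, openBinaryFile, hGetChar, hPutChar)

    -- Wrap a Handle known to be in binary mode and expose only a
    -- Word8 interface, so text and binary I/O cannot be mixed up.
    newtype BinaryHandle = BinaryHandle Handle

    openBinary :: FilePath -> IOMode -> IO BinaryHandle
    openBinary path mode = fmap BinaryHandle (openBinaryFile path mode)

    hGetByte :: BinaryHandle -> IO Word8
    hGetByte (BinaryHandle h) = fmap (fromIntegral . fromEnum) (hGetChar h)

    hPutByte :: BinaryHandle -> Word8 -> IO ()
    hPutByte (BinaryHandle h) b = hPutChar h (toEnum (fromIntegral b))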

On Mon, 2008-02-25 at 13:26 -0800, Don Stewart wrote:
duncan.coutts:
Why don't we use the existing, portable UTF8 IO package?
http://hackage.haskell.org/packages/archive/utf8-string/0.2/doc/html/System-...
But it's not portable. It does exactly the non-portable things that Ross was just complaining about. It also interprets 4- and 5-byte UTF-8 forms and just substitutes replacement chars, without any option for strict conversion. The conversion code is also slow: it does more list operations than necessary and more safety-checked chr operations than it needs. Duncan
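A sketch of the stricter policy being asked for, rejecting malformed input instead of substituting replacement characters (hypothetical code, not utf8-string's; note it caps decoding at 3-byte forms, i.e. the BMP, even though real UTF-8 does have valid 4-byte forms, and it still does not check for overlong encodings):

    import Data.Bits (shiftL, (.&.))
    import Data.Char (chr, ord)

    -- Decode octets (Chars 0-255) strictly: any malformed sequence is
    -- reported as an error rather than papered over.
    fromUTF8Strict :: String -> Either String String
    fromUTF8Strict [] = Right []
    fromUTF8Strict (c:cs)
      | b < 0x80  = cons c (fromUTF8Strict cs)
      | b < 0xC0  = Left "unexpected continuation byte"
      | b < 0xE0  = multi 1 (b .&. 0x1F) cs
      | b < 0xF0  = multi 2 (b .&. 0x0F) cs
      | otherwise = Left "lead byte of a 4+ byte form"
      where
        b = ord c
        cons x = fmap (x :)
        multi n acc rest
          | length pre == n && all isCont pre =
              cons (chr (foldl step acc (map ord pre))) (fromUTF8Strict post)
          | otherwise = Left "truncated or malformed sequence"
          where (pre, post) = splitAt n rest
        isCont x = let o = ord x in o >= 0x80 && o < 0xC0
        step acc o = (acc `shiftL` 6) + (o .&. 0x3F)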

On Mon, Feb 25, 2008 at 09:07:08PM +0000, Duncan Coutts wrote:
It's no use pretending that readFile returns Unicode, it just doesn't (except on Hugs which does it properly). GHC is not going to catch up on this any time soon.
On the contrary, it's the only way to stay sane. readFile does return Unicode, it just doesn't read UTF. Putting compensating bugs in the libraries is only going to make it harder for GHC to change.
If we open the files in binary mode we don't get the cr/lf line conversion on Windows and we'd have to do that ourselves. Perhaps that's the way to go.
I think we've been ignoring CRs in .cabal files ever since we had to deal with tar files built on Windows and unpacked on Unix.
As for stdout/stderr we're just stuffed. We cannot reopen them in binary mode and hugs and ghc have different and incompatible behaviour. We either end up double encoding with hugs or not decoding with ghc. There is no single method that works with both. We'd have to switch on the system in use.
My suggestion is to just write Chars to these Handles, even though text handles in GHC currently only work in an ISO-8859-1 locale. That's what the other libraries in your program will be doing with those handles, and they're not wrong: the other way lies madness. Is switching the standard text handles to UTF really an impossibly remote prospect?

On Mon, 2008-02-25 at 21:49 +0000, Ross Paterson wrote:
On Mon, Feb 25, 2008 at 09:07:08PM +0000, Duncan Coutts wrote:
It's no use pretending that readFile returns Unicode, it just doesn't (except on Hugs which does it properly). GHC is not going to catch up on this any time soon.
On the contrary, it's the only way to stay sane. readFile does return Unicode, it just doesn't read UTF. Putting compensating bugs in the libraries is only going to make it harder for GHC to change.
True. In fact it'll never help, because it's not specified what encoding it should use, but we want to use one specific encoding. For printing to stdout we would want to use some future improved standard text handle, but for reading .cabal files we're specifying that they are utf-8, irrespective of the current locale.
If we open the files in binary mode we don't get the cr/lf line conversion on Windows and we'd have to do that ourselves. Perhaps that's the way to go.
I think we've been ignoring CRs in .cabal files ever since we had to deal with tar files built on Windows and unpacked on Unix.
So if we use files opened in binary mode and account for line end differences then this is portable and doesn't make it harder for GHC to switch text handles to use a more sensible encoding. I'll push patches to do this.
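Putting those two decisions together, the read path would look something like this sketch (reusing the hypothetical normaliseLineEndings from earlier in the thread, and Cabal's fromUTF8):

    import System.IO (IOMode(ReadMode), openBinaryFile, hGetContents)
    import Distribution.Simple.Utils (fromUTF8)  -- assumed export, circa 2008 Cabal

    -- Portable UTF-8 file reading: independent of the host system's
    -- line endings and of GHC's text-handle encoding.
    readUTF8File :: FilePath -> IO String
    readUTF8File path = do
      h <- openBinaryFile path ReadMode
      fmap (fromUTF8 . normaliseLineEndings) (hGetContents h)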
As for stdout/stderr we're just stuffed. We cannot reopen them in binary mode and hugs and ghc have different and incompatible behaviour. We either end up double encoding with hugs or not decoding with ghc. There is no single method that works with both. We'd have to switch on the system in use.
My suggestion is to just write Chars to these Handles, even though text handles in GHC currently only work in an ISO-8859-1 locale.
Well, it's not about the locale we're in; it's whether we restrict ourselves to printing only ISO-8859-1 chars, and we know we need more than that.
That's what the other libraries in your program will be doing with those handles, and they're not wrong: the other way lies madness.
It doesn't actually change the fact that our error messages will print garbage when they include snippets of a .cabal file that contained non-ISO-8859-1 chars.
Is switching the standard text handles to UTF really an impossibly remote prospect?
I'm not sure really. Perhaps we can raise it on haskell-cafe and/or libraries. I think the resistance at GHC HQ is not the difficulty but the fear of breaking things and upsetting people. If there were an obvious consensus that fear might be allayed. Duncan

On Tue, Feb 26, 2008 at 09:30:41AM +0000, Duncan Coutts wrote:
So if we use files opened in binary mode and account for line end differences then this is portable and doesn't make it harder for GHC to switch text handles to use a more sensible encoding.
Yes. We have to handle line-endings independently of the host system anyway, because the files we're reading could have been created on a different system.
It doesn't actually change the fact that our error messages will print garbage when they include snippets of a .cabal file that contained non-ISO-8859-1 chars.
Yes, because GHC's text handles cannot cope with such characters. So instead of trying to patch around that, how about just replacing these characters with "???" or "" on output (if compiling with GHC)? It's not pretty, but hopefully it's temporary, and these are only error messages we're talking about.
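That workaround is a one-liner; a sketch (toLatin1Lossy is a hypothetical name):

    -- Squash characters GHC's Latin-1-only text handles would truncate
    -- anyway, so the loss is at least visible and deliberate.
    toLatin1Lossy :: String -> String
    toLatin1Lossy = map (\c -> if fromEnum c > 0xFF then '?' else c)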

On Mon, 2008-02-25 at 21:49 +0000, Ross Paterson wrote:
On Mon, Feb 25, 2008 at 09:07:08PM +0000, Duncan Coutts wrote:
It's no use pretending that readFile returns Unicode, it just doesn't (except on Hugs which does it properly). GHC is not going to catch up on this any time soon.
On the contrary, it's the only way to stay sane. readFile does return Unicode, it just doesn't read UTF. Putting compensating bugs in the libraries is only going to make it harder for GHC to change.
My suggestion is to just write Chars to these Handles, even though text handles in GHC currently only work in an ISO-8859-1 locale. That's what the other libraries in your program will be doing with those handles, and they're not wrong: the other way lies madness.
So that's basically what I've done in the most recent patches. I pretend that read/writeFile and putStr etc work for text in the current locale encoding. For files we know are specifically UTF8, because we declare that to be the case (like .cabal and .hs), we now use to/fromUTF8 and openBinaryFile. Hmm, having said that, we're not yet treating line endings in .hs files correctly on Windows. Sigh.
Is switching the standard text handles to UTF really an impossibly remote prospect?
Seems not :-) Duncan
participants (5)
- Don Stewart
- Duncan Coutts
- Ian Lynagh
- Johan Tibell
- Ross Paterson