Advance notice that I'd like to make Cabal depend on parsec

Hi folks,
I want to give you advance notice that I would like to make Cabal depend on parsec. The implication is that GHC would therefore depend on parsec and thus it would become a core package, rather than just a HP package. So this would affect both GHC and the HP, though I hope not too much.
The rationale is that Cabal needs to parse things, like .cabal files and currently we do not have a decent parser in the core libraries. By decent I mean one that can produce error messages with source locations and that doesn't have unpredictable memory use. The only parser in the core libraries at the moment is Text.ParserCombinators.ReadP from the base package and that fails my "decent" criteria on both counts. Its idea of an error message is (), and on some largish .cabal files we take 100s of MB to parse (I realise that the ReadP in the base package is a cutdown version so I don't mean to malign all ReadP-style libs out there).
Partly due to the performance problem, the terrible .cabal file error messages, and partly because Doaitse Swierstra keeps asking me if .cabal files have a grammar, I've been writing a new .cabal parser. It uses an alex lexer and a parsec parser. It's fast and the error messages are pretty good. I have reverse engineered a grammar that closely matches the existing parser and .cabal files in the wild, though I'm not sure Doaitse will be satisfied with the approach I've taken to handling layout.
Why did I choose parsec? Practicality dictates that I can only use things in the core libraries, and the nearest thing we have to that is the parser lib that is in the HP. I tried to use happy but I could not construct a grammar/lexer combo to handle the layout (also, happy is not exactly known for its great error messages).
I've been doing regression testing against hackage and I'm satisfied that the new parser matches close enough. I've uncovered all kinds of horrors with .cabal files in the wild relying on quirks of the old parser. I've made adjustments for most of them but I will be breaking a half dozen old packages (most of those don't actually build correctly because though their syntax errors are not picked up by the parser, they do cause failure eventually).
So far I've just done the outline parser, not the individual field parsers. I'll be doing those next and then integrate. So this change is still a bit of a ways off, but I thought it'd be useful to warn people now.
Duncan
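The "error message is ()" complaint can be seen in a minimal sketch using only Text.ParserCombinators.ReadP from base. The `field` parser and the `parseField` helper are made up for illustration and are not taken from Cabal; they handle a single "name: value" line and nothing more. When the input is malformed, `readP_to_S` just returns an empty list, so the caller learns neither what went wrong nor where.

```haskell
import Data.Char (isAlphaNum)
import Text.ParserCombinators.ReadP

-- | Parse one "name: value" line. This is a deliberately tiny,
-- hypothetical subset of the .cabal field syntax.
field :: ReadP (String, String)
field = do
  name <- munch1 (\c -> isAlphaNum c || c == '-')
  skipSpaces
  _ <- char ':'
  skipSpaces
  value <- munch1 (/= '\n')
  eof
  return (name, value)

-- | Run the parser and keep only parses that consumed all input.
-- Failure is an empty result list: no message, no source position.
parseField :: String -> Maybe (String, String)
parseField input =
  case [r | (r, "") <- readP_to_S field input] of
    (r : _) -> Just r
    []      -> Nothing
```

With this sketch, `parseField "name: foo"` gives `Just ("name", "foo")`, while `parseField "name foo"` gives `Nothing` and no hint that a colon was expected.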

On Thu, Mar 14, 2013 at 3:53 PM, Duncan Coutts wrote:
Hi folks,
I want to give you advance notice that I would like to make Cabal depend on parsec. The implication is that GHC would therefore depend on parsec and thus it would become a core package, rather than just a HP package. So this would affect both GHC and the HP, though I hope not too much.
+1 from me, although the amount of potential knock-on work might be discouraging. The current cabal-install bootstrap process (which is currently pretty easy and is necessary at times) will get a bunch more deps as a result of this change, no?
--
Gregory Collins

On Thu, 2013-03-14 at 16:06 +0100, Gregory Collins wrote:
On Thu, Mar 14, 2013 at 3:53 PM, Duncan Coutts wrote:
Hi folks,
I want to give you advance notice that I would like to make Cabal depend on parsec. The implication is that GHC would therefore depend on parsec and thus it would become a core package, rather than just a HP package. So this would affect both GHC and the HP, though I hope not too much.
+1 from me, although the amount of potential knock-on work might be discouraging. The current cabal-install bootstrap process (which is currently pretty easy and is necessary at times) will get a bunch more deps as a result of this change, no?
Yes it will, but given that we do have a script it's not too bad I think. And overall I think it's worth it to have the better error messages, performance and memory use. Do you have any idea how slow it is to parse all the .cabal files on hackage, and how much memory that takes? You'd be horrified :-)
Duncan

On 14 March 2013 22:53, Duncan Coutts wrote:
I've been doing regression testing against hackage and I'm satisfied that the new parser matches close enough. I've uncovered all kinds of horrors with .cabal files in the wild relying on quirks of the old parser. I've made adjustments for most of them but I will be breaking a half dozen old packages
When you say you've "made adjustments for" dodgy .cabal files in the wild, do you mean that you'll send those maintainers patches that make their cabal files less dodgy, or do you mean you've added hacks to your parser to reproduce the quirky behaviour? Conrad.

On Fri, 2013-03-15 at 12:37 +0800, Conrad Parker wrote:
On 14 March 2013 22:53, Duncan Coutts wrote:
I've been doing regression testing against hackage and I'm satisfied that the new parser matches close enough. I've uncovered all kinds of horrors with .cabal files in the wild relying on quirks of the old parser. I've made adjustments for most of them but I will be breaking a half dozen old packages
When you say you've "made adjustments for" dodgy .cabal files in the wild, do you mean that you'll send those maintainers patches that make their cabal files less dodgy, or do you mean you've added hacks to your parser to reproduce the quirky behaviour?
The latter, but the egregiousness of the hacks is actually not too bad in the end. I don't find it revolting. For the worst examples I didn't make adjustments and those ones will break. I think I've made a reasonable judgement about where to draw the line between the two. I can look into generating warnings in those cases (which is probably better than me emailing them).
Duncan

On 14 Mar 2013, at 14:53, Duncan Coutts wrote:
Why did I choose parsec? Practicality dictates that I can only use things in the core libraries, and the nearest thing we have to that is the parser lib that is in the HP.
I fully agree that a real parser is needed for Cabal files. I implemented one myself, many years ago, using the polyparse library, and using a hand-written lexer. Feel free to reuse it (attached, together with a sample program) if you like, although I expect it has bit-rotted a little over time. Regards, Malcolm

On Fri, 2013-03-15 at 12:57 +0000, Malcolm Wallace wrote:
On 14 Mar 2013, at 14:53, Duncan Coutts wrote:
Why did I choose parsec? Practicality dictates that I can only use things in the core libraries, and the nearest thing we have to that is the parser lib that is in the HP.
I fully agree that a real parser is needed for Cabal files. I implemented one myself, many years ago, using the polyparse library, and using a hand-written lexer. Feel free to reuse it (attached, together with a sample program) if you like, although I expect it has bit-rotted a little over time.
Thanks Malcolm. I should point out that I would also be perfectly happy to use polyparse. The practical constraint is that Cabal can only depend on other Core libs. My assumption was that moving parsec from HP to core was easier than adding polyparse into core. But if someone wanted to suggest ripping ReadP out of base and replacing it with polyparse, I would certainly not complain. Duncan

I'd love to have a proper parser and source-location-aware AST for sake of
editor/IDE tools, so +1 from me. If you don't end up doing this after all,
I'd still like to see your parser in a separate package, although I
understand if you don't feel like maintaining two parsers especially given
the tedious process for verifying they work similarly. I guess it could
still be useful in the same way we find haskell-src-exts useful despite
some incompatibilities with GHC.
_______________________________________________
cabal-devel mailing list
cabal-devel@haskell.org
http://www.haskell.org/mailman/listinfo/cabal-devel

This thread is raising all sorts of questions for me:
Is it essential, or even sensical, that the serialization format GHC needs for storing package info bear any relation to the human authored form? If not, the split out of the package types could be accomplished in a way where GHC uses simple show/read(P) style serialization for storage of package info, where as cabal-lib would use a lovely parsec parser for humans. I'd like this approach.
The issue of putting yet one more HP package into GHC's core packages is increasing the exposure of the difficulty of the current GHC/HP relationship. See also the threads on HP's mailing list about why we can't bump some packages in GHC's core set for the next HP release.
The split arrangement is strange because we have two groups making up what is in the HP, but they have different processes and aims. The complex technical relationship between the moving parts only heightens the difficulty.
Perhaps the major cause is that because GHC is shipped as a library itself, it exposes all its package dependencies. And as it is a large, and growing, piece of software, the list only wants to grow.
But I wonder how often GHC is used as a library itself? If not often, then perhaps GHC should be shipped as two parts: just a compiler (plus the small number of packages that the compiler forces), and ghc-lib as an optional, even separate, package - perhaps one with even a traditional way of depending on other packages. In other words, users that wanted to incorporate the ghc-lib into their programs would depend on, download, configure, and build ghc-lib independently of the GHC binaries installed on their system.
Perhaps then, GHC, the compiler, built from ghc-lib, would be bootstrapped not from the past compiler, but from the past HP.....
Okay, perhaps that is all just fantasy. But, no other programming system operates the way we do. They all fall into one of two camps:
- The dominant implementation is maintained, built, and shipped along with a large collection of "common packages". Examples: Python, Ruby, PHP, Java.
- The dominant implementation is shipped as a bare tool, and large common libraries are maintained and shipped independently. Examples: C++ (think g++ and boost), JavaScript (think browsers, and jQuery).
We are in the middle and, I think, experiencing growing pains because of it.
- Mark
On Sat, Mar 16, 2013 at 3:42 PM, dag.odenhall@gmail.com <dag.odenhall@gmail.com> wrote:
I'd love to have a proper parser and source-location-aware AST for sake of editor/IDE tools, so +1 from me. If you don't end up doing this after all, I'd still like to see your parser in a separate package, although I understand if you don't feel like maintaining two parsers especially given the tedious process for verifying they work similarly. I guess it could still be useful in the same way we find haskell-src-exts useful despite some incompatibilities with GHC.

On Sun, Mar 17, 2013 at 09:57:25AM -0700, Mark Lentczner wrote:
Is it essential, or even sensical, that the serialization format GHC needs for storing package info bear any relation to the human authored form? If not, the split out of the package types could be accomplished in a way where GHC uses simple show/read(P) style serialization for storage of package info, where as cabal-lib would use a lovely parsec parser for humans. I'd like this approach.
I think it would be feasible to stop GHC itself from using the human readable format. The only place I can think of it being used is in the package database, but we could use either Read/Show for that, or just exclusively use the binary format. It would be a little less user-friendly, but maybe worth it to remove the ghc library dependencies on most-of-Cabal, mtl and parsec.
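As a rough illustration of the Read/Show option for the package database, the following minimal sketch round-trips a record through derived instances. The `PkgInfo` type, its fields, and the `serialise`/`deserialise` names are hypothetical; the real InstalledPackageInfo carries many more fields, and nothing here is taken from GHC or Cabal.

```haskell
import Text.Read (readMaybe)

-- | Hypothetical, heavily cut-down stand-in for an installed-package
-- record. The derived Show/Read instances are the whole "format".
data PkgInfo = PkgInfo
  { pkgName    :: String
  , pkgVersion :: [Int]
  , exposed    :: Bool
  , depends    :: [String]
  } deriving (Eq, Show, Read)

-- | Write a record out using the derived Show instance.
serialise :: PkgInfo -> String
serialise = show

-- | Read a record back; Nothing if the text is not a valid PkgInfo.
deserialise :: String -> Maybe PkgInfo
deserialise = readMaybe
```

The appeal is that no parser library is needed beyond base. The cost, under this assumption, is that the on-disk text is tied to the Haskell record syntax rather than to the field-per-line format that people and scripts edit.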
Perhaps the major cause is that because GHC is shipped as a library itself, it exposes all its package dependencies.
Yes.
In other words, users that wanted to incorporate the ghc-lib into their programs would depend on, download, configure, and build ghc-lib independently of the GHC binaries
I think this would create more problems than it solves.
Okay, perhaps that is all just fantasy. But, no other programming system operates the way we do. They all fall into one of two camps:
- The dominant implementation is maintained, built, and shipped along with a large collection of "common packages". Examples: Python, Ruby, PHP, Java.
- The dominant implementation is shipped as a bare tool, and large common libraries are maintained and shipped independently. Examples: C++ (think g++ and boost), JavaScript (think browsers, and jQuery).
We are in the middle and, I think, experiencing growing pains because of it.
I would say that we are doing the first option, in the form of the HP. It's just that the core gets frozen (i.e., ghc + libs gets released) earlier than the higher level libraries. I don't think that moving (back) to trying to freeze/release everything all at once would be an improvement.
You just need to remain strong, and keep saying "no" :-) (you're doing a great job, BTW!)
Thanks
Ian

On Sun, 17 Mar 2013, Ian Lynagh wrote:
I think it would be feasible to stop GHC itself from using the human readable format. The only place I can think of it being used is in the package database, but we could use either Read/Show for that, or just exclusively use the binary format.
I already needed the human readable format in order to check what information a custom configure file generated.

On Sun, Mar 17, 2013 at 09:04:58PM +0100, Henning Thielemann wrote:
On Sun, 17 Mar 2013, Ian Lynagh wrote:
I think it would be feasible to stop GHC itself from using the human readable format. The only place I can think of it being used is in the package database, but we could use either Read/Show for that, or just exclusively use the binary format.
I already needed the human readable format in order to check what information a custom configure file generated.
You can use "ghc-pkg describe p" for that.
I don't think you should ever need the human readable format unless you need to alter the package database by hand.
--
Ian Lynagh, Haskell Consultant
Well-Typed LLP, http://www.well-typed.com/

On Sun, 17 Mar 2013, Ian Lynagh wrote:
On Sun, Mar 17, 2013 at 09:04:58PM +0100, Henning Thielemann wrote:
On Sun, 17 Mar 2013, Ian Lynagh wrote:
I think it would be feasible to stop GHC itself from using the human readable format. The only place I can think of it being used is in the package database, but we could use either Read/Show for that, or just exclusively use the binary format.
I already needed the human readable format in order to check what information a custom configure file generated.
You can use "ghc-pkg describe p" for that.
I don't think you should ever need the human readable format unless you need to alter the package database by hand.
I think I also altered these package descriptions in order to check what the correct content should be.

Is there any downside to using a human readable Read/Show instance for the ghc package database serialization piece? Adding parsec to the set of packages baked into ghc core would have pretty strong implications for which versions of other packages (where applicable) can be built with ghc...
*Zooming out a teeny bit*... wasn't there some recent discussion on making it easier to have ghc coherently cope with multiple versions of a library being installed? I feel like this parsec inclusion would be a lot *less* contentious once that's been done, because then the ghc usage of the parsec library could be private / hidden from end users (possibly?)
cheers
-Carter
On Sun, Mar 17, 2013 at 4:20 PM, Henning Thielemann <lemming@henning-thielemann.de> wrote:
On Sun, 17 Mar 2013, Ian Lynagh wrote:
On Sun, Mar 17, 2013 at 09:04:58PM +0100, Henning Thielemann wrote:
On Sun, 17 Mar 2013, Ian Lynagh wrote:
I think it would be feasible to stop GHC itself from using the human readable format. The only place I can think of it being used is in the package database, but we could use either Read/Show for that, or just exclusively use the binary format.
I already needed the human readable format in order to check what information a custom configure file generated.
You can use "ghc-pkg describe p" for that.
I don't think you should ever need the human readable format unless you need to alter the package database by hand.
I think I also altered these package descriptions in order to check what the correct content should be.
_______________________________________________
ghc-devs mailing list
ghc-devs@haskell.org
http://www.haskell.org/mailman/listinfo/ghc-devs

On Sun, 2013-03-17 at 21:04 +0100, Henning Thielemann wrote:
On Sun, 17 Mar 2013, Ian Lynagh wrote:
I think it would be feasible to stop GHC itself from using the human readable format. The only place I can think of it being used is in the package database, but we could use either Read/Show for that, or just exclusively use the binary format.
I already needed the human readable format in order to check what information a custom configure file generated.
Or more generally, the classic way to make the pkg info if you were not using the "simple" cabal build system, but were using configure + make (e.g. wrapped in the cabal "make" build-type) was to generate the input file using configure/m4 text substitutions. So that did/does need to be human readable. As for the binary format, that's ghc's internal representation and not something I think we would want to standardise between Haskell implementations. Note that other Haskell impls use a package database that just uses these human readable files, with no hc-pkg style program. Duncan

Hi,
On Sunday, 17.03.2013, at 21:04 +0100, Henning Thielemann wrote:
On Sun, 17 Mar 2013, Ian Lynagh wrote:
I think it would be feasible to stop GHC itself from using the human readable format. The only place I can think of it being used is in the package database, but we could use either Read/Show for that, or just exclusively use the binary format.
I already needed the human readable format in order to check what information a custom configure file generated.
the Debian packaging scripts modify the package data:
    $(if $(HASKELL_HIDE_PACKAGES),sed -i 's/^exposed: True$$/exposed: False/' $$pkg_config;) \
and also parse data from not-yet-registered package files, and I think they also change paths somewhere. For all that plumbing stuff, human readable file formats are very convenient.
Greetings,
Joachim
--
Joachim "nomeata" Breitner
Debian Developer
nomeata@debian.org | ICQ# 74513189 | GPG-Keyid: 4743206C
JID: nomeata@joachim-breitner.de | http://people.debian.org/~nomeata
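To make the plumbing argument concrete, here is a rough Haskell counterpart of the sed substitution quoted above. The `hidePackage` name is made up for illustration and this is not part of any Debian tooling; it only shows that a line-oriented text format can be rewritten without a dedicated parser.

```haskell
-- | Flip a line that is exactly "exposed: True" to "exposed: False",
-- mirroring the anchored sed pattern ^exposed: True$. All other
-- lines pass through unchanged. Note that 'unlines' always ends the
-- result with a newline, even if the input lacked a final one.
hidePackage :: String -> String
hidePackage = unlines . map rewrite . lines
  where
    rewrite "exposed: True" = "exposed: False"
    rewrite other           = other
```

The same edit against a binary cache, or against Read/Show output, would need a program that understands the record structure.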

On Sun, 2013-03-17 at 19:27 +0000, Ian Lynagh wrote:
On Sun, Mar 17, 2013 at 09:57:25AM -0700, Mark Lentczner wrote:
Is it essential, or even sensical, that the serialization format GHC needs for storing package info bear any relation to the human authored form? If not, the split out of the package types could be accomplished in a way where GHC uses simple show/read(P) style serialization for storage of package info, where as cabal-lib would use a lovely parsec parser for humans. I'd like this approach.
I think it would be feasible to stop GHC itself from using the human readable format. The only place I can think of it being used is in the package database, but we could use either Read/Show for that, or just exclusively use the binary format.
The change in functionality to enable that would be that the binary "cache" would always have to be up to date, so ghc would only ever have to read the cache and never have to read the human-readable package files. Then you can have ghc-pkg depend on Cabal and use that for the human-readable bits, but since that's a program then it doesn't expose the Cabal lib dependency. Then ghc (and hence the ghc lib) would not depend on Cabal, but it would need a copy of the InstalledPackageInfo type and the other types that it uses. Duncan

On Mon, Mar 18, 2013 at 12:34:16PM +0000, Duncan Coutts wrote:
Then you can have ghc-pkg depend on Cabal and use that for the human-readable bits, but since that's a program then it doesn't expose the Cabal lib dependency. Then ghc (and hence the ghc lib) would not depend on Cabal, but it would need a copy of the InstalledPackageInfo type and the other types that it uses.
Right, exactly. But we don't want to have 2 copies of the types, so could we move them into a Cabal-datatypes package which can be shared by both Cabal and GHC please? Thanks Ian

On Mon, 2013-03-18 at 12:43 +0000, Ian Lynagh wrote:
On Mon, Mar 18, 2013 at 12:34:16PM +0000, Duncan Coutts wrote:
Then you can have ghc-pkg depend on Cabal and use that for the human-readable bits, but since that's a program then it doesn't expose the Cabal lib dependency. Then ghc (and hence the ghc lib) would not depend on Cabal, but it would need a copy of the InstalledPackageInfo type and the other types that it uses.
Right, exactly. But we don't want to have 2 copies of the types, so could we move them into a Cabal-datatypes package which can be shared by both Cabal and GHC please?
That would be a rather annoying split. The cabal-lib package itself is supposed to be just types + parsers + pretty printers (& related utils). It'd end up looking like:
cabal-types:
  types: InstalledPackageInfo, PackageName, Version, PackageId, InstalledPackageId, License
cabal-lib:
  parser for InstalledPackageInfo, PackageName, Version, PackageId, InstalledPackageId, License
  modules Distribution.*
cabal-build-simple:
  modules Distribution.Simple.*
It's not as if one could frame this as a "the aspects of the Cabal spec that compilers need" because the other impls will want the parser + printers as well.
Duncan

Hello,
To me it seems that the dependency here is incorrect---as far as I
understand, GHC does not need to parse Cabal files, so it should not depend
on the code and the library to do so.
Furthermore, what is the overall architecture of the whole thing? My
understanding has been that each implementation should have its own notion
of a "package", and cabal simply has support for working with the package
formats for each implementation. Thus, it seems that the package types and
code for (de)serializing them should be in the implementation (i.e., GHC),
not Cabal. I can see that it might make sense to have a common
representation about package meta-data (e.g., names, versions, license,
etc.), so perhaps these should all go in a separate package. This looks a
bit like the `cabal-types` that Duncan suggested, but I'd imagine that
Cabal would need more types than just package meta-data so this is not an
ideal name.
Finally, I agree that a "real" parser is good, but do you really want to
write it using Parsec? A sensible alternative would be to write a Happy
grammar. Having an actual grammar would both benefit users of the system,
and it would avoid the dependency on all those packages. My experience of
having to maintain some largish Parsec (and in general, combinator based)
parsers, is that over the years the parsers get more and more complex, and
are quite hard to maintain.
-Iavor

On Mon, 2013-03-18 at 09:32 -0700, Iavor Diatchki wrote:
Hello,
To me it seems that the dependency here is incorrect---as far as I understand, GHC does not need to parse Cabal files, so it should not depend on the code and the library to do so.
Yes, GHC does not parse .cabal files.
Furthermore, what is the overall architecture of the whole thing? My understanding has been that each implementation should have its own notion of a "package", and cabal simply has support for working with the package formats for each implementation. Thus, it seems that the package types and code for (de)serializing them should be in the implementation (i.e., GHC), not Cabal. I can see that it might make sense to have a common representation about package meta-data (e.g., names, versions, license, etc.), so perhaps these should all go in a separate package. This looks a bit like the `cabal-types` that Duncan suggested, but I'd imagine that Cabal would need more types than just package meta-data so this is not an ideal name.
The Cabal spec defines a few things that all Haskell implementations are supposed to support/provide. This covers the notion of an installed package (not .cabal source packages). It defines what it is, the meta-data that is stored and the format in which the implementation should accept it (ie the input format for ghc-pkg). So it's a compiler independent notion and the natural place to put the code to support it was the Cabal lib, so that's what happened. We can split that off into another package but it still makes sense for that package to provide a parser and pretty printer (because implementations have to accept them in the external format). So the natural way to partition things doesn't help us avoid ghc depending on a parser lib.
Finally, I agree that a "real" parser is good, but do you really want to write it using Parsec? A sensible alternative would be to write a Happy grammar. Having an actual grammar would both benefit users of the system, and it would avoid the dependency on all those packages. My experience of having to maintain some largish Parsec (and in general, combinator based) parsers, is that over the years the parsers get more and more complex, and are quite hard to maintain.
I'm somewhat repeating other parts of the thread at this point (see the discussion on "why not happy"), but I'd like to point out again that it is not simply the outline parser for cabal-style files that we're talking about. We also need parsers/pretty printers for all the various little types that make up the info about packages, like versions, package names, package ids, version constraints, module names, licenses etc etc.
In the Cabal lib we have a type class with parser and pretty printer and all these various little types are instances. That stuff pretty much has to use a combinator lib because it needs to be compositional. You need to be able to reuse the version number parser in other parsers, or programs like cabal-install or the hackage-server need to reuse them as part of other parsers in their command line UI or config files, urls etc. So we have to use a combinator lib anyway.
We currently also use this for parsing the fields of .cabal files and the InstalledPackageInfo. Currently we use ReadP from the base library which produces no error messages and has pretty bad performance in some cases (exploding memory use). It's that part particularly that should be using something like parsec. Happy simply isn't an option there. I could possibly use happy for the outline parser but it wouldn't buy us much.
Duncan
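The type-class arrangement described in this message can be sketched roughly as follows, using ReadP from base. This is a simplified illustration and not the Cabal library's actual class: the `Text` class as written here, the cut-down `Version`, `PackageName` and `PackageId` types, and the `simpleParse` helper are all assumptions made for the example. The point is only that the `PackageId` parser is assembled from the two smaller instances.

```haskell
import Data.Char (isAlphaNum, isDigit)
import Data.List (intercalate)
import Text.ParserCombinators.ReadP

-- | Hypothetical class pairing a pretty printer with a parser.
class Text a where
  disp  :: a -> String
  parse :: ReadP a

newtype Version     = Version [Int]        deriving (Eq, Show)
newtype PackageName = PackageName String   deriving (Eq, Show)
data PackageId      = PackageId PackageName Version deriving (Eq, Show)

instance Text Version where
  disp (Version ns) = intercalate "." (map show ns)
  parse = Version <$> sepBy1 (read <$> munch1 isDigit) (char '.')

instance Text PackageName where
  disp (PackageName n) = n
  parse = PackageName . intercalate "-"
            <$> sepBy1 (munch1 isAlphaNum) (char '-')

-- | Built entirely from the two smaller parsers above.
instance Text PackageId where
  disp (PackageId n v) = disp n ++ "-" ++ disp v
  parse = PackageId <$> parse <*> (char '-' *> parse)

-- | Keep the first parse that consumes the whole input, if any.
simpleParse :: Text a => String -> Maybe a
simpleParse s =
  case [x | (x, "") <- readP_to_S parse s] of
    (x : _) -> Just x
    []      -> Nothing
```

In this sketch, `simpleParse "foo-bar-1.2.3" :: Maybe PackageId` yields a `PackageId` with name "foo-bar" and version 1.2.3, and `disp` turns it back into the same string. A failed parse is still just `Nothing`, which is the ReadP limitation the message complains about.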

On 03/18/2013 12:55 PM, Duncan Coutts wrote:
[...] it is not simply the outline parser for cabal-style files that we're talking about. We also need parsers/pretty printers for all the various little types that make up the info about packages, like versions, package names, package ids, version constraints, module names, licenses etc etc.
(ignorant musing that doesn't help the general difficulty of writing a Happy parser: ) Can they not use multiple Happy parsers generated from the same Happy file? http://www.haskell.org/happy/doc/html/sec-multiple-parsers.html -Isaac

On Thu, 2013-03-21 at 17:51 -0400, Isaac Dupree wrote:
On 03/18/2013 12:55 PM, Duncan Coutts wrote:
[...] it is not simply the outline parser for cabal-style files that we're talking about. We also need parsers/pretty printers for all the various little types that make up the info about packages, like versions, package names, package ids, version constraints, module names, licenses etc etc.
(ignorant musing that doesn't help the general difficulty of writing a Happy parser: ) Can they not use multiple Happy parsers generated from the same Happy file? http://www.haskell.org/happy/doc/html/sec-multiple-parsers.html
Well the compositionality is there for the benefit of other packages, not just as an internal convenience for the Cabal lib. If we dropped that feature then yes we could use monolithic parsers for each of these types. Other packages do use the ability to build new parsers out of old however, in particular cabal-install does. Duncan

Is it essential, or even sensical, that the serialization format GHC needs for storing package info bear any relation to the human authored form? If not, the split out of the package types could be accomplished in a way where GHC uses simple show/read(P) style serialization for storage of package info, whereas cabal-lib would use a lovely parsec parser for humans. I'd like this approach.
Good idea -- esp if it makes the packaging story simpler. GHC already uses a binary format for interface files, so there’s no good reason to use a human-readable format for package data base stuff. For interface files you can read them with ghc --show-iface, and as Ian remarks something similar is already true for the package data base.
Simon
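A minimal sketch of what "simple show/read style serialization" could look like, assuming a heavily trimmed, hypothetical record in place of the real installed-package metadata. The type `InstalledInfo`, its fields, and the helpers `store` and `load` are made up for illustration and are not GHC's or Cabal's actual types. It relies only on derived `Show`/`Read` instances and `Text.Read.readMaybe`, both in base.

```haskell
import Text.Read (readMaybe)

-- Hypothetical stand-in for the metadata of an installed package.
data InstalledInfo = InstalledInfo
  { pkgName    :: String
  , pkgVersion :: [Int]
  , exposed    :: [String]
  , depends    :: [String]
  } deriving (Eq, Show, Read)

-- Serialise for storage using the derived Show instance.
store :: InstalledInfo -> String
store = show

-- Read back using the derived Read instance; Nothing on malformed input.
load :: String -> Maybe InstalledInfo
load = readMaybe

example :: InstalledInfo
example = InstalledInfo
  { pkgName    = "parsec"
  , pkgVersion = [3, 1, 3]
  , exposed    = ["Text.Parsec"]
  , depends    = ["base", "mtl"]
  }
```

Here `load (store example)` returns `Just example`. Such a format is machine-oriented: it gives no useful error messages, which is acceptable for a store written only by tools but not for hand-edited .cabal files.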
From: ghc-devs-bounces@haskell.org [mailto:ghc-devs-bounces@haskell.org] On Behalf Of Mark Lentczner
Sent: 17 March 2013 16:57
To: dag.odenhall@gmail.com
Cc: Haskell Libraries; cabal-devel; Duncan Coutts; ghc-devs@haskell.org; Antoine Latter
Subject: Re: Advance notice that I'd like to make Cabal depend on parsec
This thread is raising all sorts of questions for me:
Is it essential, or even sensical, that the serialization format GHC needs for storing package info bear any relation to the human authored form? If not, the split out of the package types could be accomplished in a way where GHC uses simple show/read(P) style serialization for storage of package info, whereas cabal-lib would use a lovely parsec parser for humans. I'd like this approach.
The issue of putting yet one more HP package into GHC's core packages is increasing the exposure of the difficulty of the current GHC/HP relationship. See also threads in HP's mailing list for why we can't bump some packages in GHC's core set for the next HP release. The split arrangement is strange because we have two groups making up what is in the HP, but they have different processes and aims. The complex technical relationship between the moving parts only heightens the difficulty.
Perhaps the major cause is that, because GHC is shipped as a library itself, it exposes all its package dependencies. And as it is a large, and growing, piece of software, the list only wants to grow. But I wonder how often GHC is used as a library itself? If not often, then perhaps GHC should be shipped as two parts: Just a compiler (plus the small number of packages that the compiler forces), and ghc-lib as an optional, even separate, package - perhaps one with even a traditional way of depending on other packages. In other words, users who wanted to incorporate ghc-lib into their programs would depend on, download, configure, and build ghc-lib independently of the GHC binaries installed on their system. Perhaps then, GHC, the compiler, built from ghc-lib, would be bootstrapped not from the past compiler, but from the past HP.....
Okay, perhaps that is all just fantasy. But, no other programming system operates the way we do. They all fall into one of two camps:
* The dominant implementation is maintained, built, and shipped along with a large collection of "common packages". Examples: Python, Ruby, PHP, Java.
* The dominant implementation is shipped as a bare tool, and large common libraries are maintained and shipped independently. Examples: C++ (think g++ and boost), JavaScript (think browsers, and jQuery).
We are in the middle and, I think, experiencing growing pains because of it.
- Mark
On Sat, Mar 16, 2013 at 3:42 PM, dag.odenhall@gmail.com

On 18 March 2013 03:08, Simon Peyton-Jones
Is it essential, or even sensical, that the serialization format GHC needs for storing package info bear any relation to the human authored form? If not, the split out of the package types could be accomplished in a way where GHC uses simple show/read(P) style serialization for storage of package info, whereas cabal-lib would use a lovely parsec parser for humans. I'd like this approach.
Good idea -- esp if it makes the packaging story simpler. GHC already uses a binary format for interface files, so there’s no good reason to use a human-readable format for package data base stuff. For interface files you can read them with ghc --show-iface, and as Ian remarks something similar is already true for the package data base.
A bit of background here: the binary serialisation of packages is an optimisation only (though an important one), and is done independently of Cabal. To install a Cabal package you can put the package description file that Cabal generates into GHC's database directory, and it is picked up automatically. The binary cache can be updated separately with 'ghc-pkg recache'. It was done this way to make it easier for Linux distros that want to install packages by moving files into place and then running commands. So I don't think you want Cabal to know about the binary serialization format; it's a GHC-only optimisation. Cheers, Simon
Simon
From: ghc-devs-bounces@haskell.org [mailto:ghc-devs-bounces@haskell.org] On Behalf Of Mark Lentczner
Sent: 17 March 2013 16:57
To: dag.odenhall@gmail.com
Cc: Haskell Libraries; cabal-devel; Duncan Coutts; ghc-devs@haskell.org; Antoine Latter
Subject: Re: Advance notice that I'd like to make Cabal depend on parsec
This thread is raising all sorts of questions for me:
Is it essential, or even sensical, that the serialization format GHC needs for storing package info bear any relation to the human authored form? If not, the split out of the package types could be accomplished in a way where GHC uses simple show/read(P) style serialization for storage of package info, whereas cabal-lib would use a lovely parsec parser for humans. I'd like this approach.
The issue of putting yet one more HP package into GHC's core packages is increasing the exposure of the difficulty of the current GHC/HP relationship. See also threads in HP's mailing list for why we can't bump some packages in GHC's core set for the next HP release. The split arrangement is strange because we have two groups making up what is in the HP, but they have different processes and aims. The complex technical relationship between the moving parts only heightens the difficulty.
Perhaps the major cause is that, because GHC is shipped as a library itself, it exposes all its package dependencies. And as it is a large, and growing, piece of software, the list only wants to grow. But I wonder how often GHC is used as a library itself? If not often, then perhaps GHC should be shipped as two parts: Just a compiler (plus the small number of packages that the compiler forces), and ghc-lib as an optional, even separate, package - perhaps one with even a traditional way of depending on other packages. In other words, users who wanted to incorporate ghc-lib into their programs would depend on, download, configure, and build ghc-lib independently of the GHC binaries installed on their system. Perhaps then, GHC, the compiler, built from ghc-lib, would be bootstrapped not from the past compiler, but from the past HP.....
Okay, perhaps that is all just fantasy. But, no other programming system operates the way we do. They all fall into one of two camps:
* The dominant implementation is maintained, built, and shipped along with a large collection of "common packages". Examples: Python, Ruby, PHP, Java.
* The dominant implementation is shipped as a bare tool, and large common libraries are maintained and shipped independently. Examples: C++ (think g++ and boost), JavaScript (think browsers, and jQuery).
We are in the middle and, I think, experiencing growing pains because of it.
- Mark
On Sat, Mar 16, 2013 at 3:42 PM, dag.odenhall@gmail.com wrote:
I'd love to have a proper parser and source-location-aware AST for the sake of editor/IDE tools, so +1 from me. If you don't end up doing this after all, I'd still like to see your parser in a separate package, although I understand if you don't feel like maintaining two parsers, especially given the tedious process for verifying they work similarly. I guess it could still be useful in the same way we find haskell-src-exts useful despite some incompatibilities with GHC.
On Thu, Mar 14, 2013 at 3:53 PM, Duncan Coutts wrote:
Hi folks,
I want to give you advance notice that I would like to make Cabal depend on parsec. The implication is that GHC would therefore depend on parsec and thus it would become a core package, rather than just a HP package. So this would affect both GHC and the HP, though I hope not too much.
The rationale is that Cabal needs to parse things, like .cabal files and currently we do not have a decent parser in the core libraries. By decent I mean one that can produce error messages with source locations and that doesn't have unpredictable memory use. The only parser in the core libraries at the moment is Text.ParserCombinators.ReadP from the base package and that fails my "decent" criteria on both counts. Its idea of an error message is (), and on some largish .cabal files we take 100s of MB to parse (I realise that the ReadP in the base package is a cutdown version so I don't mean to malign all ReadP-style libs out there).
Partly due to the performance problem, the terrible .cabal file error messages, and partly because Doaitse Swierstra keeps asking me if .cabal files have a grammar, I've been writing a new .cabal parser. It uses an alex lexer and a parsec parser. It's fast and the error messages are pretty good. I have reverse engineered a grammar that closely matches the existing parser and .cabal files in the wild, though I'm not sure Doaitse will be satisfied with the approach I've taken to handling layout.
Why did I choose parsec? Practicality dictates that I can only use things in the core libraries, and the nearest thing we have to that is the parser lib that is in the HP. I tried to use happy but I could not construct a grammar/lexer combo to handle the layout (also, happy is not exactly known for its great error messages).
I've been doing regression testing against hackage and I'm satisfied that the new parser matches close enough. I've uncovered all kinds of horrors with .cabal files in the wild relying on quirks of the old parser. I've made adjustments for most of them but I will be breaking a half dozen old packages (most of those don't actually build correctly because though their syntax errors are not picked up by the parser, they do cause failure eventually).
So far I've just done the outline parser, not the individual field parsers. I'll be doing those next and then integrate. So this change is still a bit of a ways off, but I thought it'd be useful to warn people now.
Duncan
_______________________________________________ cabal-devel mailing list cabal-devel@haskell.org http://www.haskell.org/mailman/listinfo/cabal-devel

On Sun, 2013-03-17 at 09:57 -0700, Mark Lentczner wrote:
This thread is raising all sorts of questions for me:
[..]
The issue of putting yet one more HP package into GHC's core packages is increasing the exposure of the difficulty of the current GHC/HP relationship. See also threads in HP's mailing list for why we can't bump some packages in GHC's core set for the next HP release. The split arrangement is strange because we have two groups making up what is in the HP, but they have different processes and aims. The complex technical relationship between the moving parts only heightens the difficulty.
This is certainly worth thinking about. Perhaps parsec is the straw that broke the camel's back. It's not qualitatively different from the other core libs: for all of them we have the issue about version pinning and the effects on the release cycle. Duncan
participants (14)
- Carter Schonwald
- Conrad Parker
- dag.odenhall@gmail.com
- Duncan Coutts
- Gregory Collins
- Henning Thielemann
- Ian Lynagh
- Iavor Diatchki
- Isaac Dupree
- Joachim Breitner
- Malcolm Wallace
- Mark Lentczner
- Simon Marlow
- Simon Peyton-Jones