Proposal: better library management ideas (was: how to checkout proper submodules)

9 Jun 2013

So, there seems to be a fairly clear majority in favor of doing
something, I think. The question, then, is what. I'm fairly convinced
from Ian's response earlier that submodules *can* be dangerous if
you're using a lot of high-traffic packages; in particular, the
ability for people to trample each other's changes might be bad. I
could see this happening for base, for example, if two people are
working on large features and do not coordinate a merge. Git's own
merge facility doesn't suffer nearly as badly from this problem, and
we can figure out how we want that to happen later.

However, it seems like every high-volume package is, for better or
worse, intimately tied to GHC. These packages are also the most
problematic to roll back 'in sync' with GHC. As Geoffrey mentioned,
this becomes even MORE difficult if you use merges without
fast-forwards or rebases, because dates no longer correlate
accurately. These packages include base and testsuite, and probably
nofib as well. In some sense, I agree with Malcolm that 'base' being
GHC-only is maybe unfortunate. But maybe it's not (I'll talk more
about this later), and maybe in the meantime we shouldn't lie to
ourselves.

So first off, I'd like to propose something which seems, to me, to be
the best approach if we want to avoid developer pain and get as many
wins as possible in the long run. I hope this doesn't sound actively
radical or anything, but it's going to totally sound actively radical
(though I don't think it is):

--> Let's just put base and testsuite inside the GHC repository
directly. No submodules, no floating repos. Just put it directly
inside and make a super commit, I guess. GHC becomes the de facto
repository. And hey, why not nofib?

I know, I know. People really want to split the maintenance burdens I
guess, and ideologically the Haskell community is all about clean
separation but, please? All of GHC HQ are the de facto maintainers of
this stuff anyway. And as Jan mentioned, testsuite is really *so*
crucial GHC should have it inline. The testsuite is perhaps the most
important of all.

There are other candidates for this treatment too, really. For
example, why are template-haskell, ghc-prim, and hpc split out? GHC is
the only thing that supports them. template-haskell is an especially
intrusive extension to support, and arguably hpc is as well.
integer-simple and integer-gmp follow the exact same story. Same with
hoopl and dph. They're all ours. We own them. Just put them all inside
GHC and be done with it. Fragmentation in the VCS is not necessary
when there need be none. These packages de facto ship with GHC and are
very tied to it.

I think people might be really opposed to a mega repository or
something, but honestly? There's less maintenance, cross-package
changes can work correctly and be tracked correctly in terms of
history. It's less work for maintainers. It's less to explain and
frankly, less to mess up. All of this I think is a huge win.

OK, so radical idea is out there. Let's look at some numbers. I think
ultimately anything will be a bit painful, because...

$ cd ~/ghc/ghc-work
$ grep -v "\#" ./packages | head --lines="-1" | wc -l
39

There are 39 sub packages which GHC requires (the -1 is because GHC
itself is listed as the final entry.) These aren't all libraries of
course. But that's a massive number of dependencies really, so
managing them is a pain.

How many are submodules already?

$ grep -v "\#" packages | head --lines="-1" | awk '{print $3}' \
    | grep "^-" | wc -l
14

So there are 14 submodules, and 25 packages that are free floating.
This is a very, very large number of dependent packages. I guess
that's just the price we pay.
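
The two counts above can also be produced in one pass. This is only a
rough sketch: it assumes, as the commands above do, that lines
containing '#' are comments, that the final remaining entry is GHC
itself, and that a "-" in the third column marks a submodule. The
function name classify_packages is made up for illustration.

```shell
# classify_packages FILE
# Prints "submodules=N floating=M total=T" for a packages-style file.
# Assumptions (taken from the commands above, not from any spec):
#   - any line containing '#' is a comment and is skipped
#   - the last remaining line is GHC itself and is not counted
#   - a third column of exactly "-" marks a submodule
classify_packages() {
    awk '
        !/#/ && NF {
            # Count the previous entry, so the final entry is never counted.
            if (seen) {
                if (prev == "-") nsub++
                else nflt++
            }
            prev = $3
            seen = 1
        }
        END {
            printf "submodules=%d floating=%d total=%d\n", nsub, nflt, nsub + nflt
        }
    ' "$1"
}
```

Run as `classify_packages ./packages` from the top of the tree. Doing
the work in awk avoids the GNU-only `head --lines=-1`.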

Let's say that hypothetically, we fold all those packages I said into
GHC (base, testsuite, nofib, template-haskell, integer-simple,
integer-gmp, hpc, ghc-prim.) That leaves 14 submodules and 17
floaters.

I actually believe that most of the submodules right now are a fairly
good trade off, because as designed they've all got upstreams. That's
good. But what about things that are *not* submodules?

Let's look at the commits over all the floaters in 1 year. The command
is:

$ git log --since="1 year ago" --format=oneline . | wc -l

* ghc-tarballs: 1
* hsc2hs: 11
* haddock: 147
* array: 10
* base: 306
* deepseq: 6
* directory: 19
* filepath: 3
* ghc-prim: 9
* haskell98: 11
* haskell2010: 7
* hoopl: 13
* hpc: 13
* integer-gmp: 29
* integer-simple: 8
* old-time: 5
* old-locale: 3
* process: 40
* template-haskell: 19
* unix: 32
* testsuite: 825
* nofib: 50
* parallel: 5
* stm: 31
* dph: 95
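
A list like the one above can be gathered with a small loop rather
than by hand. This is a sketch only: the helper name
count_recent_commits is hypothetical, and it assumes each floater is
checked out as its own git repository in the directory you pass.

```shell
# count_recent_commits DIR [SINCE]
# Prints "* <name>: <count>" using the same git log query as above.
# SINCE defaults to "1 year ago". The helper name is hypothetical.
count_recent_commits() {
    dir=$1
    since=${2:-1 year ago}
    [ -d "$dir" ] || return 1
    # Run inside the repository; the subshell leaves the caller's cwd alone.
    count=$(cd "$dir" && git log --since="$since" --format=oneline . | wc -l | tr -d ' ')
    printf '* %s: %s\n' "$(basename "$dir")" "$count"
}
```

For example, from the top of a GHC tree (the paths here are
illustrative):

    for d in ghc-tarballs utils/hsc2hs libraries/base; do
        count_recent_commits "$d"
    done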

Remember, a lot of the commits in several of these repositories are
somewhat closely tied to GHC commits. Testsuite especially, so the
numbers lie a little. But *now* let's take out all the ones we wanted
to fold in (including hoopl and dph, which I mentioned earlier).

* ghc-tarballs: 1
* hsc2hs: 11
* haddock: 147
* array: 10
* deepseq: 6
* directory: 19
* filepath: 3
* haskell98: 11
* haskell2010: 7
* old-time: 5
* old-locale: 3
* process: 40
* unix: 32
* parallel: 5
* stm: 31

These are all incredibly low traffic, with the exception of haddock,
which I was generous and listed anyway (even though I shouldn't,
because it uses the GHC API). Listing stm and parallel is also pretty
generous, I'd say.

Now let's think about this. Most of these could probably be converted
to submodules with very little loss. They are not very actively
touched in most development cycles, and after looking at a lot of the
changes, I think it's unlikely you'll hit many merge conflicts or
weird situations. And even if you do, it's probably not going to
happen *often*. It's even possible a lot of these could also become
upstreams with separate maintainers. A lot of these are not
necessarily dependent on GHC, in theory or in practice: unix, process,
deepseq, array, directory, filepath, etc. Someone could maintain them
and developers could work with them. Would anyone want to be a
maintainer?
(I heard some people clamoring for GitHub. Become a maintainer and you
can host it where you want :P)

Or we could also fold them in too - mega repository style - and just
say GHC HQ is the de-facto maintainer, as it is now. If someone wants
to step up, we can split it out later. That would just leave 14 sub
repositories which are pretty well taken care of with upstreams. Maybe
a few more if some people come onboard and can maintain things. This
would reduce our problems a lot I feel. Other things like ./sync-all
could change to support branching and other basic multi-repo
facilities as Jan said, and that's not totally unreasonable either I
think. It's about making the normal case easy.

We're often concerned with things being at the right granularity and
sharing stuff maybe, but I think the trend is pretty frighteningly
clear at this point in time - GHC is the de facto implementation of
Haskell, and the number of maintainers isn't especially high. And
maintaining it is a lot of work (it's truly a World Class™ programming
language implementation, after all.) And having 39 repositories is
scary. If that's the case, I'd say we should optimize where it counts
and minimize our own burden and make it easy to track our changes, and
make our workflows as simple as possible. Yes, hypothetically a
competitor can come along and give us a run for our money and maybe
they'll want to use base and the testsuite and all that other stuff
and we'll own it and whatnot. And duplication of work etc etc. And
that'll be sad.

Or not. And they'll do their complete own thing and run with it. UHC
has its own base and testsuite, as does JHC for example. Perhaps
sharing things like that is the exception, not the rule or regular
occurrence. Ultimately a software project is as much about ideals, and
what we believe is worth working on with our time - just as it is
about what code you're writing or using right now. Perhaps we should
not hinge our development strategies on these tactics any longer when
the pattern seems to be darn clear.

This proposal is fairly radical. It would require the agreement of
almost every single developer, because several of us have varying
degrees of ownership over parts of the source that concern all of
them. But like I said, it seems the majority would agree something
should change, and I don't think we should give up finding it, so
let's just see where our ideas take us. And I think the wins would be
enormous.

I also appreciate you all dealing with the novels I've written over
the past few days.

-- 
Regards,
Austin - PGP: 4096R/0x91384671

Austin Seipp