Re: how to checkout proper submodules

5 Jun 2013

I very much support moving to all-submodules. In fact, I argued for
all-submodules when we made the half-submodules transition last
year. Being able to easily check out a consistent and complete source
code tree in a repeatable way is extremely important.

Checking out by date "works" if you have dated history in your git
reflog. For example, see:

http://stackoverflow.com/questions/6990484/git-checkout-by-date

In general, git commits are *not* time ordered, so asking for the
version at a particular time is not well-defined across different
working repositories.

The GHC HQ buildbots dump fingerprints in a form that is usable directly
with fingerprint.py. You can get these fingerprints from the ghc-builds@
archive. Unfortunately there was a large gap after MSR moved buildings
where our builds did not run, but things are more or less working now. I
believe Ben's buildbot package dumps fingerprints in a form that needs
to be massaged before fingerprint.py can deal with it.

Geoff

On 06/05/2013 11:32 AM, Niklas Larsson wrote:
...
When I was fiddling with having to roll back everything to a known good
state, I patched sync-all to check out all the repos to the state they
were in on a certain date. It's pretty naive, but it should be usable
for doing manual bisecting at least. I can't find the old mailing list
archives, so I attach the patch here.
Niklas
2013/6/5 Austin Seipp 
(Warning: incoming answer, followed by a rant.)
Base is not a submodule, meaning that there is essentially no way to
    automatically check it back out to the "exact same state" it was in,
    given some specified GHC commit - the commit IDs are not tracked.
At this point, you are basically on your own. You'll have to manually
    checkout libraries/base to a specific commit that occurred 'around'
    the same time as the GHC commit. In this case, that means looking
    through whatever commits hit HEAD on May 7th:
$ cd libraries/base
    $ git log --until="May 7th"
The resulting list will show you what happened up to May 7th. Take the
    latest commit in that list, and check out base to that revision. Any
    commits afterward happened on May 8th or later:
$ git checkout -b temporary-io-fix 
You're going to need to do this for every module that is not tracked
    as a submodule. Most of the repositories are very low-activity. base &
    testsuite are going to be the annoying ones.
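A rough sketch of how the per-repository step above could be scripted, for illustration only. The repository list, the branch name "master", and the cutoff date are assumptions, not anything taken from sync-all or the attached patch. Note that `git rev-list --before` filters on committer date, so the result is only as trustworthy as the timestamps in each repository.

```python
import subprocess

# Hypothetical list of free-standing repos, relative to the GHC tree root.
REPOS = ["libraries/base", "testsuite"]


def rev_list_cmd(date, branch="master"):
    """Build the git command that prints the newest commit on `branch`
    whose committer date is at or before `date`."""
    return ["git", "rev-list", "-n", "1", "--before=" + date, branch]


def checkout_as_of(repo, date, branch="master"):
    """Detach `repo` at the newest commit on `branch` dated before `date`."""
    sha = subprocess.check_output(
        rev_list_cmd(date, branch), cwd=repo, text=True
    ).strip()
    if not sha:
        raise RuntimeError("no commit before %s in %s" % (date, repo))
    subprocess.check_call(["git", "checkout", "--detach", sha], cwd=repo)
    return sha


def checkout_all(date):
    """Apply checkout_as_of to every repo in REPOS and report what was chosen."""
    return {repo: checkout_as_of(repo, date) for repo in REPOS}
```

Calling `checkout_all("2013-05-08 00:00")` from the top of the tree would leave each listed repository on a detached HEAD. It is still a guess at a consistent state, not a recorded one.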
You'll have to continue this 'manual bisection' by hand, with a very
    hefty dose of frustrating trial-and-error, in my experience.
There is a secondary alternative. GHC has a script called
    'fingerprint.py' (in utils/fingerprint/) which is somewhat designed to
    work around this deficiency (very poorly.) This script basically dumps
    out a text file, containing a key/value pair mapping every repository
    to its current HEAD commit. It can then take that text file and
    automatically do 'git checkout' for you in every repo. The idea is you
    can take fingerprints of the tree, save the results, and cleanly check
    out to some state later.
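To make the idea concrete, here is a minimal sketch of a fingerprint in the sense described above: a text file mapping each repository path to a commit. The `path|commit` line format and all function names are assumptions made for illustration. This is not the real fingerprint.py and may not match its file format.

```python
import subprocess


def head_commit(repo):
    """Return the commit that HEAD currently points to in `repo`."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo, text=True
    ).strip()


def format_fingerprint(commits):
    """Serialize {repo path: commit id} as one 'path|commit' line per repo."""
    return "".join("%s|%s\n" % (path, sha) for path, sha in sorted(commits.items()))


def parse_fingerprint(text):
    """Inverse of format_fingerprint; blank lines are ignored."""
    commits = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        path, sha = line.rsplit("|", 1)
        commits[path] = sha
    return commits


def restore(commits):
    """Check out every repo at its recorded commit (detached HEAD)."""
    for path, sha in sorted(commits.items()):
        subprocess.check_call(["git", "checkout", "--detach", sha], cwd=path)
```

Taking a fingerprint is then `format_fingerprint({r: head_commit(r) for r in repos})` for whatever list of repository paths applies, and `restore(parse_fingerprint(text))` returns the tree to that state.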
The GHC build bots driven by Ben L.'s "Buildbox" library automatically
    run the 'fingerprint.py' script during nightly builds, from what I
    remember. It may be possible to just look in the ghc-builds archives,
    and steal some fingerprints from the last month off one of the
    buildbots. I don't know who maintains the individual bots; perhaps you
    can ask the list. However, this will at best give you a 1-day level of
    granularity, rather than commit level granularity, which is still
    rather unsatisfying.
------------- Answer over, rant begins. ---------------------
I know we had this discussion fairly recently, but can
    someone *please* explain why we are in this situation of half
    submodules, half random-floating-git-repository-checkouts? It's
    terrible. I'm frankly surprised we've even been doing it this long
    (a year or more?). It is literally the worst of submodules and
    free-standing repositories put together, with none of the advantages
    of either.
Free-standing repos are attractive because they are just there, and
    you don't have to 'maintain' them (sort of.) Submodules are attractive
    because they identify the critical points in which your repositories
    depend on each other. We have neither benefit right now, clearly.
In particular, this makes it impossible to use tools like 'git bisect',
    which is *incredibly* useful for just these exact cases. Hell, you can
    even make 'git bisect' work almost 100% automatically with a tiny bit
    of shell scripting.
http://mainisusuallyafunction.blogspot.com/2012/09/tracking-down-unused-vari...
...
You could just instead have a script that built the compiler, and ran
    the built compiler on your testcase, after every bisection. Wouldn't
    it be *great* to have something like that Just Work? A tool like this
    could potentially boil down Kazu's bug almost automatically for
    example, with little-to-no frustrating intervention.
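A sketch of what such a per-step script might look like, assuming a tree where one commit pins everything (that is, the all-submodules setup argued for here). The build and test commands are placeholders. The exit codes follow the documented 'git bisect run' convention: 0 marks the commit good, 125 asks bisect to skip it, and other codes from 1 to 127 mark it bad.

```python
import subprocess

# Placeholder commands; substitute the real build and testcase invocations.
BUILD_CMD = ["make"]
TEST_CMD = ["./testcase.sh"]


def bisect_exit_code(build_ok, test_ok):
    """Map the outcome of one bisection step to a 'git bisect run' exit code."""
    if not build_ok:
        return 125  # cannot judge this commit: tell bisect to skip it
    return 0 if test_ok else 1  # 0 = good, 1 = bad


def succeeds(cmd):
    """Run `cmd` and report whether it exited with status 0."""
    return subprocess.call(cmd) == 0


def main():
    """Build the compiler, run the testcase, and return the bisect verdict."""
    build_ok = succeeds(BUILD_CMD)
    test_ok = build_ok and succeeds(TEST_CMD)
    return bisect_exit_code(build_ok, test_ok)
```

Saved to a file that ends by passing the result of `main()` to `sys.exit`, this could be driven with 'git bisect start', the bad and good commits, and then 'git bisect run' on the script. With submodules, the step would also need to run 'git submodule update' before building.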
And even now, looking at the listing of repositories in libraries/
    that are not submodules, I really see no reason why more -
    or even all - of them cannot be submodules. Is it a workflow issue of
    some sort? That's what I'm thinking at this point, but I also don't
    think it could be any worse than it is now.
Realistically, very few libraries GHC needs for bootstrapping seem to
    change that much. unix, integer-simple, haskeline and filepath for
    example change *extremely* infrequently, but all are free-standing.
    Why? In the event they were submodules, would anything actually be
    lost?
The maintainer - that is, not GHC HQ - would still 'own' the official
    repository. They can make changes to it. But if there is a necessity
    to pull that in for GHC (feature request, bug fix, random thing) it
    can be done by updating the submodule pointer to the new commit. But
    this must happen explicitly by a GHC committer. In the event they
    update the submodule pointer, they should also obviously make sure the
    build still works.
That means we have to update the submodule pointers ourselves if
    things change. That sucks I guess, but really, aside from base and
    testsuite, the two most frequently changing repositories, is that
    *actually* going to cost us a lot of work?
And even if it does cost us work, I'll speak for myself: I will gladly
    pay for that work and do it all myself if it means I can actually
    bisect and actually roll back my tree to some point to fix things -
    without needing to prepare for it months in advance using hacks like
    creating thousands of fingerprints, running fingerprint.py every day
    when people make commits (no, I haven't done this, but it could be
    done, and I really don't want to do it.)
Long-term reproducible builds are, IMO, a must for any project.
    *Especially* a project of our size. *Especially* a compiler of all
    things. But as it stands, when you build GHC, you can probably
    reproduce *today's* results and *today's* bugs. Last month's results?
    Last year's? Finding the difference between those months ago and today?
    Good luck - you will need it.
On Tue, Jun 4, 2013 at 8:07 PM, Kazu Yamamoto  wrote:
    > Hi,
    >
    > Andreas and I found that the new IO manager is not working properly in
    > the current GHC head. I'm sure that it worked well at least on May 7.
    >
    > We need to narrow the range of commits, so I did:
    >
    >   % git checkout bb2795db36b36966697c228315ae20767c4a8753
    >   % git submodule update
    >
    > But this does not checkout proper submodules. For instance,
    > libraries/base has newer commits. And of course, building fails.
    >
    > Please tell us how to checkout proper submodules against a specific
    > GHC tree.
    >
    > --Kazu
--
    Regards,
    Austin - PGP: 4096R/0x91384671

Geoffrey Mainland