
I very much support moving to all-submodules. In fact, I argued for all-submodules when we made the half-submodules transition last year. Being able to easily check out a consistent and complete source code tree in a repeatable way is extremely important.

Checking out by date "works" if you have dated history in your git reflog. For example, see:

http://stackoverflow.com/questions/6990484/git-checkout-by-date

In general, git commits are *not* time ordered, so asking for the version at a particular time is not well-defined across different working repositories.

The GHC HQ buildbots dump fingerprints in a form that is usable directly with fingerprint.py. You can get these fingerprints from the ghc-builds@ archive. Unfortunately there was a large gap after MSR moved buildings where our builds did not run, but things are more or less working now. I believe Ben's buildbot package dumps fingerprints in a form that needs to be massaged before fingerprint.py can deal with it.

Geoff

On 06/05/2013 11:32 AM, Niklas Larsson wrote:
When I was fiddling with having to roll back everything to a known good state, I patched sync-all to check out all the repos to the state they were in on a certain date. It's pretty naive, but it should be usable for doing manual bisecting at least. I can't find the old mailing list archives, so I attach the patch here.
Niklas
2013/6/5 Austin Seipp
(Warning: incoming answer, followed by a rant.)
Base is not a submodule, meaning that there is essentially no way to automatically check it back out to the "exact same state" it was in, given some specified GHC commit - the commit IDs are not tracked.
At this point, you are basically on your own. You'll have to manually checkout libraries/base to a specific commit that occurred 'around' the same time as the GHC commit. In this case, that means looking through whatever commits hit HEAD on May 7th:
$ cd libraries/base
$ git log --until="May 7th"
The resulting list will show you what happened up to May 7th. Take the latest commit in that list, and check out base to that revision. Any commits afterward happened on May 8th or later:
$ git checkout -b temporary-io-fix <commit>
You're going to need to do this for every module that is not tracked as a submodule. Most of the repositories are very low-activity. base & testsuite are going to be the annoying ones.
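The per-repository steps above could be scripted roughly as follows. This is only a minimal sketch, not Niklas's sync-all patch: the function names and the repository list in the usage comment are hypothetical. It relies on `git rev-list -n 1 --before=<date>`, which filters on commit dates, so the caveat that git commits are not time ordered applies here too.

```python
import subprocess


def rev_before_command(date, ref="HEAD"):
    """Build the git command that prints the newest commit reachable from
    `ref` whose commit date is at or before `date`."""
    return ["git", "rev-list", "-n", "1", "--before=" + date, ref]


def checkout_by_date(repo_dirs, date):
    """Detach every repository in `repo_dirs` at its newest commit before `date`."""
    for repo in repo_dirs:
        commit = subprocess.check_output(
            rev_before_command(date), cwd=repo, text=True
        ).strip()
        if not commit:
            # Nothing old enough in this repository; leave it alone.
            print(f"{repo}: no commit before {date}, skipping")
            continue
        subprocess.check_call(["git", "checkout", "--detach", commit], cwd=repo)
        print(f"{repo}: now at {commit}")


# Hypothetical usage from the top of a GHC tree:
#   checkout_by_date(["libraries/base", "testsuite"], "2013-05-07 23:59")
```

Detaching HEAD instead of creating a branch keeps the sketch from leaving stray branches in every repository; a named branch works just as well for a one-off.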
You'll have to continue this 'manual bisection' by hand, with a very hefty dose of frustrating trial-and-error, in my experience.
There is a secondary alternative. GHC has a script called 'fingerprint.py' (in utils/fingerprint/) which is somewhat designed to work around this deficiency (very poorly.) This script basically dumps out a text file, containing a key/value pair mapping every repository to its current HEAD commit. It can then take that text file and automatically do 'git checkout' for you in every repo. The idea is you can take fingerprints of the tree, save the results, and cleanly check out to some state later.
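To make the idea concrete, here is a minimal sketch of the fingerprint approach. It is not the real fingerprint.py: the `repo|commit` text format and all function names are assumptions made up for illustration, and the real script's file format may differ.

```python
import subprocess


def head_commit(repo):
    """Return the commit id that HEAD points to in `repo`."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo, text=True
    ).strip()


def take_fingerprint(repo_dirs):
    """Map every repository path to its current HEAD commit."""
    return {repo: head_commit(repo) for repo in repo_dirs}


def format_fingerprint(fingerprint):
    """Serialise as one 'repo|commit' line per repository (assumed format)."""
    return "".join(
        f"{repo}|{commit}\n" for repo, commit in sorted(fingerprint.items())
    )


def parse_fingerprint(text):
    """Inverse of format_fingerprint; blank lines are ignored."""
    fingerprint = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        repo, commit = line.split("|", 1)
        fingerprint[repo] = commit
    return fingerprint


def restore_fingerprint(fingerprint):
    """Check out every repository at the commit recorded for it."""
    for repo, commit in fingerprint.items():
        subprocess.check_call(["git", "checkout", "--detach", commit], cwd=repo)
```

Saving the output of `format_fingerprint(take_fingerprint(repos))` after each successful build would give a list of known-good states to return to later.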
The GHC build bots run by Ben L.'s "Buildbox" library automatically run the 'fingerprint.py' script during nightly builds, from what I remember. It may be possible to just look in the ghc-builds archives, and steal some fingerprints from the last month off one of the buildbots. I don't know who maintains the individual bots; perhaps you can ask the list. However, this will at best give you a 1-day level of granularity, rather than commit-level granularity, which is still rather unsatisfying.
------------- Answer over, rant begins. ---------------------
I think we had this discussion fairly recently, but can someone *please* explain why we are in this situation of half submodules, half random-floating-git-repository-checkouts? It's terrible. I'm frankly surprised we've been doing it this long, for over a year or more. It is literally the worst of submodules and free-standing repositories put together, with none of the advantages of either.
Free-standing repos are attractive because they are just there, and you don't have to 'maintain' them (sort of.) Submodules are attractive because they identify the critical points in which your repositories depend on each other. We have neither benefit right now, clearly.
In particular, this makes it impossible to use tools like 'git bisect' which is *incredibly* useful for just these exact cases. Hell, you can even make 'git bisect' work almost 100% automatically with a tiny bit of shell scripting.
http://mainisusuallyafunction.blogspot.com/2012/09/tracking-down-unused-vari...
You could just instead have a script that built the compiler, and ran the built compiler on your testcase, after every bisection. Wouldn't it be *great* to have something like that Just Work? A tool like this could potentially boil down Kazu's bug almost automatically for example, with little-to-no frustrating intervention.
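As a rough illustration of such a script, here is a sketch of a step that `git bisect run` could call at every bisection point. The exit codes are the ones `git bisect run` documents (0 marks the commit good, 125 skips it, any other code from 1 to 127 marks it bad); the build and test commands in the usage comment are hypothetical placeholders, not GHC's actual build invocation.

```python
import subprocess

# Exit codes understood by `git bisect run`.
GOOD, BAD, SKIP = 0, 1, 125


def verdict(build_ok, test_ok):
    """Translate build/test results into a `git bisect run` exit code.

    A commit that does not build cannot be judged, so it is skipped
    rather than blamed for the bug being hunted.
    """
    if not build_ok:
        return SKIP
    return GOOD if test_ok else BAD


def succeeds(command):
    """Run `command` (a list of arguments) and report whether it exited with 0."""
    return subprocess.call(command) == 0


def bisect_step(build_cmd, test_cmd):
    """Build the tree, then run the test case only if the build worked."""
    build_ok = succeeds(build_cmd)
    test_ok = build_ok and succeeds(test_cmd)
    return verdict(build_ok, test_ok)


# Hypothetical usage: save this as bisect_step.py and end the file with
#   raise SystemExit(bisect_step(["make"], ["./run-testcase.sh"]))
# then drive it with
#   git bisect start <bad-commit> <good-commit>
#   git bisect run python bisect_step.py
```

This only helps once every dependent repository is pinned by the GHC commit; with free-standing repositories, the build at each step would still pick up whatever happens to be checked out in libraries/.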
And even now, looking at the repository listing of what is in libraries/, that are not submodules, I really see no reason why more - or even all - of them cannot be submodules. Is it a workflow issue of some sort? That's what I'm thinking at this point, but I also don't think it could be any worse than it is now.
Realistically, very few libraries GHC needs for bootstrapping seem to change that much. unix, integer-simple, haskeline and filepath for example change *extremely* infrequently, but all are free-standing. Why? In the event they were submodules, would anything actually be lost?
The maintainer - that is, not GHC HQ - would still 'own' the official repository. They can make changes to it. But if there is a necessity to pull that in for GHC (feature request, bug fix, random thing) it can be done by updating the submodule pointer to the new commit. But this must happen explicitly by a GHC committer. In the event they update the submodule pointer, they should also obviously make sure the build still works.
That means we have to update the submodule pointers ourselves if things change. That sucks I guess, but really, aside from base and testsuite, the two most frequently changing repositories, is that *actually* going to cost us a lot of work?
And even if it does cost us work, I'll speak for myself: I will gladly pay for that work and do it all myself if it means I can actually bisect and actually roll back my tree to some point to fix things, without needing to prepare for it months in advance using hacks like creating thousands of fingerprints with fingerprint.py every day when people make commits (no, I haven't done this, but it could be done, and I really don't want to do it).
Long-term reproducible builds are, IMO, a must for any project. *Especially* a project of our size. *Especially* a compiler of all things. But as it stands, when you build GHC, you can probably reproduce *today's* results and *today's* bugs. Last month's results? Last year's? Finding the difference between those months ago and today? Good luck - you will need it.
On Tue, Jun 4, 2013 at 8:07 PM, Kazu Yamamoto wrote:
> Hi,
>
> Andreas and I found that the new IO manager is not working properly in
> the current GHC head. I'm sure that it worked well at least on May 7.
>
> We need to narrow the range of commits, so I did:
>
> % git checkout bb2795db36b36966697c228315ae20767c4a8753
> % git submodule update
>
> But this does not checkout proper submodules. For instance,
> libraries/base has newer commits. And of course, building fails.
>
> Please tell us how to checkout proper submodules against a specific
> GHC tree.
>
> --Kazu
-- Regards, Austin - PGP: 4096R/0x91384671