RE: [Haskell-cafe] fptools in darcs now available

On 28 April 2005 16:02, John Goerzen wrote:
3. Rename the smaller project's files as appropriate
4. Checkpoint here
Won't that leave a lot of useless history in each individual project? When I do 'darcs get' locally to check out a few sub-repositories, won't I get N copies of all the old history?
Yes, but they'll all be hardlinked together, so no matter how many copies you get, the old history is only stored on disk once.
If I do 'darcs get' to get a bunch of different repositories from cvs.haskell.org to my local filesystem, they won't all end up hard-linked together, surely?
I bet some darcs hacker could come up with a script that would even let us remove the patches from a descended repo that don't apply anymore.
Now that sounds like a plan.
I know it's not ideal, but trying to manually convert each part of the CVS repo, one at a time, is far, far less ideal. Not something I care to attempt, anyway :-)
I don't understand why converting each part of the repo separately is so hard? (but I've never used the tools - I'm just curious about why it's so difficult).
The other thing is that having a big repo in darcs may turn out to be not as bad as we all thought at first. The only really annoying thing there is how long it takes to do an initial darcs get (~5 min for me if I use --partial). Other than that, darcs performs quite nicely.
I agree with your list of pros/cons, and just to be clear, I'm not stuck on the idea of having a split repo. In fact, the simplicity of a single repo is very attractive. But what worries me is: if I just want to check out e.g. Haddock, I have to get the entire fptools repo (350M+, wasn't it?). I can build a source distribution with just the bits I want, but I can't get a darcs tree with anything but the whole lot.
So, here's two potential solutions:
1. Make it possible to 'darcs get' just part of a tree. Patches that don't touch any files in the "live" parts of the tree are discarded. (I don't know if this is possible, or how difficult it is).
2. Create separate repositories for GHC, Happy, Haddock etc., and duplicate the shared fptools structure in each project. Each time we modify something in the shared part of the tree, we pull the patch into the other trees. (is it possible to cherry-pick from a tree that doesn't have a common ancestor? If not, can we make the repositories appear to have common ancestry?).
Cheers, Simon

Simon Marlow writes:
if I just want to check out e.g. Haddock, I have to get the entire fptools repo (350M+, wasn't it?).
I guess the "best" way to do that with Darcs would be to (1) pull the fp-tools repository, (2) delete all files you don't need for Haddock, (3) pull that into your Haddock repository. So by pulling Haddock you would automatically get those parts of fptools that you need. The intermediate repository created in (2) can be deleted afterwards. Now you can pull from "fp-tools" into "Haddock" to update your build infrastructure.
1. Make it possible to 'darcs get' just part of a tree.
I might be wrong about this, but my impression is that Darcs does not support "modules" of any kind. You check out an entire repository, not less.
2. Create separate repositories for GHC, Happy, Haddock etc., and duplicate the shared fptools structure in each project. Each time we modify something in the shared part of the tree, we pull the patch into the other trees.
That's the way to do it, IMHO.
is it possible to cherry-pick from a tree that doesn't have a common ancestor?
Yes, although the merging process may be non-trivial.
If not, can we make the repositories appear to have common ancestry?).
Just pull the "common ancestor" repository into all sub-repositories, as described above. Peter

On Fri, Apr 29, 2005 at 10:39:13AM +0100, Simon Marlow wrote:
On 28 April 2005 16:02, John Goerzen wrote:
Yes, but they'll all be hardlinked together, so no matter how many copies you get, the old history is only stored on disk once.
If I do 'darcs get' to get a bunch of different repositories from cvs.haskell.org to my local filesystem, they won't all end up hard-linked together, surely?
Not automatically in that case, no. But you could use darcs optimize --relink to restore them to linked status. Or better yet:
1) Check out the most recent common ancestor
2) darcs get it n times across the local filesystem (resulting in a bunch of hardlinked patches)
3) darcs pull the appropriate repo that you want in each one of them
I know it's not ideal, but trying to manually convert each part of the CVS repo, one at a time, is far, far less ideal. Not something I care to attempt, anyway :-)
I don't understand why converting each part of the repo separately is so hard? (but I've never used the tools - I'm just curious about why it's so difficult).
These are the main issues.
1) Logical projects have changed names/paths within the repo. To properly preserve the full history of each individual project will require research and manual intervention to get right. For instance, the directory currently known as greencard used to be known as green-card. If I blindly convert only the greencard directory, the history in green-card will be forever lost to the darcs repo. Tracking all the history and spending the time to do this right could be very time-consuming. If we don't care about the full history, it becomes easier.
2) Bidirectional mirroring with CVS. It's a complex enough thing to set up to begin with, and it may not be practical to do a bidirectional mirror of only part of a CVS repo. (I don't know yet.)
Here's another thought: perhaps having fptools in darcs doesn't require all the CVS history; maybe we just start with a big import, have a very specific cut-over date, and keep the CVS repo around in read-only mode after that if there's a need to find older history. That would make it pretty easy to split up into separate darcs repos.
But what worries me is: if I just want to check out e.g. Haddock, I have to get the entire fptools repo (350M+, wasn't it?). I can build a source distribution with just the bits I want, but I can't get a darcs tree with anything but the whole lot.
True. But OTOH, they will only need to download about 18MB of data. (14MB for the latest checkpoint, plus about another 4MB for the inventory file + more recent patches.) This expands to roughly 304MB on-disk, since darcs by default creates two copies of the checked-out files (a pristine tree and the working tree). By way of comparison, the Linux kernel source comes as a 35MB tar.bz2 and, when built, consumes a little more space. So I don't consider it to be really out of line with what people would expect to do to participate with a major project these days. Of course, downloading a 50K checkpoint plus 10K of extra data for a small project like happy would be faster. But I'd say that the more long-term question isn't technical but organizational: how do we think fptools development will shake out in the next few years? Will we still see a lot of cross-project commits? Or will we see fragmentation, where individual projects get adopted by different people? Also, I think it's easier to split a darcs repo than it is to join them.
So, here's two potential solutions:
1. Make it possible to 'darcs get' just part of a tree. Patches that don't touch any files in the "live" parts of the tree are discarded. (I don't know if this is possible, or how difficult it is).
That's an interesting question. It's not a darcs feature now, but I also don't know how hard it is.
2. Create separate repositories for GHC, Happy, Haddock etc., and duplicate the shared fptools structure in each project. Each time we modify something in the shared part of the tree, we pull the patch into the other trees. (is it possible to cherry-pick from a tree that doesn't have a common ancestor? If not, can we make the repositories appear to have common ancestry?).
No, you can't cherry-pick if there's no common ancestor, but you can make this appear to have a common ancestor. The idea is basically to start each one from a repo that has only the common parts, in their own directory, and then merge in the relevant patches to make each unique project. I do that with my sgml-common system, which is a set of scripts and support for building documentation and manpages from DocBook SGML sources. I use it in several of my projects and it works well. -- John
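The "start each project from a repo holding only the common parts" idea can be illustrated with a toy model in which a repository is just a set of patch identifiers. This is only a sketch under that assumption: the patch names are invented for illustration, and real darcs patches also carry ordering and dependencies that a plain set ignores.

```python
def pull(target, source, wanted=None):
    """Return target plus the patches from source (all, or only those in `wanted`)."""
    incoming = source - target
    if wanted is not None:
        incoming &= set(wanted)
    return target | incoming


# Hypothetical patch names, for illustration only.
common = {"common: add mk/ build rules"}

# Each project starts from the common repo, then adds its own patches.
haddock = pull(set(), common) | {"haddock: import sources"}
happy = pull(set(), common) | {"happy: import sources"}

# A later fix is recorded once, in the common repo ...
common = common | {"common: fix mk/ paths"}
# ... and then pulled into each project.
haddock = pull(haddock, common)
happy = pull(happy, common)

# The shared part of both projects is exactly the common history.
shared = haddock & happy
```

Because every project descends from the same common patches, later changes to the shared part can be pulled rather than re-recorded in each tree.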

If I do 'darcs get' to get a bunch of different repositories from cvs.haskell.org to my local filesystem, they won't all end up hard-linked together, surely?
Not automatically in that case, no. But you could use darcs optimize --relink to restore them to linked status. Or better yet:
Just to be precise, if A, B and C are the repositories, optimally you'd do something like
(cd B; darcs optimize --relink --sibling ../C)
(cd A; darcs optimize --relink --sibling ../B --sibling ../C)
This will link anything that can be linked from C into B, then anything that can be linked from either B or C into A. But you shouldn't worry about being optimal; just call ``optimize --relink'' with all the other likely repositories as siblings, and you'll end up converging to maximal sharing. Optimize --relink is relatively fast, and it should be safe, so nothing prevents you from relinking often (for example, each time you pull a new pool of changes).
1) Check out the most recent common ancestor
2) darcs get it n times across the local filesystem (resulting in a bunch of hardlinked patches)
3) darcs pull the appropriate repo that you want in each one of them
Yes, this will avoid the extra network traffic. However, you should still manually ``optimize --relink'' after doing that, as ``get'' doesn't currently link pristine trees (it only links patches). Juliusz
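To make the hard-linking idea concrete, here is a minimal Python sketch of relinking identical files between two directories. It is not how darcs implements ``optimize --relink''; the function name and the flat directory layout are assumptions made for illustration, and it needs a filesystem that supports hard links.

```python
import filecmp
import os


def relink(target_dir, sibling_dir):
    """Replace files in target_dir with hard links to identical files in sibling_dir.

    Returns the number of files that were newly linked.
    """
    linked = 0
    for name in sorted(os.listdir(sibling_dir)):
        src = os.path.join(sibling_dir, name)
        dst = os.path.join(target_dir, name)
        if not (os.path.isfile(src) and os.path.isfile(dst)):
            continue
        if os.path.samefile(src, dst):
            continue  # already stored only once
        if filecmp.cmp(src, dst, shallow=False):
            tmp = dst + ".relink-tmp"
            os.link(src, tmp)     # new name for the sibling's copy
            os.replace(tmp, dst)  # swap it in place of the duplicate
            linked += 1
    return linked
```

Running it a second time over the same pair links nothing new, which matches the point above that relinking often is harmless.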

Also, I think it's easier to split a darcs repo than it is to join them.
...
1. Make it possible to 'darcs get' just part of a tree. Patches that don't touch any files in the "live" parts of the tree are discarded. (I don't know if this is possible, or how difficult it is).
That's an interesting question. It's not a darcs feature now, but I also don't know how hard it is.
You can currently do something like "darcs changes somepath" to get a list of patches that touched that file or directory. You can then do "darcs pull -ppatchname" for each of those patches. Of course, you can write a shell script to do that second step for each patch. I've done things like this many times. This is an example of what you call "splitting a repo", I guess. It isn't that hard.
"Joining a repo" is decidedly easier: "mkdir newjoinedthing ; cd newjoinedthing ; darcs init ; darcs pull -a $firstrepo ; darcs pull -a $secondrepo".
The only tricky part is "doppleganger patches". Basically at this point if you get "doppleganger patches" then you should manually intervene, figure out what the conflict is, manually fix it, and then resume. It's a big problem, because the "manual fix" probably makes it impossible to use any darcs patches which depend on (at least one of) the conflicting patches. You would, if you got into that kind of a fix, probably have to write a shell script (or a Haskell program, if you like) to run "darcs diff -u -p$patchname | ( cd newreconstructedrepo ; patch -p0 ; darcs record --all -m$patchname )" for every patch that depended on the doppleganger patches.
However, avoiding doppleganger conflicts is simple: never make any identical change more than once to any darcs repo. For example, suppose you wanted to add a new subdirectory named "happy". You could go to one darcs repo and do "mkdir happy ; darcs add happy ; darcs record --all -mcreatehappydir". Now you could go to another darcs repo and do "mkdir happy ; darcs add happy ; darcs record --all -mcreatehappydir". Whoops! You just created doppleganger patches! If you ever pull from one of those repos into the other, darcs will take something like O(2^n) time where n is the number of patches that depend on the "createhappydir" patches. The solution is simply that when you want to create a "happy" dir in the other darcs repo, you cd into that repo and run "darcs pull -pcreatehappydir". Now you have a happy dir in both of the repos, and you don't have any doppleganger patches. The same caution applies to adding files, changing files (such as with "patch -p0 < newfeature.diff"), etc. You must do such things only once, in one repo, and then pull the change through darcs into any other repo that wants the change.
As long as you avoid doppleganger patches, then you guys are worrying about this too much. Darcs will make it relatively easy for you to do it this way, do it that way, or change your mind halfway through and move everything around again. It isn't like CVS, where you have to agonize about putting everything in the right place at the start and then live with it for years or undergo painful transitions.
Regards, Zooko
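The "list the patches touching a path, then pull each one" recipe might be scripted roughly as follows. This is a sketch, not a tested tool: the patch records are assumed to be already available as (name, touched paths) pairs rather than parsed from real "darcs changes" output, the function names are invented, and the commands are only built, never run. Since -p takes a pattern, names containing regex metacharacters would need escaping, which this sketch does not attempt.

```python
def patches_touching(patches, subtree):
    """Names of patches touching any path at or below `subtree`, in order."""
    root = subtree.rstrip("/")
    prefix = root + "/"
    return [
        name
        for name, paths in patches
        if any(p == root or p.startswith(prefix) for p in paths)
    ]


def pull_commands(names, orig_repo):
    """Build one 'darcs pull' argument list per patch name (not executed)."""
    return [["darcs", "pull", "-a", "-p", name, orig_repo] for name in names]


# Hypothetical patch records, for illustration only.
patches = [
    ("createhappydir", ["happy"]),
    ("add happy parser", ["happy/src/Parser.y"]),
    ("add haddock", ["haddock/Main.hs"]),
    ("unrelated extras", ["happy-extras/README"]),
]
commands = pull_commands(patches_touching(patches, "happy"), "../fptools")
```

Each entry of `commands` could then be handed to something like `subprocess.run` inside the new repository.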

I wrote:
You can currently do something like "darcs changes somepath" to get a list of patches that touched that file or directory. You can then do "darcs pull -ppatchname" for each of those patches. Of course, you can write a shell script to do that second step for each patch. I've done things like this many times.
This is an example of what you call "splitting a repo", I guess. It isn't that hard.
Oh, and I forgot to mention the even easier way. If you are lucky enough to have all of the patches that you *don't* want dependent on one patch... For example, suppose that there are two patches, one which creates "happy" and one which creates "sad". Now suppose all of the other patches in the repo added files inside one or the other of these subdirectories but not both. Then if you want just the happy parts and not the sad, you simply do this:
mkdir newhappyrepo ; cd newhappyrepo ; darcs init
darcs pull $origrepo
Now darcs will show you the patch that adds "happy", asking if you want that patch. You hit 'y'. Now darcs will show you the patch that adds "sad", asking if you want that patch. You hit 'n'. Now darcs will show you *only* patches which did not add anything into sad. So you can hit 'a' to get all remaining patches. Regards, Zooko
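Why refusing the one "sad" patch hides everything built on it can be shown with a toy dependency model. This is only a sketch: the dependencies are written out by hand here, whereas darcs works them out itself, and the patch names are made up.

```python
def still_offered(deps, refused):
    """Patches that can still be pulled after refusing `refused`.

    deps maps each patch name to the set of patches it depends on.
    Anything that depends, directly or indirectly, on a refused patch is dropped.
    """
    excluded = set(refused)
    changed = True
    while changed:
        changed = False
        for patch, needs in deps.items():
            if patch not in excluded and needs & excluded:
                excluded.add(patch)
                changed = True
    return [patch for patch in deps if patch not in excluded]


# Hypothetical repository history, for illustration only.
deps = {
    "add happy": set(),
    "add sad": set(),
    "happy/Main.hs": {"add happy"},
    "sad/Main.hs": {"add sad"},
    "sad/Util.hs": {"sad/Main.hs"},
}
remaining = still_offered(deps, {"add sad"})
```

After saying 'n' to "add sad", only the happy patches remain, so answering 'a' for the rest is safe.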

On Fri, Apr 29, 2005 at 11:17:07AM -0300, zooko@zooko.com wrote:
The only tricky part is "doppleganger patches". Basically at this point if you get "doppleganger patches" then you should manually intervene, figure out what the conflict is, manually fix it, and then resume. It's a big problem,
I don't know if this is what you call a "doppleganger patch", but a lot of times when I try this, I have a problem because files with the same name are added at some point in the history of both trees (even if the current versions don't conflict), and this leads to the spinning conflict resolution. Commonly seen with files named Makefile and the like. -- John

I don't know if this is what you call a "doppleganger patch", but a lot of times when I try this, I have a problem because files with the same name are added at some point in the history of both trees (even if the current versions don't conflict), and this leads to the spinning conflict resolution.
Commonly seen with files named Makefile and the like.
Yes, I believe this is an instance of the same problem. This can be avoided by adding them in differently-named subdirectories in the first place. Regards, Zooko

On Fri, Apr 29, 2005 at 08:38:05AM -0500, John Goerzen wrote:
On Fri, Apr 29, 2005 at 10:39:13AM +0100, Simon Marlow wrote:
So, here's two potential solutions:
1. Make it possible to 'darcs get' just part of a tree. Patches that don't touch any files in the "live" parts of the tree are discarded. (I don't know if this is possible, or how difficult it is).
That's an interesting question. It's not a darcs feature now, but I also don't know how hard it is.
It's hard, but maybe not impracticably hard. We'd need to store an index of which patches affect which directories (and perhaps files). And we'd need to modify the application of patches to work when parts of those patches can't be applied (because they apply to files or directories which aren't present). On the plus side, the index of patches affecting files and directories would be hugely useful for speeding up annotate, whose slowness is a moderately common complaint about darcs, so this part of the work wouldn't necessarily be "wasted". And in fact this could be done and tested and used independently from the "get part of a tree" feature. The second idea is a bit weirder and more contrary to darcs' philosophy. Perhaps this could be done using stubs to indicate that the file isn't present. You run into trouble when there is a patch move ./foo/file ./bar/file if the foo directory isn't present, but the bar directory is. If we carried around stubs of some sort (a non-portable implementation idea would be a symlink to "MAGIC_DARCS_STUB_WORD"), we could just add a "not present" stub as ./bar/file, and let the user yell at whoever moved a file between "modules". Or perhaps there'd be a way to "check out" the directories or files you don't have present, which you could use when you see stub files showing up in your desired directories.
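As a rough picture of what such an index might look like (an illustrative sketch only, not a proposed darcs data structure), the following maps every directory to the patches that touch something at or below it. The patch records and the function name are assumptions.

```python
import posixpath
from collections import defaultdict


def build_index(patches):
    """Map each directory ('.' for the root) to the patches touching it or anything below."""
    index = defaultdict(list)
    for name, paths in patches:
        dirs = set()
        for path in paths:
            d = posixpath.dirname(path)
            while True:
                dirs.add(d or ".")
                if not d:
                    break
                d = posixpath.dirname(d)  # walk up towards the root
        for d in sorted(dirs):
            index[d].append(name)
    return dict(index)


# Hypothetical patch records: (name, files touched).
patches = [
    ("p1", ["ghc/compiler/Main.hs"]),
    ("p2", ["haddock/src/Main.hs", "mk/config.mk"]),
    ("p3", ["ghc/rts/Gc.c"]),
]
index = build_index(patches)
```

A lookup such as `index["haddock"]` then gives the candidate patches for a partial get of that directory, and the same table would let annotate skip patches that never touched the file's directory.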
2. Create separate repositories for GHC, Happy, Haddock etc., and duplicate the shared fptools structure in each project. Each time we modify something in the shared part of the tree, we pull the patch into the other trees. (is it possible to cherry-pick from a tree that doesn't have a common ancestor? If not, can we make the repositories appear to have common ancestry?).
No, you can't cherry-pick if there's no common ancestor, but you can make this appear to have a common ancestor. The idea is basically to start each one from a repo that has only the common parts, in their own directory, and then merge in the relevant patches to make each unique project. I do that with my sgml-common system, which is a set of scripts and support for building documentation and manpages from DocBook SGML sources. I use it in several of my projects and it works well.
Yes, this is a nice way of doing things. The catch is that it all breaks down when someone records a patch that touches both the common parts and a specific part. Perhaps a clever script could check for such a scenario and reject those patches. A *seriously* clever script (hook in darcs?) could prevent it from happening in the first place, which would be best. -- David Roundy http://www.darcs.net
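The check such a script would perform could look something like this. It is a sketch under assumptions: the list of "common" directories is invented, and the paths touched by a patch are taken as given rather than extracted from darcs.

```python
def touches(path, directory):
    """True if `path` is `directory` itself or lies somewhere below it."""
    directory = directory.rstrip("/")
    return path == directory or path.startswith(directory + "/")


def mixes_common_and_specific(paths, common_dirs):
    """True if a patch touches both the shared part and a project-specific part."""
    in_common = [any(touches(p, d) for d in common_dirs) for p in paths]
    return any(in_common) and not all(in_common)


# Hypothetical layout: mk/ and config/ are the shared fptools parts.
COMMON_DIRS = ["mk", "config"]
bad_patch = ["mk/rules.mk", "haddock/src/Main.hs"]
ok_patch = ["mk/rules.mk", "config/settings"]
```

A patch for which `mixes_common_and_specific` returns True is the kind that would be rejected, or ideally split before it is recorded.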

Simon Marlow wrote:
But what worries me is: if I just want to check out e.g. Haddock, I have to get the entire fptools repo (350M+, wasn't it?). I can build a source distribution with just the bits I want, but I can't get a darcs tree with anything but the whole lot.
So, here's two potential solutions:
1. Make it possible to 'darcs get' just part of a tree. Patches that don't touch any files in the "live" parts of the tree are discarded. (I don't know if this is possible, or how difficult it is).
I like this solution, especially now that David says that it is not impossible to do. In general I think it is a good idea to be able to get a part of the tree -- this might be very useful to handle big projects like the linux kernel where many developers just need to touch tiny parts of the repository. However, I think that darcs should never "discard" patches: all patches are always applied, and record works just as it normally works. The only difference is that the absent part of the tree is treated as an unobservable part of the tree -- patch applications to absent parts of the tree are just void operations as they can not be observed. In this design, a "darcs get" on a part of the tree is like building a special view on the tree. (As such, the tree should probably still always start from the root -- one would not be able to just get a bunch of leaves) In this setup, I think darcs will still be able to transparently handle patches that touch present and absent parts of the tree, and also moves from absent to present parts etc. In general, this feature might allow darcs to overcome most efficiency problems associated with large repositories. Alas, I do not know how much effort this feature might take (and I do not volunteer to do it), but it does seem a potentially important one. All the best, -- Daan Leijen.
2. Create separate repositories for GHC, Happy, Haddock etc., and duplicate the shared fptools structure in each project. Each time we modify something in the shared part of the tree, we pull the patch into the other trees. (is it possible to cherry-pick from a tree that doesn't have a common ancestor? If not, can we make the repositories appear to have common ancestry?).
Cheers, Simon
_______________________________________________ darcs-users mailing list darcs-users@darcs.net http://www.abridgegame.org/mailman/listinfo/darcs-users
participants (7): Daan Leijen, David Roundy, John Goerzen, Juliusz Chroboczek, Peter Simons, Simon Marlow, zooko@zooko.com