
On Wed, Jan 23, 2008 at 03:26:51PM +0000, Simon Marlow wrote:
> There are still times when I see nothing happening, for example in the unpull test on the GHC repo (see previous messages), the last progress message I get is
>
>   Reading patches in /64playpen/simonmar/ghc-darcs2 17040
>
> and it sits there for 7-8 seconds before completing. Does this maybe shed any light on why this unpull is 2 times slower than darcs1?
I've finally got an idea what's causing this. Actually, there are three operations in which the hashed repositories don't scale quite as well as the old format, all of which are "pretty fast", so the cost only becomes visible on huge repositories like yours.

The first two scale with the number of files and directories in the repository, and both are essentially garbage collection of the pristine cache. One goes through the _darcs/pristine.hashed/ directory and compares each file there against a list of files that should be there, deleting any file that isn't in that list. This is currently O(N^2) in the number of files and directories in the repository, because we use a plain list to do this set comparison. Using Data.Set should make it O(N log N), which is probably more than enough to make it negligible. It's already pretty fast, even on the ghc repo, so the improvement may not even be noticeable. (There's a rough sketch of this cleanup and the next one at the end of this message.)

The second is similar to the first: it's when we go through the global cache to remove any unused files. Here we use the link count, and remove any file whose link count is 1. That means running stat on each file to get its link count, so this is only O(N), where N is the total number of distinct files and directories (i.e. with different hashes) present in *all* the repositories you use. It scales better than the previous one, but if stat is slow on the cache, or if you've got very many different large repositories, it could be slow.

The third (and possibly largest) issue is the writing of the inventory files. This is (thankfully) independent of the number or size of files in the repository, and depends only on the number of patches and tags. It's roughly O(N+M), where N is the number of patches and tags (with a different prefactor on the two, but who cares?) and M is the number of new or modified patches. This isn't *bad* scaling, but when you've got 17k patches, O(N) adds up. This is most likely the primary culprit, but it's tricky to fix, since as far as I can imagine we'd need to change the PatchSet data type (currently just a nested RL list) to cache the hash values, and change every function that manipulates them to invalidate the cache when it makes changes. :(

I've added progress reporting to all three of these (and it seems to be working). All three could be sped up in some way (the second, perhaps, just by avoiding writing "pristine" files to the global cache in the first place, or by not cleaning them up). But I'd rather hear from you where the pain is before going ahead, since I've got more work right now than I can handle.

Also, if you want (way!) more information when tracking down timings, the progress reporting now interacts with the --debug flag to generate enough data to kill a horse. You could also add the --timings flag, which will add some timestamps (alas, with only 1s resolution) that might be helpful.

--
David Roundy
Department of Physics
Oregon State University
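
P.S. To make the first two cleanups a bit more concrete, here's a rough, untested sketch of the sort of thing I mean. The module and function names are made up for illustration; this isn't the actual darcs code.

-- Rough, untested sketch; module and function names are made up for
-- illustration and are not the actual darcs code.
module CleanupSketch ( cleanPristine, cleanGlobalCache ) where

import qualified Data.Set as Set
import Control.Monad ( forM_, unless, when )
import System.Directory ( getDirectoryContents, removeFile )
import System.FilePath ( (</>) )
import System.Posix.Files ( getFileStatus, linkCount )

-- Garbage-collect _darcs/pristine.hashed/: delete every file whose name
-- (a hash) is not in the set of hashes we expect to find there.  Set
-- membership is O(log N), so the whole pass is O(N log N) rather than
-- the O(N^2) we get from comparing against a plain list.
cleanPristine :: FilePath -> [String] -> IO ()
cleanPristine dir expectedHashes = do
    let expected = Set.fromList expectedHashes
    files <- filter (`notElem` [".", ".."]) `fmap` getDirectoryContents dir
    forM_ files $ \f ->
        unless (f `Set.member` expected) $ removeFile (dir </> f)

-- Garbage-collect the global cache: stat each file and remove the ones
-- with a link count of 1, i.e. files that no repository hard-links any
-- more.  One stat per file, so this is O(N) in the number of distinct
-- hashed files in the cache.
cleanGlobalCache :: FilePath -> IO ()
cleanGlobalCache cacheDir = do
    files <- filter (`notElem` [".", ".."]) `fmap` getDirectoryContents cacheDir
    forM_ files $ \f -> do
        let path = cacheDir </> f
        st <- getFileStatus path
        when (linkCount st == 1) $ removeFile path

The details in darcs itself differ, but the Data.Set lookup and the link-count test are the relevant bits.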