
On Wed, Jan 23, 2008 at 03:26:51PM +0000, Simon Marlow wrote:
> There are still times when I see nothing happening, for example in the unpull test on the GHC repo (see previous messages), the last progress message I get is
>
>   Reading patches in /64playpen/simonmar/ghc-darcs2 17040
>
> and it sits there for 7-8 seconds before completing. Does this maybe shed any light on why this unpull is 2 times slower than darcs1?
I've finally got an idea what's causing this. Actually, there are three operations in which the hashed repositories don't scale quite as well as the old format, all of which are "pretty fast", so the cost only becomes visible on huge repositories like yours.

The first two scale with the number of files and directories in the repository, and both are essentially garbage collection of the pristine cache. One goes through the _darcs/pristine.hashed/ directory and compares each file there against a list of files that should be there, deleting any file that isn't in that list. This is currently O(N^2) in the number of files and directories in the repository, because we use a plain list to do this set comparison. Using Data.Set should make it O(N log N), which is probably more than enough to make it negligible. It's already pretty fast, even on the ghc repo, so the improvement may not even be noticeable. (There's a rough sketch of this cleanup and the next one at the end of this message.)

The second is similar to the first: it's when we go through the global cache to remove any unused files. Here we use the link count, and remove any file whose link count is 1. That means running stat on each file to get its link count, so this is only O(N), where N is the total number of distinct files and directories (i.e. with different hashes) present in *all* the repositories you use. It scales better than the previous one, but if stat is slow on the cache, or if you've got very many different large repositories, it could be slow.

The third (and possibly largest) issue is the writing of the inventory files. This is (thankfully) independent of the number or size of files in the repository, and depends only on the number of patches and tags. It's roughly O(N+M), where N is the number of patches and tags (with a different prefactor on the two, but who cares?) and M is the number of new or modified patches. This isn't *bad* scaling, but when you've got 17k patches, O(N) adds up. This is most likely the primary culprit, but it's tricky to fix, since as far as I can imagine we'd need to change the PatchSet data type (currently just a nested RL list) to cache the hash values, and change every function that manipulates them to invalidate the cache when it makes changes. :(

I've added progress reporting to all three of these (and it seems to be working). All three could be sped up in some way (the second, perhaps, just by avoiding writing "pristine" files to the global cache in the first place, or by not cleaning them up). But I'd rather hear from you where the pain is before going ahead, since I've got more work right now than I can handle.

Also, if you want (way!) more information when tracking down timings, the progress reporting now interacts with the --debug flag to generate enough data to kill a horse. You could also add the --timings flag, which will add some timestamps (alas, with only 1s resolution) that might be helpful.

--
David Roundy
Department of Physics
Oregon State University
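
P.S. To make the first two cleanups a bit more concrete, here's a rough, untested sketch of the sort of thing I mean. The module and function names are made up for illustration; this isn't the actual darcs code.

-- Rough, untested sketch; module and function names are made up for
-- illustration and are not the actual darcs code.
module CleanupSketch ( cleanPristine, cleanGlobalCache ) where

import qualified Data.Set as Set
import Control.Monad ( forM_, unless, when )
import System.Directory ( getDirectoryContents, removeFile )
import System.FilePath ( (</>) )
import System.Posix.Files ( getFileStatus, linkCount )

-- Garbage-collect _darcs/pristine.hashed/: delete every file whose name
-- (a hash) is not in the set of hashes we expect to find there.  Set
-- membership is O(log N), so the whole pass is O(N log N) rather than
-- the O(N^2) we get from comparing against a plain list.
cleanPristine :: FilePath -> [String] -> IO ()
cleanPristine dir expectedHashes = do
    let expected = Set.fromList expectedHashes
    files <- filter (`notElem` [".", ".."]) `fmap` getDirectoryContents dir
    forM_ files $ \f ->
        unless (f `Set.member` expected) $ removeFile (dir </> f)

-- Garbage-collect the global cache: stat each file and remove the ones
-- with a link count of 1, i.e. files that no repository hard-links any
-- more.  One stat per file, so this is O(N) in the number of distinct
-- hashed files in the cache.
cleanGlobalCache :: FilePath -> IO ()
cleanGlobalCache cacheDir = do
    files <- filter (`notElem` [".", ".."]) `fmap` getDirectoryContents cacheDir
    forM_ files $ \f -> do
        let path = cacheDir </> f
        st <- getFileStatus path
        when (linkCount st == 1) $ removeFile path

The details in darcs itself differ, but the Data.Set lookup and the link-count test are the relevant bits.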