Re: [Hackage] #288: the package indexes are very slow

#288: the package indexes are very slow
---------------------------------+------------------------------------------
  Reporter:  duncan               |        Owner:
      Type:  defect               |       Status:  new
  Priority:  normal               |    Milestone:
 Component:  cabal-install tool   |      Version:  HEAD
  Severity:  normal               |     Keywords:
Difficulty:  easy (<4 hours)      |   Ghcversion:  6.8.2
  Platform:                       |
---------------------------------+------------------------------------------

Comment(by duncan):

Replying to [comment:5 AntoineLatter]:
> Experience report:
>
> Taking the tar-index from hackage-server wasn't too hard. It doesn't scale to Hackage-sized tarballs, though: it is only able to store the offsets for about half of the .cabal files in 00-index.tar.
It should be straightforward to extend the size of the types used to cope with bigger tarballs. The only cost will be a bigger index. The reason for the limitations in the hackage code is simply to save space by keeping the indexes very compact.

--
Ticket URL: http://hackage.haskell.org/trac/hackage/ticket/288#comment:6
Hackage http://haskell.org/cabal/
Hackage: Cabal and related projects
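
For illustration, a minimal Haskell sketch of the kind of widening described in the comment above; the type names and the 512-byte-block encoding are invented for the example and are not the actual hackage-server representation.

-- Purely illustrative, not the hackage-server index types: if entry
-- offsets are stored in a small fixed-width type to keep the index
-- compact, the index can only describe a prefix of a large 00-index.tar;
-- widening the offset type lifts that cap at the cost of a larger index.
import Data.Word (Word16, Word64)

-- Compact: offsets counted in 512-byte tar blocks and stored as Word16
-- can only reach 32 MiB into the tarball; later entries cannot be indexed.
newtype CompactOffset = CompactOffset Word16

-- Wide: a Word64 byte offset covers any index Hackage is likely to grow.
newtype WideOffset = WideOffset Word64

-- Entries whose offsets do not fit the compact type are the ones that the
-- experience report above says get dropped from the index.
toCompact :: WideOffset -> Maybe CompactOffset
toCompact (WideOffset byteOff)
  | blockOff <= fromIntegral (maxBound :: Word16) = Just (CompactOffset (fromIntegral blockOff))
  | otherwise                                     = Nothing
  where
    blockOff = byteOff `div` 512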

On Wed, Jul 14, 2010 at 4:48 PM, Hackage wrote:
> Comment(by duncan):
>
> It should be straightforward to extend the size of the types used to cope with bigger tarballs. The only cost will be a bigger index. The reason for the limitations in the hackage code is simply to save space by keeping the indexes very compact.

Yup. My branch of cabal-install has this done:

  http://community.haskell.org/~aslatter/code/cabal-install/index/

which requires:

  http://community.haskell.org/~aslatter/code/tarindex/

This should be considered a prototype, as we would need to consider more carefully what dependencies we should be pulling in for this.

For "cabal list $pkgname" I go from ~1s to ~550ms. For "cabal install --dry-run happstack" I go from ~2.5s to ~1.9s.

According to GHC profiling, most of the time spent in "cabal list" is in looking up paths in the tar-index offset store (this happens ~9000 times). I don't have profiling data across package boundaries for some reason, so there might yet be low-hanging fruit over there.

Antoine
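
For context, a rough sketch of what such an offset store buys; the names and the Map-based index are invented for the example and are not the API of the tarindex package. With a path-to-offset index over an uncompressed 00-index.tar, a single .cabal file can be read with one seek instead of streaming the whole tarball, which is presumably why the remaining cost shows up in the path lookups themselves.

-- Hypothetical sketch only; not the tarindex package's API.  Given a map
-- from entry path to (byte offset, size) inside an uncompressed
-- 00-index.tar, one entry can be fetched with a single seek.  "cabal list"
-- needs many such entries, so the per-lookup cost is paid thousands of
-- times, which matches the ~9000 lookups seen in the profile above.
import qualified Data.ByteString as BS
import qualified Data.Map.Strict as Map
import Data.Word (Word64)
import System.IO (IOMode (ReadMode), SeekMode (AbsoluteSeek), hSeek, withBinaryFile)

-- Entry path (e.g. "foo/1.0/foo.cabal") mapped to content offset and size.
type TarIndexSketch = Map.Map FilePath (Word64, Int)

-- Read one indexed entry directly from the uncompressed index tarball.
readIndexedEntry :: FilePath -> TarIndexSketch -> FilePath -> IO (Maybe BS.ByteString)
readIndexedEntry tarFile index entryPath =
  case Map.lookup entryPath index of
    Nothing             -> return Nothing
    Just (offset, size) ->
      withBinaryFile tarFile ReadMode $ \h -> do
        hSeek h AbsoluteSeek (fromIntegral offset)
        fmap Just (BS.hGet h size)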

On Wed, 2010-07-14 at 22:37 -0500, Antoine Latter wrote:
> On Wed, Jul 14, 2010 at 4:48 PM, Hackage wrote:
> > Comment(by duncan):
> >
> > It should be straightforward to extend the size of the types used to cope with bigger tarballs. The only cost will be a bigger index. The reason for the limitations in the hackage code is simply to save space by keeping the indexes very compact.
>
> Yup. My branch of cabal-install has this done:
>
>   http://community.haskell.org/~aslatter/code/cabal-install/index/
>
> which requires:
>
>   http://community.haskell.org/~aslatter/code/tarindex/
>
> This should be considered a prototype, as we would need to consider more carefully what dependencies we should be pulling in for this.
>
> For "cabal list $pkgname" I go from ~1s to ~550ms. For "cabal install --dry-run happstack" I go from ~2.5s to ~1.9s.
>
> According to GHC profiling, most of the time spent in "cabal list" is in looking up paths in the tar-index offset store (this happens ~9000 times). I don't have profiling data across package boundaries for some reason, so there might yet be low-hanging fruit over there.

Nice, I'll try and find some time to look at this.

Duncan

On Wed, Jul 14, 2010 at 10:37 PM, Antoine Latter wrote:
> http://community.haskell.org/~aslatter/code/cabal-install/index/
>
> which requires:
>
> http://community.haskell.org/~aslatter/code/tarindex/
>
> For "cabal install --dry-run happstack" I go from ~2.5s to ~1.9s.

Ditching Numeric.readOct for parsing tar headers brought this down to ~1.7s, which brings us to the point where our biggest line item is outside of the TarIndex functionality. We still spend ~33% of our time in indexing in some form.

If I get time I'll look into getting rid of Numeric.readOct in the main-line (Data|Distribution.Client).Tar package.

Antoine
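
Not Antoine's actual patch, but a sketch of the kind of specialised parser that replaces Numeric.readOct here: tar headers encode numeric fields as space/NUL padded octal ASCII, so a tight loop over a strict ByteString avoids the String conversion and list-based parsing that readOct implies.

-- Illustrative only; not the code in cabal-install's Tar module.
import qualified Data.ByteString.Char8 as BS
import Data.Char (isOctDigit)

-- Parse a space/NUL padded octal header field,
-- e.g. readOctalField (BS.pack "0000644 ") == Just 420.
readOctalField :: BS.ByteString -> Maybe Integer
readOctalField field
  | BS.null digits = Nothing
  | otherwise      = Just (BS.foldl' step 0 digits)
  where
    digits     = BS.takeWhile isOctDigit (BS.dropWhile (== ' ') field)
    step acc c = acc * 8 + toInteger (fromEnum c - fromEnum '0')

The field widths are fixed by the tar format, so the argument can simply be a slice taken from the 512-byte header block.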
participants (3)

- Antoine Latter
- Duncan Coutts
- Hackage