[Hackage] #428: cabal update should use rsync over tree instead of GET on a monolithic file

#428: cabal update should use rsync over tree instead of GET on a monolithic file
----------------------------+-----------------------------------------------
  Reporter:  claus          |        Owner:
      Type:  defect         |       Status:  new
  Priority:  normal         |    Milestone:
 Component:  Cabal library  |      Version:  1.6.0.1
  Severity:  normal         |     Keywords:
Difficulty:  normal         |   Ghcversion:  6.8.3
  Platform:                 |
----------------------------+-----------------------------------------------
 `cabal update` appears to download a compressed tarball (>600k and growing), which is larger than most packages, takes a while over a slow line, and doesn't provide much info (`-v` just adds low-level details).

 Using `rsync` over the relevant parts of the directory tree ought to be faster (increasingly so as hackage keeps growing), use less bandwidth, and be able to tell me which package descriptions have changed (although it would be nice to limit the latter info to new packages and changes to installed packages).

-- 
Ticket URL: http://hackage.haskell.org/trac/hackage/ticket/428
Hackage http://haskell.org/cabal/
Hackage: Cabal and related projects

#428: cabal update uses too much bandwidth
---------------------------------+------------------------------------------
  Reporter:  claus               |        Owner:
      Type:  defect              |       Status:  new
  Priority:  normal              |    Milestone:  cabal-install-0.8
 Component:  cabal-install tool  |      Version:  1.6.0.1
  Severity:  normal              |   Resolution:
  Keywords:                      |   Difficulty:  hard (< 1 day)
Ghcversion:  6.8.3               |     Platform:
---------------------------------+------------------------------------------
Changes (by duncan):

  * milestone:  => cabal-install-0.8

-- 
Ticket URL: http://hackage.haskell.org/trac/hackage/ticket/428#comment:2

#428: cabal update uses too much bandwidth

Comment (by claus):

 The `cabal` tool could try for `rsync` and fall back to the current method if that isn't available/usable. That would work even for Windows Cygwin (and presumably MSYS?) users who have `rsync` installed.

 Alternatively, put the index dirs/files into a `darcs` repo, and have `cabal` try for `darcs` first.

 But why not use good old `diff` or `find` on the server side (a hackage server service that returns a list of files/dirs changed), then fetch only the files/dirs that have changed (possibly with some large cutoff: if everything has changed, it is cheaper to fetch one tar-file instead of lots of little files)?

 If running a server `find` for each `cabal update` turns out to be a problem, one could instead provide weekly update lists on the server, with the clients consulting as many of those as needed (fetching the whole index tarball if the local index is more than a couple of months old).

-- 
Ticket URL: http://hackage.haskell.org/trac/hackage/ticket/428#comment:3
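
The cutoff idea in the comment above can be sketched roughly as follows. This is only an illustration in Python, not anything `cabal` or the hackage server implements: the `UpdatePlan` type, the `plan_update` function, the 0.5 default cutoff and the example paths are all assumptions made up for the sketch. It takes the list of changed paths that a hypothetical server-side service would return and decides between per-file fetches and one full tarball download.

```python
from dataclasses import dataclass, field


@dataclass
class UpdatePlan:
    """Hypothetical result type: how the client should refresh its index."""
    mode: str                      # "incremental" or "full"
    paths: list = field(default_factory=list)


def plan_update(changed_paths, total_files, cutoff=0.5):
    """Pick per-file fetches unless too large a share of the index changed.

    `cutoff` is an assumed tuning knob: above this fraction of changed
    files, one tarball download is taken to be cheaper than many small GETs.
    """
    if total_files == 0 or len(changed_paths) > cutoff * total_files:
        return UpdatePlan("full")
    return UpdatePlan("incremental", sorted(changed_paths))


# Example with made-up paths: two changed descriptions out of 1000.
plan = plan_update(["foo/1.0/foo.cabal", "bar/0.2/bar.cabal"], 1000)
print(plan.mode, plan.paths)
```

A real cutoff would probably weigh bytes and per-request overhead rather than a plain file count; the count is used here only to keep the sketch short.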

#428: cabal update uses too much bandwidth

Comment (by duncan):

 One approach I was thinking of was providing the uncompressed tarball and mostly using it append-only. So most clients could do a conditional request for the byte range from the point they currently have to the end of the file. If the cache ends up not matching, then the client can just request the whole compressed tarball.

 That uses standard HTTP/1.1 without needing anything special on the server side, which is important if we want to let people host dumb repos easily.

-- 
Ticket URL: http://hackage.haskell.org/trac/hackage/ticket/428#comment:4
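
A minimal sketch of the client-side logic for the append-only idea, in Python. It makes no network calls and assumes several things the comment does not specify: the function names are hypothetical, and the "cache does not match" check is modelled as comparing a SHA-256 digest that the client is assumed to have learned by some other means. Only the `Range: bytes=N-` header form and the 206/200 status codes are standard HTTP.

```python
import hashlib


def tail_range_header(local_size):
    """Ask only for the bytes past what the local copy already holds."""
    return {"Range": f"bytes={local_size}-"}


def merge_tail(local_bytes, status, body, expected_sha256):
    """Return the updated index bytes, or None if a full download is needed.

    `expected_sha256` is an assumption of this sketch: some trusted digest
    of the current uncompressed index, obtained separately.
    """
    if status == 206:        # Partial Content: body is just the new tail
        candidate = local_bytes + body
    elif status == 200:      # server ignored Range and sent the whole file
        candidate = body
    else:
        return None
    if hashlib.sha256(candidate).hexdigest() != expected_sha256:
        return None          # cache did not match: fall back to full fetch
    return candidate


# Example with made-up contents: the server appended b"-new" to the index.
old, new = b"index-v1", b"index-v1-new"
digest = hashlib.sha256(new).hexdigest()
print(tail_range_header(len(old)))
print(merge_tail(old, 206, b"-new", digest) == new)
```

When `merge_tail` returns `None`, the client would fall back to fetching the whole compressed tarball, as the comment suggests.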

#428: cabal update uses too much bandwidth

Comment (by igloo):

 FWIW, what Debian/apt does is, when making a new package list:

  * Run `diff -e` (output an ed script) between the last package list and the new one
  * Add a line with the hash of the last package list, and the script filename, to the index
  * Garbage collect old lines from the index as appropriate (e.g. leave at most n lines in the index, remove entries more than d days old, etc.; in Debian it's easier as the package list is updated exactly once a day), along with the scripts that those lines point to.

 Then to update the index you:

  * Download the index
  * If the hash of your package list is in the index, download and apply all scripts since then
  * Otherwise, download the whole new package list

 Example index is
 http://ftp.uk.debian.org/debian/dists/unstable/main/binary-i386/Packages.dif...
 with scripts in the
 http://ftp.uk.debian.org/debian/dists/unstable/main/binary-i386/Packages.dif...
 directory.

 To do this for hackage, cabal-install would need to be able to apply ed scripts itself, or at least enough of ed that it can apply the scripts that `diff -e` makes.

-- 
Ticket URL: http://hackage.haskell.org/trac/hackage/ticket/428#comment:5
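
To give a feel for how small "enough of ed" might be, here is a rough Python sketch of the client side of the scheme described above. It is not what apt or cabal-install does internally; the function names and the in-memory index layout (a list of `(hash_of_previous_list, script_name)` pairs, oldest first) are assumptions for illustration. It handles only the `a`, `c` and `d` commands, and relies on `diff -e` emitting commands from the bottom of the file upwards, so line numbers stay valid as edits are applied. The special encoding `diff -e` uses for text lines consisting of a single `.` is not handled.

```python
import re

_CMD = re.compile(r"^(\d+)(?:,(\d+))?([acd])$")


def scripts_since(index, local_hash):
    """Names of the scripts to apply, or None if a full download is needed."""
    for i, (old_hash, _name) in enumerate(index):
        if old_hash == local_hash:
            return [name for _hash, name in index[i:]]
    return None


def apply_ed_script(lines, script):
    """Apply the a/c/d subset of ed commands to a list of lines."""
    out = list(lines)
    cmds = iter(script)
    for cmd in cmds:
        m = _CMD.match(cmd)
        if m is None:
            raise ValueError(f"unsupported ed command: {cmd!r}")
        start = int(m.group(1))
        end = int(m.group(2) or start)
        op = m.group(3)
        text = []
        if op in "ac":
            # Text block follows, terminated by a line holding only "."
            for line in cmds:
                if line == ".":
                    break
                text.append(line)
        if op == "a":
            out[start:start] = text        # insert after line `start`
        else:
            out[start - 1:end] = text      # "c" replaces, "d" deletes
    return out


# Example with made-up package lines.
old = ["foo 1.0", "bar 0.1", "baz 2.0"]
script = ["3d", "2c", "bar 0.2", ".", "0a", "aaa 0.1", "."]
print(apply_ed_script(old, script))
```

A client would call `scripts_since` with the hash of its local package list, fetch and apply each named script in order with `apply_ed_script`, and download the whole list when `scripts_since` returns `None`.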