
Hi all,

I'm planning to spend some time, on behalf of the Industrial Haskell Group, working on Hackage 2 in the coming weeks. As such, I've been trying to work out what the blockers are in terms of actually getting the Hackage 2 server live.

I've started from the "Current TODOs" section of
    http://hackage.haskell.org/trac/hackage/wiki/HackageDB/2.0
and the rest of this mail is a brief description of what I've found out, and my conclusions.

Active tickets against Hackage 2
--------------------------------

There are currently 8 tickets:

#911  Package uploading is completely unsecured                  high
#916  Verify that HTTP interface is fully/properly implemented   high
#918  Working documentation builder                              high
#426  .cabal files should be stored next to tarballs,
      potentially overriding in-tarball version                  normal
#913  New HTML theme                                             normal
#914  Fix acid-state usage                                       normal
#919  hackage-mirror should handle errors more gracefully        normal
#915  Convert to modular use of type-safe URLs                   low

Now #913 I assume is not a blocker. #919 I assume is also not a blocker. And #914 and #915 are improvements to the internals, so presumably also not blockers. #426 is not supported by Hackage 1 as far as I can see, so is not a blocker as it is not a regression.

So that leaves 3 tickets as blockers:

#911: We need to do something here. With Hackage 1, it takes manual approval before you can upload packages, and at the very least Hackage 2 should match that. I have the impression that that is already possible (by restricting package upload to a group, and requiring accounts to be added to that group by an admin), but I haven't confirmed that yet.

#916: At the very least, Hackage 2 needs to support URLs that Hackage 1 supported (unless a conscious decision has been made not to). Ideally we would get the URLs right on the initial release, so that people don't start using the wrong ones. Doesn't sound hard, so may as well do before the switchover.

#918: This is the main functionality currently missing from the user's point of view.

HackageDB
---------

I don't think there are any blockers on this page.

HackageToDo
-----------

Package builds: Not a regression, so not a blocker.
Haddocking: See #918 above.
Hoogle database: Not a regression, so not a blocker.
HsColour: Hackage 1 supports this. Should be done as part of #918.
Testing builds: Not a regression, so not a blocker.
Running testsuites: Not a regression, so not a blocker.
Packages properties: Meta info is already available, so not a blocker.
Queries: Not a regression, so not a blocker.
Show repository: Small regression; may as well fix.
README markup etc: Not a regression, so not a blocker.
Uploads: All sub-points either mentioned already or new features. Not blockers.
DOAP: Not a regression, so not a blocker.
Gateways: Should support the existing "Distributions" files.

Documentation needs love
------------------------

I'm not sure whether user or developer documentation is intended here. Without more details, I'll treat this as not a blocker.

Further context
---------------

I didn't see any other blockers in the old mails linked to.

Conclusion
----------

I think the following are the blockers for deploying Hackage 2:

* #911 upload perms; may be good enough already
* #916 check URLs are OK
* #918 build haddock (and HsColour) docs
* Show source repository on package pages
* Support the existing "Distributions" files, and show info on package pages

(plus enough testing to give us confidence in it, of course).

Does that match other people's opinions? Did I miss anything?

Thanks
Ian

On Mon, 2012-07-02 at 12:25 +0100, Ian Lynagh wrote:
Hi all,
I'm planning to spend some time, on behalf of the Industrial Haskell Group, working on Hackage 2 in the coming weeks.
[..]
Now #913 I assume is not a blocker. #919 I assume is also not a blocker. And #914 and #915 are improvements to the internals, so presumably also not blockers. #426 is not supported by Hackage 1 as far as I can see, so is not a blocker as it is not a regression.
I agree with this analysis.
So that leaves 3 tickets as blockers:
#911: We need to do something here. With Hackage 1, it takes manual approval before you can upload packages, and at the very least Hackage 2 should match that. I have the impression that that is already possible (by restricting package upload to a group, and requiring accounts to be added to that group by an admin), but I haven't confirmed that yet.
Right, I don't think we need to do any more than make sure uploaders are in the appropriate group. It *should* currently be the case that only accounts in the package group can upload, and the first time you upload a new named package then you get added as the initial member of the new package group. Currently for testing purposes anyone can register an account and can then upload new packages. We have two options here: restrict account creation to be manual like in hackage 1, or add a new system-wide "uploaders" group for accounts that are authorised to upload new packages and have a manual admin step to add people to the uploaders group. The latter will allow for registered users who are not uploaders which would be useful later to allow things like non-anonymous commenting etc.
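The second option (a system-wide "uploaders" group plus per-package groups) can be sketched as below. This is a minimal illustration under stated assumptions, not the hackage-server implementation: the names `Groups`, `canUpload` and `recordUpload` are hypothetical, and the real server keeps this state in its own feature modules.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type UserName = String
type PackageName = String

-- Hypothetical stand-in for the server's group state.
data Groups = Groups
  { uploaders   :: Set.Set UserName                        -- system-wide group, edited by admins
  , maintainers :: Map.Map PackageName (Set.Set UserName)  -- one group per package name
  }

data Decision = Allowed | Denied String
  deriving (Eq, Show)

-- An existing package can only be uploaded by a member of its own group.
-- A new package name can only be claimed by a member of "uploaders".
canUpload :: Groups -> UserName -> PackageName -> Decision
canUpload gs user pkg =
  case Map.lookup pkg (maintainers gs) of
    Just members
      | user `Set.member` members -> Allowed
      | otherwise -> Denied "not in the package's group"
    Nothing
      | user `Set.member` uploaders gs -> Allowed
      | otherwise -> Denied "not in the uploaders group"

-- After a permitted upload of a new package name, the uploader becomes the
-- initial member of the new package group. Existing groups are left untouched.
recordUpload :: UserName -> PackageName -> Groups -> Groups
recordUpload user pkg gs =
  gs { maintainers = Map.insertWith (\_new old -> old) pkg (Set.singleton user) (maintainers gs) }
```

With this shape, a registered account that is in neither group can still exist (for example, for non-anonymous commenting later) without being able to upload anything.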
#916: At the very least, Hackage 2 needs to support URLs that Hackage 1 supported (unless a conscious decision has been made not to). Ideally we would get the URLs right on the initial release, so that people don't start using the wrong ones. Doesn't sound hard, so may as well do before the switchover.
Yes, barring mistakes this should work already. There's a "legacy" feature module that provides a bunch of redirects.
#918: This is the main functionality currently missing from the user's point of view.
Right. You'll see there's some code for a doc builder. This needs to be improved. Perhaps it can share code with the mirror client which is reasonably robust.
Conclusion
----------
I think the following are the blockers for deploying Hackage 2:
* #911 upload perms; may be good enough already
* #916 check URLs are OK
* #918 build haddock (and HsColour) docs
Right.
* Show source repository on package pages
Should be easy to port that from the old code.
* Support the existing "Distributions" files, and show info on package pages
I advocated at the time the feature was added that it should be done differently so that the hackage server does not poll some url, but people in charge of distros push instead. I think it would not be a blocker to not implement the distribution info system as it is now and when eventually spending the time to implement it, switch to doing it in a more sensible way.
(plus enough testing to give us confidence in it, of course).
One of the main things here is adding tests that the database dump/restore mechanism round trips correctly.
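A round-trip test of this kind has a simple shape: dump, restore, and compare with the original. The sketch below assumes a hypothetical `UserRow` record and a toy one-line-per-row text format; the server's real dump formats and types are not shown in this thread.

```haskell
-- Hypothetical record standing in for one row of server state.
data UserRow = UserRow { userId :: Int, userName :: String }
  deriving (Eq, Show)

-- Dump to a simple "id,name" line format (an assumption for illustration).
dumpRows :: [UserRow] -> String
dumpRows = unlines . map (\r -> show (userId r) ++ "," ++ userName r)

-- Restore from that format. Nothing signals a malformed dump.
restoreRows :: String -> Maybe [UserRow]
restoreRows = traverse parseRow . lines
  where
    parseRow l = case break (== ',') l of
      (n, ',' : name) | [(i, "")] <- reads n -> Just (UserRow i name)
      _ -> Nothing

-- The property to test: restoring a dump gives back exactly what was dumped.
roundTrips :: [UserRow] -> Bool
roundTrips rows = restoreRows (dumpRows rows) == Just rows
```

The toy format deliberately does no escaping, so `roundTrips` fails for a name that contains a newline. That is the sort of bug such a test is meant to catch, and a property-testing library could be used to generate awkward inputs like that automatically.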
Does that match other people's opinions? Did I miss anything?
Looks good.

Something to keep in mind is memory usage. I know Jeremy is looking at this from the infrastructure side, but I think from the app side there's also some likely culprits. Cabal's GenericPackageDescription type is very large in memory. Having 10's of 1000's of these means lots of memory.

One hopefully easy way to save memory here without going to the hassle of redoing Cabal's type definitions is simply to increase sharing. There's a huge amount of repeated information. Start by sharing all the package names and versions. Then there's other meta-data that rarely changes between versions of the same package.

This kind of thing should be easy to evaluate: just write a test prog that reads the index file and looks at peak memory use. Then try sharing stuff and see how much it drops. This sharing optimisation would still be useful even if later we go and redo GenericPackageDescription to be more compact.

Duncan
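The sharing idea can be sketched with a small intern pool: every equal string is replaced by the first copy seen, so repeated package names and versions point at one heap object. This is only an illustration under assumptions: `Entry` is a hypothetical stand-in for the much larger GenericPackageDescription, and `Pool`, `intern` and `shareEntries` are names made up here.

```haskell
import Data.List (mapAccumL)
import qualified Data.Map.Strict as Map

-- Maps each distinct string to the one copy that everything should share.
type Pool = Map.Map String String

-- Return the pooled copy of a string, adding it to the pool on first sight.
intern :: Pool -> String -> (Pool, String)
intern pool s =
  case Map.lookup s pool of
    Just shared -> (pool, shared)
    Nothing     -> (Map.insert s s pool, s)

-- Hypothetical stand-in for one parsed entry of the package index.
data Entry = Entry { entryName :: String, entryVersion :: String }
  deriving (Eq, Show)

-- Rebuild the list so that equal names and versions share one copy.
shareEntries :: [Entry] -> [Entry]
shareEntries = snd . mapAccumL step Map.empty
  where
    step pool e =
      let (pool1, n) = intern pool (entryName e)
          (pool2, v) = intern pool1 (entryVersion e)
      in (pool2, Entry n v)
```

The result is equal to the input; only the heap layout differs. As written the rebuild is lazy, so a real experiment would need to force the pooled fields before measuring peak memory, otherwise unevaluated thunks could keep the unshared copies alive.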

On Mon, Jul 02, 2012 at 08:14:01PM +0100, Duncan Coutts wrote:
On Mon, 2012-07-02 at 12:25 +0100, Ian Lynagh wrote:
Conclusion
----------
I think the following are the blockers for deploying Hackage 2:
* #911 upload perms; may be good enough already
* #916 check URLs are OK
* #918 build haddock (and HsColour) docs
I forgot that the bug tracker had moved to GitHub. So actually these are now:

* #901 upload perms; may be good enough already
* #906 check URLs are OK
* #908 build haddock (and HsColour) docs

and are the tickets marked "important" or "urgent" on
    https://github.com/haskell/cabal/issues?labels=hackage2&page=1&state=open
* Show source repository on package pages
Should be easy to port that from the old code.
I've filed #965 (hackage2, important) for that.
* Support the existing "Distributions" files, and show info on package pages
I advocated at the time the feature was added that it should be done differently so that the hackage server does not poll some url, but people in charge of distros push instead. I think it would not be a blocker to not implement the distribution info system as it is now and when eventually spending the time to implement it, switch to doing it in a more sensible way.
OK, I won't treat that as a blocker then.
(plus enough testing to give us confidence in it, of course).
One of the main things here is adding tests that the database dump/restore mechanism round trips correctly.
#966 (hackage2, important) filed.
Something to keep in mind is memory usage.
Will do, but currently I don't think this is a blocker for deploying 2.0.

Thanks
Ian

On Tue, Jul 3, 2012 at 2:27 PM, Ian Lynagh
On Mon, Jul 02, 2012 at 08:14:01PM +0100, Duncan Coutts wrote:
Something to keep in mind is memory usage.
Will do, but currently I don't think this is a blocker for deploying 2.0.
Isn't it the reason why the test server (http://hackage.factisresearch.com/) is constantly down? Or is that just because no-one's paying much attention?

On Mon, Jul 2, 2012 at 3:14 PM, Duncan Coutts
Something to keep in mind is memory usage. I know Jeremy is looking at this from the infrastructure side, but I think from the app side there's also some likely culprits. Cabal's GenericPackageDescription type is very large in memory. Having 10's of 1000's of these means lots of memory. One hopefully easy way to save memory here without going to the hassle of redoing Cabal's type definitions is simply to increase sharing. There's a huge amount of repeated information. Start by sharing all the package names and versions. Then there's other meta-data that rarely changes between versions of the same package. This kind of thing should be easy to evaluate, just write a test prog that reads the index file and look at peak memory use. Then try sharing stuff and see how much it drops. This sharing optimisation would still be useful even if later we go and redo GenericPackageDescription to be more compact.
This should not hold up the launch of Hackage 2 (which is very important) but I think it's an important issue that we need to address: we don't want to store perhaps the most important data the Haskell community has in an experimental data store! Creating a correct data store (i.e. ACID) that also handles a moderate amount of load is quite a difficult undertaking and it shouldn't be taken lightly. Let's stick the data in some SQL database and spend our energy on other things. :)

Cheers,
Johan

On 3 July 2012 20:38, Johan Tibell
On Mon, Jul 2, 2012 at 3:14 PM, Duncan Coutts wrote:
Something to keep in mind is memory usage. I know Jeremy is looking at this from the infrastructure side, but I think from the app side there's also some likely culprits. Cabal's GenericPackageDescription type is very large in memory. Having 10's of 1000's of these means lots of memory. One hopefully easy way to save memory here without going to the hassle of redoing Cabal's type definitions is simply to increase sharing. There's a huge amount of repeated information. Start by sharing all the package names and versions. Then there's other meta-data that rarely changes between versions of the same package. This kind of thing should be easy to evaluate, just write a test prog that reads the index file and look at peak memory use. Then try sharing stuff and see how much it drops. This sharing optimisation would still be useful even if later we go and redo GenericPackageDescription to be more compact.
This should not hold up the launch of Hackage 2 (which is very important) but I think it's an important issue that we need to address: we don't want to store perhaps the most important data the Haskell community has in an experimental data store! Creating a correct data store (i.e. ACID) that also handles a moderate amount of load is quite a difficult undertaking and it shouldn't be taken lightly. Let's stick the data in some SQL database and spend our energy on other things. :)
I still disagree that going with an external SQL db will be easier. The big advantage of the acid-state (and similar) data stores is that they let us use Haskell types properly and don't imply a separate external data model and a marshalling stage.

That said, I also do not trust acid-state for long term storage (simply because the binary format it uses isn't sensible) which is why the hackage server already has a system for dumping and restoring to standard formats (like CSV, tarballs etc). So if we use this backup system properly (i.e. in combination with a system for backups to other machines) then I think there's little chance of data loss.

Additionally, the really important data (the packages) are stored in the file system.

Duncan

On Tue, Jul 3, 2012 at 4:05 PM, Duncan Coutts
I still disagree that going with an external SQL db will be easier. The big advantage of the acid-state (and similar) data stores is that they let us use Haskell types properly and don't imply a separate external data model and a marshalling stage.
This is moot if the data ends up being corrupted* or if the data store doesn't handle the load. :) This might be the cranky old engineer in me talking, but these things don't usually end well. Using something like mysql-simple to marshal the data is pretty convenient; it's very much like writing Binary instances for the data types.
Additionally, the really important data (the packages) are stored in the file system.
While this is true now (we don't have much data except the packages!) in my experience long term the user generated data (i.e. actions they perform on the Hackage site) will be the most valuable (as the packages can be regenerated from source if need be.) For example, using this data is how we're going to do ranking of packages. In fact, this data is what should make Hackage 2 an improvement over Hackage 1.

* It doesn't matter much if we can restore the data from backups. Any corruption will still cause downtime and will likely require both manual maintenance and bug fixing in acid-state.

-- Johan
participants (4)
- Ben Millwood
- Duncan Coutts
- Ian Lynagh
- Johan Tibell