
On 3 July 2012 20:38, Johan Tibell wrote:
On Mon, Jul 2, 2012 at 3:14 PM, Duncan Coutts wrote:
Something to keep in mind is memory usage. I know Jeremy is looking at this from the infrastructure side, but I think from the app side there are also some likely culprits. Cabal's GenericPackageDescription type is very large in memory. Having tens of thousands of these means lots of memory. One hopefully easy way to save memory here, without going to the hassle of redoing Cabal's type definitions, is simply to increase sharing. There's a huge amount of repeated information. Start by sharing all the package names and versions. Then there's other metadata that rarely changes between versions of the same package. This kind of thing should be easy to evaluate: just write a test program that reads the index file and looks at peak memory use. Then try sharing stuff and see how much it drops. This sharing optimisation would still be useful even if we later go and redo GenericPackageDescription to be more compact.
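To make the sharing idea concrete, here is a minimal sketch of interning package names, so that every description of the same package points at one heap copy of the name rather than its own. It is not Cabal's or hackage-server's code: `PkgId`, `Pool`, `intern` and `shareNames` are hypothetical names invented for illustration, and `PkgId` is only a stand-in for the name and version part of a GenericPackageDescription.

```haskell
import qualified Data.Map.Strict as Map
import Data.List (mapAccumL)

-- Hypothetical stand-in for the (name, version) part of a package description.
data PkgId = PkgId { pkgName :: String, pkgVersion :: [Int] }
  deriving (Eq, Show)

-- A pool of canonical copies, keyed by the value itself.
type Pool a = Map.Map a a

-- Return the canonical copy of a value, adding it to the pool if unseen.
intern :: Ord a => Pool a -> a -> (Pool a, a)
intern pool x = case Map.lookup x pool of
  Just canonical -> (pool, canonical)        -- reuse the existing heap object
  Nothing        -> (Map.insert x x pool, x) -- first occurrence becomes canonical

-- Replace every package name with its canonical copy, threading the pool through.
shareNames :: [PkgId] -> (Pool String, [PkgId])
shareNames = mapAccumL step Map.empty
  where
    step pool p =
      let (pool', n) = intern pool (pkgName p)
      in n `seq` (pool', p { pkgName = n })  -- force the lookup so the duplicate can be collected

main :: IO ()
main = do
  let pkgs = [ PkgId "bytestring" [0,9]
             , PkgId "bytestring" [0,10]
             , PkgId "text" [0,11] ]
      (pool, shared) = shareNames pkgs
  print (Map.size pool)   -- number of distinct names: 2
  print (shared == pkgs)  -- values are unchanged, only sharing differs: True
```

To evaluate the effect as suggested above, one could build a test program with `-rtsopts` and run it with `+RTS -s`, comparing the reported maximum residency with and without the interning pass. The same approach would extend to versions and other rarely changing fields, at the cost of one pool per field type.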
This should not hold up the launch of Hackage 2 (which is very important) but I think it's an important issue that we need to address: we don't want to store what is perhaps the most important data the Haskell community has in an experimental data store! Creating a correct data store (i.e. ACID) that also handles a moderate amount of load is quite a difficult undertaking and it shouldn't be taken lightly. Let's stick the data in some SQL database and spend our energy on other things. :)
I still disagree that going with an external SQL db will be easier. The big advantage of the acid-state (and similar) data stores is that they let us use Haskell types properly and don't imply a separate external data model and a marshalling stage. That said, I also do not trust acid-state for long-term storage (simply because the binary format it uses isn't sensible), which is why the hackage server already has a system for dumping and restoring to standard formats (like CSV, tarballs etc). So if we use this backup system properly (i.e. in combination with a system for backups to other machines) then I think there's little chance of data loss. Additionally, the really important data (the packages) are stored in the file system.

Duncan