Hackage 2 and acid-state vs traditional databases

Hi all, I've had a bit of experience with Hackage 2 and acid-state now, and I'm not convinced that it's the best fit for us: * It's slow. It takes about 5 minutes for me to stop and then start the server. It's actually surprising just how slow it is, so it might be possible/easy to get this down to seconds, but it still won't be instantaneous. * Memory usage is high. It's currently in the 700M-1G range, and to get it that low I had to stop the parsed .cabal files from being held in memory (which presumably has an impact on performance, although I don't know how significant that is), and disable the reverse dependencies feature. It will grow at least linearly with the number of package/versions in Hackage. * Only a single process can use the database at once. For example, if the admins want a tool that will make it easier for them to approve user requests, then that tool needs to be integrated into the Hackage server (or talk to it over HTTP), rather than being standalone. * The database is relatively opaque. While in principle tools could be written for browsing, modifying or querying it, currently none exist (as far as I know). * The above 2 points mean that, for example, there was no easy way for me to find out how many packages use each top-level module hierarchy (Data, Control, etc). This would have been a simple SQL query if the data had been in a traditional database, but as it was I had to write a Haskell program to process all the package .tar.gz's and parse the .cabal files manually. * acid-state forces us to use a server-process model, rather than having processes for individual requests run by apache. I don't know if we would have made this choice anyway, so this may or may not be an issue. But the current model does mean that adding a feature or fixing a bug means restarting the process, rather than just installing the new program in-place. Someone pointed out that one disadvantage of traditional databases is that they discourage you from writing as if everything was Haskell datastructures in memory. For example, if you have things of type data Foo = Foo { str :: String, bool :: Bool, ints :: [Int] } stored in a database then you could write either: foo <- getFoo 23 print $ bool foo or b <- getFooBool 23 print b The former is what you would more naturally write, but would require constructing the whole Foo from the database (including reading an arbitrary number of Ints). The latter is thus more efficient with the database backend, but emphasises that you aren't working with regular Haskell datastructures. This is even more notable with the Cabal types (like PackageDescription) as the types and various utility functions already exist - although it's currently somewhat moot as the current acid-state backend doesn't keep the Cabal datastructures in memory anyway. The other issue raised is performance. I'd want to see (full-size) benchmarks before commenting on that. Has anyone else got any thoughts? On a related note, I think it would be a little nicer to store blobs as e.g. 54/54fb24083b14b5916df11f1ffcd03b26/foo-1.0.tar.gz rather than 54/54fb24083b14b5916df11f1ffcd03b26 I don't think that this breaks anything, so it should be noncontentious. Thanks Ian

On Thu, Sep 6, 2012 at 11:49 AM, Ian Lynagh wrote:
> Someone pointed out that one disadvantage of traditional databases is that they discourage you from writing as if everything were Haskell data structures in memory. For example, if you have things of type
>
>     data Foo = Foo { str :: String, bool :: Bool, ints :: [Int] }
>
> stored in a database, then you could write either
>
>     foo <- getFoo 23
>     print $ bool foo
>
> or
>
>     b <- getFooBool 23
>     print b
Using Bryan's mysql-simple library makes mapping between Haskell data types and SQL records quite straightforward (you write one type class instance).
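A minimal sketch of that one instance for Ian's Foo type (dropping the list field), assuming mysql-simple's QueryResults class and a hypothetical foo table; not anything Hackage actually has:

    {-# LANGUAGE OverloadedStrings #-}
    import Database.MySQL.Simple
    import Database.MySQL.Simple.QueryResults (QueryResults (..), convertError)
    import Database.MySQL.Simple.Result (convert)

    data Foo = Foo { str :: String, bool :: Bool }

    -- The one instance: how a two-column result row maps onto Foo.
    instance QueryResults Foo where
      convertResults [fs, fb] [vs, vb] = Foo (convert fs vs) (convert fb vb)
      convertResults fs vs             = convertError fs vs 2

    getFoos :: Connection -> Int -> IO [Foo]
    getFoos conn k = query conn "SELECT str, bool FROM foo WHERE id = ?" (Only k)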
> Has anyone else got any thoughts?
I've argued in the past that we should use a SQL database, for mostly the same reasons as you gave above, with the addition that I don't trust anything but old and battle-tested technology with data as important as the Hackage data.

-- Johan

Hi Ian,
We used acid-state (actually happstack-state) at Silk for our session
store. We had the same problems you describe: slow shutdown/startup,
high memory usage, unable to inspect the data. We recently switched to
an SQL database. Just another data point.
Erik

On 6 September 2012 19:49, Ian Lynagh wrote:
> Hi all,
>
> I've had a bit of experience with Hackage 2 and acid-state now, and I'm not convinced that it's the best fit for us:
>
> * It's slow. It takes about 5 minutes for me to stop and then start the server. It's actually surprising just how slow it is, so it might be possible/easy to get this down to seconds, but it still won't be instantaneous.
Yes, it probably is slower than necessary there. It should be possible to make it as fast as reading the data from disk. For near-instantaneous server code upgrades we would need a feature of happs-state that currently isn't implemented in acid-state (afaik): happs-state allowed clustering, so you could start a new server process, let it sync up with the state from the existing process, and then kill the old process.
> * Memory usage is high. It's currently in the 700M-1G range, and to get it that low I had to stop the parsed .cabal files from being held in memory (which presumably has an impact on performance, although I don't know how significant that is), and disable the reverse dependencies feature. It will grow at least linearly with the number of package/versions in Hackage.
I think this is solvable. The most costly thing is the package metadata. We can use more compact representations (the Cabal PackageDescription is pretty bad in this respect), and secondly we can do a lot of sharing within the package index. Both of these changes could be done just in the Cabal library without significantly affecting things. The sharing would mean memory use still grows linearly, but much more slowly.

That said, I don't think 1GB should be considered high. If we want good performance then we want all commonly used data in memory anyway, and 1GB for a central community server is not at all unreasonable.
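The sharing here amounts to interning: keep one copy of each repeated value (package names, module names, version strings and so on) and let every index entry point at it. A minimal sketch of the idea, not Cabal's actual API:

    import qualified Data.Map.Strict as Map
    import Data.Map.Strict (Map)

    -- Return a previously seen equal value if there is one, so that
    -- duplicates end up sharing a single heap object.
    intern :: Ord a => a -> Map a a -> (a, Map a a)
    intern x seen = case Map.lookup x seen of
      Just shared -> (shared, seen)
      Nothing     -> (x, Map.insert x x seen)

    -- Intern every element of a list, left to right.
    internAll :: Ord a => [a] -> [a]
    internAll = go Map.empty
      where
        go _    []     = []
        go seen (x:xs) = let (x', seen') = intern x seen
                         in x' : go seen' xs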
> * Only a single process can use the database at once. For example, if the admins want a tool that will make it easier for them to approve user requests, then that tool needs to be integrated into the Hackage server (or talk to it over HTTP), rather than being standalone.
>
> * The database is relatively opaque. While in principle tools could be written for browsing, modifying or querying it, currently none exist (as far as I know).
On both these points, I would argue that we should simply make all the data available via an HTTP REST interface. Instead of making the data available only locally to someone with direct access to the database, it should be there in machine-readable form so that everyone can get it and experiment. In the approving-user-requests example, that just needs an HTTP PUT/POST; it doesn't need any web form, and it's totally scriptable with wget/curl etc.
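As a sketch of how light that is from Haskell too (using the HTTP package; the URL here is made up, not a real Hackage endpoint):

    import Network.HTTP (getRequest, getResponseBody, simpleHTTP)

    main :: IO ()
    main = do
      -- GET one resource in machine-readable form; the PUT/POST admin
      -- actions would follow the same pattern, plus authentication.
      body <- getResponseBody =<< simpleHTTP
                (getRequest "http://localhost:8080/users/requests")
      putStrLn body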
> * The above 2 points mean that, for example, there was no easy way for me to find out how many packages use each top-level module hierarchy (Data, Control, etc). This would have been a simple SQL query if the data had been in a traditional database, but as it was I had to write a Haskell program to process all the package .tar.gz's and parse the .cabal files manually.
As I mentioned on IRC, the problem here is really that cabal-install isn't currently available as a library. If it were, then loading up the 00-index.tar file with all the .cabal files would be just a couple of lines of code, and your query just a list comprehension. As another example, we should be able to make all the package data available in other formats, e.g. JSON, and that provides loads of opportunity for other people to do ad-hoc queries.
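Something like this sketch, using just the tar and Cabal libraries; it is written against the Cabal API of the time (the parsing entry point has since moved), so treat the exact function names as approximate:

    import qualified Codec.Archive.Tar as Tar
    import qualified Data.ByteString.Lazy as BL
    import qualified Data.ByteString.Lazy.Char8 as BL8
    import qualified Data.Map as Map
    import Data.List (isSuffixOf, nub)
    import Distribution.ModuleName (components)
    import Distribution.PackageDescription
    import Distribution.PackageDescription.Parse (ParseResult (..), parsePackageDescription)

    -- Count how many packages expose modules under each top-level
    -- hierarchy (Data, Control, ...), straight from 00-index.tar.
    topLevelCounts :: FilePath -> IO (Map.Map String Int)
    topLevelCounts indexTar = do
      entries <- fmap Tar.read (BL.readFile indexTar)
      let cabals = Tar.foldEntries keep [] (const []) entries
          keep e acc = case Tar.entryContent e of
            Tar.NormalFile bs _ | ".cabal" `isSuffixOf` Tar.entryPath e
                        -> BL8.unpack bs : acc
            _           -> acc
          -- The query itself is just a list comprehension:
          tops = [ t | src <- cabals
                     , ParseOk _ gpd <- [parsePackageDescription src]
                     , Just lib <- [condLibrary gpd]
                     , t <- nub [ c | m <- exposedModules (condTreeData lib)
                                    , c:_ <- [components m] ] ]
      return (Map.fromListWith (+) [ (t, 1) | t <- tops ])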
> * acid-state forces us to use a server-process model, rather than having processes for individual requests run by apache. I don't know if we would have made this choice anyway, so this may or may not be an issue. But the current model does mean that adding a feature or fixing a bug means restarting the process, rather than just installing the new program in-place.
True. See above about quick restarts.
> Someone pointed out that one disadvantage of traditional databases is that they discourage you from writing as if everything were Haskell data structures in memory.
Right.
> This is even more notable with the Cabal types (like PackageDescription), as the types and various utility functions already exist - although it's currently somewhat moot, as the current acid-state backend doesn't keep the Cabal data structures in memory anyway.
As I've said, I think that's fixable.
> The other issue raised is performance. I'd want to see (full-size) benchmarks before commenting on that.
There are basically two approaches you can take here, scaling up or scaling out: that is, making a single server handle lots and lots of requests, or having lots of machines. It is much simpler (both in terms of code and infrastructure) to have a single server. The performance here can still scale a long way if we can keep all the data that requests need in memory. Additionally, we can scale read requests a lot further using caching proxies, without adding a great deal of complexity.
> Has anyone else got any thoughts?
Easy deployment was also a goal. We want anyone to be able to deploy a hackage server instance, not just a central community one. Setting up an external database (mysql, postgres) makes that a lot harder, and an in-process database like sqlite would be a lot slower than an in-memory one. Currently it is rather easy to deploy: it's more or less just cabal install and go.

Johan mentions the issue of data formats. It's a very valid point, and it is why we designed the server to do full backups in standard external formats (CSV etc.). We do not and should not rely on the acid-state binary format, but as long as we do actually use the dump feature I am not at all concerned.

Duncan
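Not the server's actual dump code, but the shape of the idea is small; a sketch using the cassava library, with an invented record layout:

    import qualified Data.ByteString.Lazy as BL
    import Data.Csv (encode)

    -- Dump a (hypothetical) user table as CSV; any CSV-aware tool can
    -- then inspect or re-import the backup, independently of acid-state.
    dumpUsers :: [(Int, String, String)] -> IO ()
    dumpUsers users = BL.writeFile "users.csv" (encode users)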

On 6 September 2012 21:06, Duncan Coutts wrote:
> On 6 September 2012 19:49, Ian Lynagh wrote:
>> * Only a single process can use the database at once. For example, if the admins want a tool that will make it easier for them to approve user requests, then that tool needs to be integrated into the Hackage server (or talk to it over HTTP), rather than being standalone.
>>
>> * The database is relatively opaque. While in principle tools could be written for browsing, modifying or querying it, currently none exist (as far as I know).
>
> On both these points, I would argue that we should simply make all the data available via an HTTP REST interface. Instead of making the data available only locally to someone with direct access to the database, it should be there in machine-readable form so that everyone can get it and experiment.
>
> In the approving-user-requests example, that just needs an HTTP PUT/POST; it doesn't need any web form, and it's totally scriptable with wget/curl etc.
I should probably say this more clearly...

The original design of the new server was that it should primarily be a web database (REST and all that), with a user-facing website tacked on. That is, whatever extra fancy features it has, all of the data it stores should be accessible and modifiable via HTTP (i.e. GET/PUT/POST) in at least one machine-readable format. This alone makes the system extensible, and extensible by third parties.

For the most part the current implementation does indeed make all the data available via standard HTTP methods. We can make that better by adding more formats (especially JSON), by making sure we consistently make everything available, and by making it discoverable that everything is indeed available. The last should be possible via, e.g., an automatically generated site map: we have URL templates for all the resources that are available, and we can simply collate those automatically and publish them on a generated page.

So if we've done it properly, there should be no need for direct access to the database. And as I said before, doing it properly has the great advantage that all the data is available to everyone, to do cool things we've not thought of yet.

Duncan

On Thu, Sep 6, 2012 at 2:49 PM, Ian Lynagh wrote:
> * Only a single process can use the database at once. For example, if the admins want a tool that will make it easier for them to approve user requests, then that tool needs to be integrated into the Hackage server (or talk to it over HTTP), rather than being standalone.
This much is wrong; acid-state does have functionality for accessing/modifying the database from other processes, namely Data.Acid.Remote, but I haven't investigated it deeply.

That said, I think there would be some potential advantages to moving to SQL. My personal preference would be for PostgreSQL, particularly because of the wealth of datatypes it provides or that can be added via existing third-party plugins. Compared to MySQL, it's definitely less fluff, more stuff.

For example, the schema I sketched out for the account request processing uses hstore, an extension (included with the standard postgres distribution, even) that provides a field of type (Map String String), plus indexes on that type; I use it to store both public profile information and the HTTP headers the client sent with the request. Or we might store module names using the ltree extension, which is a perfect match for the problem: PostgreSQL would then understand the components of the module name, be able to index them, and provide a variety of additional operators to query them.

http://www.postgresql.org/docs/9.1/static/hstore.html
http://www.postgresql.org/docs/9.1/static/ltree.html

Also, the rule and trigger systems provided by PostgreSQL are considerably more capable than anything in MySQL, allowing you to do things such as create updatable views (with some work). I also find PL/pgSQL and asynchronous notifications particularly useful, and while I don't know MySQL well enough to say definitively that it has no answer to these features, there isn't anything obvious that is as simple and powerful as asynchronous notifications nor as fully featured as PL/pgSQL. (And there are a ton of other PL/* languages out there; somebody even has a very rudimentary start on PL/Haskell.)

One of the most common use cases for asynchronous notifications is responding efficiently and robustly to changes made to a database. (Robust in the sense that the client making the change doesn't need to know that other processes want to hear about it; a trigger NOTIFYs the interested clients instead.)

Best,
Leon
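Listening for those notifications from Haskell is short with postgresql-simple; a sketch in which the channel name and connection string are made up:

    {-# LANGUAGE OverloadedStrings #-}
    import Control.Monad (forever)
    import qualified Data.ByteString.Char8 as B
    import Database.PostgreSQL.Simple
    import Database.PostgreSQL.Simple.Notification
      (getNotification, notificationChannel, notificationData)

    main :: IO ()
    main = do
      conn <- connectPostgreSQL "dbname=hackage"
      _ <- execute_ conn "LISTEN package_uploads"  -- a trigger would NOTIFY this channel
      forever $ do
        n <- getNotification conn                  -- blocks until a NOTIFY arrives
        B.putStrLn (notificationChannel n <> ": " <> notificationData n)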

On Thu, Sep 06, 2012 at 04:36:07PM -0400, Leon Smith wrote:
> That said, I think there would be some potential advantages to moving to SQL. My personal preference would be for PostgreSQL, particularly because of the wealth of datatypes it provides or that can be added via existing third-party plugins. Compared to MySQL, it's definitely less fluff, more stuff.
I think it would be best to use lowest-common-denominator SQL, so that one can set up a Hackage 2 instance with postgres, mysql, sqlite, or another DB backend.

Thanks,
Ian

On Fri, Sep 7, 2012 at 8:48 AM, Ian Lynagh wrote:
> I think it would be best to use lowest-common-denominator SQL, so that one can set up a Hackage 2 instance with postgres, mysql, sqlite, or another DB backend.
While I think that's a laudable goal in theory, unfortunately the only database-independent client interface Haskell has at the moment is HDBC, which is rather painful to use compared to mysql-simple, pgsql-simple, postgresql-simple, and sqlite-simple. And unfortunately there are also some incompatibilities between the *-simple packages, though perhaps one could, with great care, write code that works on any of them by changing which module is imported.

The other issue is that least-common-denominator SQL is also pretty restrictive: in a great many of my own projects I've found database-specific functionality compelling, often letting me implement things much more efficiently than otherwise, in programming time, runtime, or both.

Best,
Leon
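For reference, the HDBC version of a backend-independent query looks like this; only the connect line is database-specific (the table and column are invented):

    import Database.HDBC
    import Database.HDBC.Sqlite3 (connectSqlite3)  -- swap in HDBC-postgresql etc. here

    main :: IO ()
    main = do
      conn <- connectSqlite3 "hackage.db"
      -- Plain SQL; the same call works against any IConnection backend.
      rows <- quickQuery' conn "SELECT name FROM packages ORDER BY name" []
      mapM_ (putStrLn . fromSql . head) rows
      disconnect conn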
participants (5):
- Duncan Coutts
- Erik Hesselink
- Ian Lynagh
- Johan Tibell
- Leon Smith