A distributed and replicating native Haskell database

Folks,

Allegro Common Lisp has AllegroCache [1], a database built on B-Trees that lets one store Lisp objects of any type. You can designate certain slots (object fields) as keys and use them for lookup. ACL used to come bundled with the ObjectStore OODBMS for the same purpose but then adopted a native solution.

AllegroCache is not distributed or replicating but supports automatic versioning. You can redefine a class and new code will store more (or less) data in the database while code that uses the old schema will merrily chug along.

Erlang [2] has Mnesia [3], which lets you store any Erlang term ("object"). It stores records (tuples, actually) and you can also designate key fields and use them for lookup. I haven't looked into this deeply, but Mnesia is built on top of DETS (Disk-based Term Storage), which most likely also uses a form of B-Trees. Mnesia is distributed and replicated in real-time. There's no automatic versioning with Mnesia, but user code can be run to read old records and write new ones.

Would it make sense to build a similar type of database for Haskell? I can immediately see how versioning would be much harder, as Haskell is statically typed. I would love to extend recent gains in binary serialization, though, to add indexing of records based on a designated key, distribution, and real-time replication.

What do you think?

To stimulate discussion, I would like to ask a couple of pointed questions:

- How would you "designate" a key for a Haskell data structure?
- Is the concept of a schema applicable to Haskell?

Thanks, Joel

[1] http://franz.com/products/allegrocache/index.lhtml
[2] http://erlang.org/faq/t1.html
[3] http://erlang.org/faq/x1409.html

--
http://wagerlabs.com/

Joel Reymont wrote:
Folks,
Allegro Common Lisp has AllegroCache [1], a database built on B-Trees that lets one store Lisp objects of any type. You can designate certain slots (object fields) as key and use them for lookup. ACL used to come bundled with the ObjectStore OODBMS for the same purpose but then adopted a native solution.
AllegroCache is not distributed or replicating but supports automatic versioning. You can redefine a class and new code will store more (or less) data in the database while code that uses the old schema will merrily chug along.
That implies being able to put persistent code into the database. Easy enough in Lisp, less easy in Haskell. How do you serialize it? As a rule, storing functions along with data is a can of worms. Either you actually store the code as a BLOB or you store a pointer to the function in memory. Either way you run into problems when you upgrade your software and expect the stored functions to work in the new context.
Erlang [2] has Mnesia [3] which lets you store any Erlang term ("object"). It stores records (tuples, actually) and you can also designate key fields and use them for lookup. I haven't looked into this deeply but Mnesia is built on top of DETS (Disk-based Term Storage) which most likely also uses a form of B-Trees.
Erlang also has a very disciplined approach to code updates, which presumably helps a lot when functions are stored.
Mnesia is distributed and replicated in real-time. There's no automatic versioning with Mnesia but user code can be run to read old records and write new ones.
Would it make sense to build a similar type of database for Haskell? I can immediately see how versioning would be much harder as Haskell is statically typed. I would love to extend recent gains in binary serialization, though, to add indexing of records based on a designated key, distribution and real-time replication.
I very much admire Mnesia, even though I'm not an Erlang programmer. It would indeed be really cool to have something like that. But Mnesia is built on the Erlang OTP middleware. I would suggest that Haskell needs a middleware with the same sort of capabilities first. Then we can build a database on top of it.
What do you think?
To stimulate discussion I would like to ask a couple of pointed questions:
- How would you "designate" a key for a Haskell data structure? I haven't tried compiling it, but something like:
class (Ord k) => DataKey a k | a -> k where
    keyValue :: a -> k
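Filling in Paul's sketch, something like this does compile with the MultiParamTypeClasses and FunctionalDependencies extensions. The Person type and the Map-based in-memory "table" below are purely illustrative assumptions of mine, not part of AllegroCache, Mnesia, or any existing Haskell library:

```haskell
{-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies #-}
import qualified Data.Map as Map

-- A type 'a' has a designated key of type 'k', determined by 'a'
-- (the functional dependency a -> k means each record type has
-- exactly one key type).
class Ord k => DataKey a k | a -> k where
    keyValue :: a -> k

-- Hypothetical record to be stored.
data Person = Person { name :: String, age :: Int }
    deriving Show

-- Designate the name field as the key.
instance DataKey Person String where
    keyValue = name

-- An in-memory stand-in for a table: an index from keys to records.
insertRecord :: DataKey a k => a -> Map.Map k a -> Map.Map k a
insertRecord x = Map.insert (keyValue x) x

main :: IO ()
main = do
    let table = insertRecord (Person "joel" 40)
              $ insertRecord (Person "paul" 35) Map.empty
    print (Map.lookup "joel" table)
```

The functional dependency is what makes "designating" a key feel declarative: writing the instance once fixes the key type and extractor for every lookup against that record type.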
- Is the concept of a schema applicable to Haskell? The real headache is type safety. Erlang is entirely dynamically typed, so untyped schemas with column values looked up by name at run-time fit right in, and it's up to the programmer to manage schema and code evolution to prevent errors. Doing all this in a statically type-safe way is another layer of complexity and checking.
Actually this is also just another special case of the middleware case. If we have two processes, A and B, that need to communicate then they need to agree on a protocol. Part of that protocol is the data types. If B is a database then this reduces to the schema problem. So let's look at the more general problem first and see if we can solve that.

There are roughly two ways for A and B to agree on the protocol. One is to implement the protocol separately in A and B. If it is done correctly then they will work together. But this is not statically checkable (ignoring state machines and model checking for now). This is the Erlang approach, because dynamic checking is the Erlang philosophy.

Alternatively the protocol can be defined in a special-purpose protocol module P, and A and B then import P. This is the approach taken by CORBA with IDL. However, what happens if P is updated to P'? Does this mean that both A and B need to be recompiled and restarted simultaneously? Requiring this is a Bad Thing; imagine if every bank in the world had to upgrade and restart its computers simultaneously in order to upgrade a common protocol. (This protocol versioning problem was one of the major headaches with CORBA.)

We would have to have P and P' live simultaneously, and have processes negotiate the latest version of the protocol that they both support when they start talking. That way the introduction of P' does not need to be simultaneous with the withdrawal of P. There is still the possibility of a run-time failure at the protocol negotiation stage, of course, if it transpires that the two processes have no common protocol.

So we need a DSL which allows the definition of data types and abstract protocols (i.e. who sends what to whom when) that can be imported by the two processes (do we need N-way protocols?) on each end of the link. If we could embed this in Haskell directly then so much the better, but something that needs preprocessing would be fine too.
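The negotiation step described above can be sketched very simply, assuming each side merely advertises the protocol versions it supports and both pick the newest one in common. This is an illustrative fragment, not part of any existing Haskell middleware:

```haskell
import Data.List (intersect)

type Version = Int

-- Pick the newest protocol version both sides support, or fail at
-- negotiation time -- the run-time failure mentioned above.
negotiate :: [Version] -> [Version] -> Maybe Version
negotiate mine theirs =
    case mine `intersect` theirs of
        []     -> Nothing          -- no common protocol
        common -> Just (maximum common)

main :: IO ()
main = do
    print (negotiate [1, 2, 3] [2, 3, 4])  -- Just 3: both speak P and P'
    print (negotiate [1] [2])              -- Nothing: negotiation fails
```

The point of the sketch is that introducing P' only requires adding a version to one side's list; withdrawing P later is a separate, independent step.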
However there is a wrinkle here: what about "pass through" processes which don't interpret the data but just store and forward it. Various forms of protocol adapter fit this scenario, as does the database you originally asked about. We want to be able to have these things talk in a type-safe manner without needing to be compiled with every data structure they transmit. You could describe these things using type variables, so that for instance if a database table is created to store a datatype D then any process reading or writing the data must also use D, even though the database itself knows nothing more of D than the name. Similarly a gateway that sets up a channel for datatype D would not need to know anything more than the name. Paul.
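The "knows nothing of D but the name" idea can be sketched with a phantom type parameter: the handle is tagged with the datatype it stores, but the store itself only ever sees a name and a serialized blob. Everything here is illustrative (Show/Read stand in for a real binary encoding; `Table` is not a real library type):

```haskell
-- A table handle tagged with the datatype it stores. The string
-- name is all the database or gateway ever knows about D.
newtype Table a = Table String

data Person = Person String Int
    deriving (Show, Read, Eq)

-- Writing goes through serialization; the phantom parameter ensures
-- a 'Table Person' can only be used with Person values.
store :: Show a => Table a -> a -> (String, String)
store (Table name) x = (name, show x)

-- Reading recovers the value; the type comes from the handle, so the
-- pass-through layer never needs to be compiled against Person.
fetch :: Read a => Table a -> String -> a
fetch _ = read

main :: IO ()
main = do
    let people = Table "people" :: Table Person
        (tbl, blob) = store people (Person "joel" 40)
    putStrLn (tbl ++ ": " ++ blob)
    print (fetch people blob)
```

The type variable does exactly the job described: the writer and reader must agree on D, while the component in the middle handles only opaque names and blobs.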

On Feb 2, 2007, at 3:06 PM, Paul Johnson wrote:
As a rule, storing functions along with data is a can of worms. Either you actually store the code as a BLOB or you store a pointer to the function in memory. Either way you run into problems when you upgrade your software and expect the stored functions to work in the new context.
ACache does not store code in the database. You cannot read the database unless you have your original class code. ACache may store the "schema", i.e. the parent class names, slot names, etc.
Erlang also has a very disciplined approach to code updates, which presumably helps a lot when functions are stored.
No storing of code here either. What you store in Erlang is just tuples, so there's no schema or class definition. No functions are stored, since any Erlang code can fetch the tuples from Mnesia. You do need to have the original record definition around, but this is just to be able to refer to tuple elements by field name rather than by field position.
I very much admire Mnesia, even though I'm not an Erlang programmer. It would indeed be really cool to have something like that. But Mnesia is built on the Erlang OTP middleware. I would suggest that Haskell needs a middleware with the same sort of capabilities first. Then we can build a database on top of it.
Right. That would be a prerequisite.
The real headache is type safety. Erlang is entirely dynamically typed, so untyped schemas with column values looked up by name at run-time fit right in, and it's up to the programmer to manage schema and code evolution to prevent errors. Doing all this in a statically type-safe way is another layer of complexity and checking.
I believe Lambdabot does schema evolution.
Alternatively the protocol can be defined in a special purpose protocol module P, and A and B then import P. This is the approach taken by CORBA with IDL. However what happens if P is updated to P'? Does this mean that both A and B need to be recompiled and restarted simultaneously? Requiring this is a Bad Thing; imagine if every bank in the world had to upgrade and restart its computers simultaneously in order to upgrade a common protocol.
I would go for the middle ground and dump the issue entirely. Let's be practical here. When a binary protocol is updated, all code using the protocol needs to be updated. This would be good enough. It would suit me just fine too, as I'm not yearning for CORBA; I just want to build a trading infrastructure entirely in Haskell.
There is still the possibility of a run-time failure at the protocol negotiation stage of course, if it transpires that the two processes have no common protocol.
So no protocol negotiation!
However there is a wrinkle here: what about "pass through" processes which don't interpret the data but just store and forward it. Various forms of protocol adapter fit this scenario, as does the database you originally asked about.
Any packet traveling over the wire would need to have a size, followed by a body. Any pass-through protocol can just take the binary blob and re-send it.

Thanks, Joel

--
http://wagerlabs.com/
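Joel's size-then-body framing can be sketched over plain byte lists; a real implementation would use bytestring or Data.Binary with a fixed-width big-endian length, so the one-byte prefix and helper names here are illustrative assumptions:

```haskell
import Data.Word (Word8)

-- Frame a payload as a one-byte length followed by the body.
frame :: [Word8] -> [Word8]
frame body = fromIntegral (length body) : body

-- A pass-through process splits off one complete frame and can
-- re-send it untouched, without ever interpreting the body.
unframe :: [Word8] -> ([Word8], [Word8])
unframe (n : rest) = splitAt (fromIntegral n) rest
unframe []         = ([], [])

main :: IO ()
main = print (unframe (frame [1, 2, 3] ++ frame [4, 5]))
-- prints ([1,2,3],[2,4,5]): first body split off, next frame left intact
```

This is what makes the store-and-forward case easy: the gateway only needs the length prefix, never the payload's type.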

On 2/2/07, Joel Reymont
Would it make sense to build a similar type of database for Haskell? What do you think?
I used Mnesia in a little Erlang project I did. It did nearly everything I wanted it to (and stuff I didn't need, since it does replication and my project didn't require it). The one item I didn't care too much for is that your data integrity is enforced by your application. There didn't appear to be constructs that let Mnesia protect against data corruption. Admittedly, I am inexperienced with Mnesia. There may be a way to add constraints and I just missed finding it in the documentation.

I prefer using HSQL with PostgreSQL. PostgreSQL meets ACID requirements. It supports foreign key constraints, index constraints, and column constraints. It has stored procedures and triggers. And on and on. When HSQL says a transaction has been committed, I *know* the data is safe.

--
Rich
AIM : rnezzy
ICQ : 174908475
participants (3):

- Joel Reymont
- Paul Johnson
- Rich Neswold