
Cafe,

HaskellDB takes a database schema and produces Haskell data structures (plus some other query-related machinery for its EDSL query language).

What I'm looking for is the inverse of this functionality: I want to create tables from a Haskell data structure, following a few simple rules. For example: if a field is not of the form `Maybe a' then it cannot be nullable in the database; if a field is not a primitive (in the database) then it is stored in another table and a reference id is stored in this one; tables are produced recursively unless they already exist; and so on.

The HaskellDB approach is great for interfacing with existing tables, but in my case I already have the data structures and would like a quick way to create tables to persist them.

Does such a thing exist? If not, would you find it useful? I may take this up as a side project if it does not already exist and others would find it useful.

Thanks,
--Jonathan
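For concreteness, here is a hypothetical record and the tables those rules would produce (the names and DDL below are illustrative, not from any existing library):

data Address = Address
  { street :: String
  , city   :: String
  }

data Person = Person
  { name     :: String          -- not Maybe: NOT NULL column
  , nickname :: Maybe String    -- Maybe: nullable column
  , address  :: Address         -- non-primitive: lives in its own table,
                                -- referenced here by id
  }

-- Expected DDL, roughly:
--   CREATE TABLE address (id INTEGER PRIMARY KEY,
--                         street TEXT NOT NULL, city TEXT NOT NULL);
--   CREATE TABLE person  (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
--                         nickname TEXT, address_id INTEGER REFERENCES address);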

That sounds pretty awesome to me.
Have you given any thought as to how you want to approach versioning?
Maybe I'm asking a silly question - I have very little real-world experience
with relational databases and how to version schemas.
Antoine

Versioning is a tricky problem regardless of how you are creating
tables, and it isn't the one I was aiming to tackle. The problem I
have in mind is narrower than that: I have a record and now I need a
table to stick it in.
By the way, how does HaskellDB handle versioning?
--Jonathan

Hey Jonathan,

I've done some work on this. The hard part is defining relationships between datatypes: how do you model this in Haskell? I have some code on github that you might be interested in: http://github.com/chriseidhof/persist

-chris

How do you define relationships between data types?
Well, why is it any different from other fields? From one of your examples
[1], I'd expect you to have a list of questions in the Quiz data type, and
if necessary, a quiz field in the Question data type. This might be a bit
tricky but certainly achievable [2].
Something like the following:
data Quiz = Quiz
  { description :: String
  , subject     :: String
  , questions   :: [Question]
  } deriving (Show, Read)

data Question = Question
  { title   :: String
  , choiceA :: String
  , choiceB :: String
  , choiceC :: String
  , quiz    :: Quiz
  } deriving (Show, Read)
[1] http://github.com/chriseidhof/persist/blob/master/examples/Model.phs
[2] http://www.haskell.org/haskellwiki/Tying_the_Knot
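For illustration, one way to "tie the knot" [2] by hand with these types (the values are made up):

aQuiz :: Quiz
aQuiz = Quiz
  { description = "Geography"
  , subject     = "Capitals"
  , questions   = [q1]
  }
  where
    q1 = Question
      { title   = "Capital of France?"
      , choiceA = "Paris"
      , choiceB = "Lyon"
      , choiceC = "Nice"
      , quiz    = aQuiz   -- back-reference makes the structure cyclic
      }

(Note that show aQuiz would never terminate, since the value is cyclic.)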
-- Ozgur Akgun

On 28 sep 2010, at 17:33, Ozgur Akgun wrote:
How do you define relationships between data types?
Well, why is it any different from other fields? From one of your examples [1], I'd expect you to have a list of questions in the Quiz data type, and if necessary, a quiz field in the Question data type. This might be a bit tricky but certainly achievable [2].
This is really tricky. For example, consider storing a large tree in the database:
data Tree = Node Int Tree Tree | Leaf Int
This means you need to read the entire tree from the database. Or consider cyclic data structures (such as the example you gave): how do you store those? The only way to inspect such structures is with a library like data-reify [1]. I think the problem might be a bit harder than you suspect.

Another way to solve it is using Sebastiaan Visser's framework, described in his paper [2], but that's also rather complicated.

-chris

[1]: http://hackage.haskell.org/package/data-reify
[2]: http://github.com/downloads/sebastiaanvisser/msc-thesis/wgp10-genstorage.pdf
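data-reify [1] is a real library; as a sketch (not Chris's code), inspecting the Tree type with it might look like this:

{-# LANGUAGE TypeFamilies #-}
import Control.Applicative ((<$>), (<*>), pure)
import Data.Reify

data Tree = Node Int Tree Tree | Leaf Int

-- Pattern functor: Tree with the recursive positions abstracted out.
data TreeF r = NodeF Int r r | LeafF Int deriving Show

instance MuRef Tree where
  type DeRef Tree = TreeF
  mapDeRef f (Node n l r) = NodeF n <$> f l <*> f r
  mapDeRef _ (Leaf n)     = pure (LeafF n)

main :: IO ()
main = do
  let t = Node 1 t (Leaf 2)        -- cyclic: t is its own left child
  Graph nodes root <- reifyGraph t
  print (root, nodes)              -- e.g. (1,[(2,LeafF 2),(1,NodeF 1 1 2)])

reifyGraph observes the sharing in IO and returns a finite graph of numbered nodes, which is exactly the shape you would need in order to store such a structure as rows.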

OK, I am rephrasing it a bit then :)

I definitely don't think this would be trivial to implement. However, I'd
expect a decent solution to this problem not to have special combinators
for describing relations between data types, but to let the user model
their data using plain Haskell data types, and infer the associated table
structure just by looking at those types.

I'll give this more thought once I find the time. There is the huge
barrier of TH stopping me from playing with things like this.

Anyway, have fun! :)
-- Ozgur Akgun

I think this approach is not possible without involving some fairly
ugly unsafeInterleaveIO/unsafePerformIO calls. A simple example from
web programming: suppose I have a multi-user blog site, where each
user can have multiple entries. I would model this using standard
Haskell datatypes as:
data Entry = Entry { title :: String, content :: String }
data Blogger = Blogger { name :: String, entries :: [Entry] }
Obviously we'll need some kind of blogger loading function:
getBloggerByName :: String -> IO Blogger
Either this will load up all entries (a potentially incredibly costly
operation) or use unsafe IO down the road. Especially when using
database connections, this can be incredibly bad: the connection could
be closed, the SQL statement could be reused by another request, etc.
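For concreteness, here is a sketch of the unsafe variant being warned about; Connection and queryEntries are hypothetical stand-ins, not from any particular library:

import System.IO.Unsafe (unsafeInterleaveIO)

data Entry = Entry { title :: String, content :: String }
data Blogger = Blogger { name :: String, entries :: [Entry] }

data Connection = Connection                 -- hypothetical DB handle

queryEntries :: Connection -> String -> IO [Entry]
queryEntries = undefined                     -- stand-in for a real SQL query

getBloggerByName :: Connection -> String -> IO Blogger
getBloggerByName conn n = do
  -- The query does not run here: it runs whenever `entries` is first
  -- forced, possibly long after `conn` has been closed.
  es <- unsafeInterleaveIO (queryEntries conn n)
  return (Blogger n es)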
My persistent library follows a similar approach to what Chris
describes, though with a very different syntax. You can see a very
similar example on the documentation site[1].
Michael
[1] http://docs.yesodweb.com/book/persistent/#relations

It may be possible to tag those data fields that are not to be loaded on the spot. For example,
data Entry = Entry { title :: String, content :: String }

data Blogger db = Blogger { name :: String, entries :: OnDB db [Entry] }

class Monad db => Database db where
  data OnDB db :: * -> *
  fetch     :: OnDB db a -> db a
  fetchSome :: Criteria a -> OnDB db [a] -> db [a]

newtype InMemory a = InMemory a

instance Database InMemory where
  newtype OnDB InMemory a = OnDBMem a
  fetch (OnDBMem x) = return x
  fetchSome = ...

instance Database SQL where ...
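A possible use of the sketch above, fetching the deferred entries inside the database monad (showTitles is hypothetical):

showTitles :: Database db => Blogger db -> db [String]
showTitles b = do
  es <- fetch (entries b)
  return (map title es)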
Cheers, -- Felipe.

While we're on the topic of databases, I really wanted to try out querying/inserting from/to my database with records like this: http://hpaste.org/40240/db_library_approach

Define a record:

data Person f = Person
  { pid        :: f Integer
  , firstName  :: f String
  , middleName :: f (Maybe String)
  , lastName   :: f String
  , age        :: f Integer
  } deriving (Typeable, Data)

Then I'd query it like:

personById :: Integer -> Query Person
personById i = Person
  { pid        = constant i
  , firstName  = anything
  , middleName = anything
  , lastName   = anything
  , age        = anything
  }

Or, with a Data.Default instance:

personById :: Integer -> Query Person
personById i = def { pid = constant i }

But I have yet to figure out how to derive a Typeable instance for such a type. I don't want to write any instances of anything. Technically I can get the field names and values using a Data.Data.Data instance, but I don't know; maybe I should make a TypeablePolymorphicKinds class or something and try to derive for it. UHC's generic deriving would probably be good for something like this, but I want GHC.
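As an aside on the `Person f` trick: the same record can be read two ways by instantiating f differently. A minimal illustration, using Identity for concrete rows and Maybe as a stand-in for the constant/anything query combinators above:

import Data.Functor.Identity (Identity(..))

data Person f = Person
  { pid       :: f Integer
  , firstName :: f String
  }

-- A concrete row: every field is present.
john :: Person Identity
john = Person (Identity 1) (Identity "John")

-- A query: Just pins a field, Nothing matches anything.
personById :: Integer -> Person Maybe
personById i = Person (Just i) Nothing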

On Thu, Sep 30, 2010 at 4:35 PM, Felipe Lessa wrote:
I wasn't claiming my approach was the *only* approach, just stating that it doesn't seem feasible to use the "simple" Haskell data type declarations. Michael

On Wed, Sep 29, 2010 at 5:21 AM, Michael Snoyman wrote:

  getBloggerByName :: String -> IO Blogger
That is pretty close to how it would look using happstack-state. Here is a complete, runnable example which defines the types, a query, creates/initializes the database, performs the query, and prints the results.
{-# LANGUAGE DeriveDataTypeable, FlexibleContexts, MultiParamTypeClasses,
             TemplateHaskell, TypeSynonymInstances, TypeFamilies #-}
module Main where
import Control.Exception (bracket)
import Control.Monad.Reader (ask)
import Data.Data
import Happstack.Data
import Happstack.Data.IxSet
import Happstack.State
A simple type to identify a particular blogger:
newtype Blogger = Blogger { name :: String }
    deriving (Eq, Ord, Read, Show, Data, Typeable)
$(deriveSerialize ''Blogger)
instance Version Blogger
The deriveSerialize splice automatically creates the instances for serializing and deserializing to/from a binary representation, for storage, transmission, etc. The Version instance is used for migration when the data type changes. (Since there is no previous version of this type to migrate from, we don't have to specify anything.) We create a similar type for the title of the blog post:
newtype Title = Title { unTitle :: String }
    deriving (Eq, Ord, Read, Show, Data, Typeable)
$(deriveSerialize ''Title)
instance Version Title
And a simple record which actually contains a blog post:
data Entry = Entry
    { title   :: Title
    , blogger :: Blogger
    , content :: String
    }
    deriving (Eq, Ord, Read, Show, Data, Typeable)
$(deriveSerialize ''Entry)
instance Version Entry
Obviously, it could be expanded to support tags, posted date, whether or not it is published, etc. Next we create an IxSet which holds all the Entries that have been posted:
$(inferIxSet "Entries" ''Entry 'noCalcs [''Blogger, ''Title])
An IxSet is a bit like a normal Set, except it has indexes, which you can use for performing queries. In this case, we use Blogger and Title as indexes. Next we define a component that actually stores the Entries:
instance Component Entries where
    type Dependencies Entries = End
    initialValue =
        fromList [ Entry { title   = Title "10 Reasons you should use Happstack."
                         , blogger = Blogger "stepcut"
                         , content = "..."
                         }
                 , Entry { title   = Title "Persistence made easy!"
                         , blogger = Blogger "Jeremy Shaw"
                         , content = "..."
                         }
                 ]
This component is prepopulated with 2 entries. Now we want to define a query which retrieves all the entries by a particular Blogger:
getEntriesByBlogger :: Blogger -> Query Entries Entries
getEntriesByBlogger blogger =
    do e <- ask
       return (e @= blogger)
The Query monad is essentially a specialized version of the Reader monad. So we use 'ask' to get the Entries from the Entries component. (@=) is an IxSet function which selects all the Entries with the specified blogger. Next we 'register' all the functions we want to use as queries for the Entries Component:
$(mkMethods ''Entries ['getEntriesByBlogger])
And finally, here is a main function which initializes the transaction system, performs a query, prints the results, and shuts the transaction system down:
main :: IO ()
main =
    bracket (startSystemState (Proxy :: Proxy Entries)) shutdownSystem $ \_ ->
        do postsByStepcut <- query (GetEntriesByBlogger (Blogger "stepcut"))
           print postsByStepcut
Note that there is no outside or additional configuration which needs to be done. If you have the happstack-state libraries installed on your system, then you can simply run this program. You do not need to configure or initialize any external database system. The queries and updates are thread-safe ACID transactions.

You can use almost any Haskell datatype declared using the normal Haskell syntax. Basically, if you could write a pair of Read/Show instances for the type, then you can probably use it directly with happstack-state. That means the type cannot contain functions, existentials, and a few other things, but Trees, etc., are no problem.

The queries and updates are just straightforward functions in the Reader and State monads, so there is no special query language or DSL that you need to learn. You have the full, expressive power of Haskell at your disposal.

- jeremy

That is pretty close to how it would look using happstack-state. Here is a complete, runnable example which defines the types, a query, creates/initializes the database, performs the query, and prints the results. [snip]
How is data stored in Happstack.State? I see the "Component" instance uses "fromList" from happstack-ixset, but I can't find any information on the algorithm used or its efficiency (computationally or with respect to space).

If making this more concrete helps, here is a possible use:

== GPS Points ==
I have a GPS logger that logs every 10 seconds when I jog. Jogging for an hour a day for the past 180 days has resulted in 64k points. Pretending I hosted a site for joggers (and all points were in the same DB), I could easily end up with a billion points (< 20K users). Would happstack-ixset code in the form "points @< (Lon -120) @> (Lon -125) @> (Lat 45) @< (Lat 50)" perform reasonably?

Cheers,
Thomas
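As a sketch of how such a Point might be indexed, following the IxSet API used in Jeremy's example above (the types here are made up):

{-# LANGUAGE DeriveDataTypeable, TemplateHaskell #-}
import Data.Data
import Happstack.Data.IxSet

newtype Lon = Lon Double deriving (Eq, Ord, Show, Data, Typeable)
newtype Lat = Lat Double deriving (Eq, Ord, Show, Data, Typeable)

data Point = Point { lon :: Lon, lat :: Lat }
             deriving (Eq, Ord, Show, Data, Typeable)

$(inferIxSet "Points" ''Point 'noCalcs [''Lon, ''Lat])

inBox :: Points -> Points
inBox ps = ps @> Lon (-125) @< Lon (-120) @> Lat 45 @< Lat 50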

In the current version of IxSet, the performance of querying on just
the Lon would be essentially the same as if you just had a "Data.Map
Lon Point". But queries on the second index are currently not so
great. There is work in progress to rewrite the internals of IxSet to
be based on a kd-tree, in which case your query should be pretty
efficient.
So, that answer is pretty vague :) I am in the process of wrapping up
happstack 0.6 which has focused on fixing some performance issues with
happstack-server, and refactoring the code so that user API and
internals are more clearly separated and better documented.
happstack 0.7 is all about happstack-state. A key aspect will be
nailing down some solid performance benchmarks instead of vague hand
waving :)
The numbers you give are certainly within the scope of what we would
like 0.7 to be able to handle. Also, I should note that
happstack-state and happstack-ixset are independent from each other.
You can easily use something other than IxSet to store your points and
still use happstack-state.
- jeremy

Thanks Jeremy, I just wrote up my own little analysis (below) while you were responding. I'll look for the kd-tree work; if I see discussion (and am stupid enough to heap more work onto my plate) then I might get involved.

Oops, didn't send...

Cheers,
Thomas

-----
So another glance tells me there is a list of maps (one element for each index method) and it uses Data.Map under the hood. So I have O(m lg n), where m is the number of index methods and n is the number of elements.

Space-wise, I think Data.Map takes up 6 words per Bin constructor (1 for the constructor, 1 for the 'Size', and one pointer indirection for each additional field), so the space is "6 * n * m * w", where w is the word size. This means indexing 1M entries by 5 methods takes about 256MB; assuming 28B per entry, that's 28MB of data + 256MB of indexing ~ 282MB needed.

Indexing my imaginary 1B points by user, date, lat, and lon is 6 * 2^30 * 4 * 8, or about 192GB of indexing + 28GB of data, for 220GB total. Obviously I shouldn't be talking about keeping a live data set of 28GB in memory, let alone indexing it all, but I was just curious about the ratio (220MB for 1M points, which is just one heavy user).
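Spelled out as a quick check of the estimate above (w = 8-byte words):

indexBytes :: Integer -> Integer -> Integer
indexBytes n m = 6 * n * m * 8   -- 6 words per Bin, n elements, m indexes

-- indexBytes (2^20) 5 = 251658240    (~250MB: the "about 256MB" above)
-- indexBytes (2^30) 4 = 206158430208 (= 192 GiB of indexing)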

Hi, Thomas.
Thanks Jeremy, I just wrote up my own little analysis (below) while you were responding. I'll look for the kd-tree work; if I see discussion (and am stupid enough to heap more work onto my plate) then I might get involved.
You can find the repository for the dynamic kd-tree implementation here [1]. I'm currently rewriting large parts of the core algorithms (balancing and multi-key traversal), and the implementation is far from being complete/usable. Once I'm done with these changes it's time for some serious benchmarking.

The kd-tree implementation does seem to scale: the last working version outperformed Data.Map w.r.t. space and time when considering large data sets (~1000000 elements). For single-key queries on small data sets, however, IxSet is currently still faster, while memory consumption is about the same.

I think the main advantage of using a kd-tree vs. multiple Data.Maps is that a query involving multiple keys can still happen in O(log n) time, as the tree needs to be traversed only once. Also, when an element is modified, most of the m Data.Maps need to be rebuilt (i.e. O(m*n*log n)) because the indexing information might be out of date. (This might have been optimized in recent versions of happstack-ixset.) For the kd-tree we can get away with rebalancing a subtree of some size k, which takes O(k*log k) time.

Peter

[1] http://darcs.monoid.at/kdtree/
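To make the single-traversal point concrete, here is a toy 2-d tree range query (illustrative only, not Peter's implementation). Each node splits on one axis; a box is followed into a subtree only when it can overlap it, so both keys are filtered in one walk:

data Axis = X | Y

data KdTree = Tip | Split Axis (Double, Double) KdTree KdTree

rangeQuery :: ((Double, Double), (Double, Double)) -> KdTree -> [(Double, Double)]
rangeQuery _ Tip = []
rangeQuery box@((x0, y0), (x1, y1)) (Split ax p@(px, py) l r) =
    [p | inBox] ++ goLeft ++ goRight
  where
    inBox = x0 <= px && px <= x1 && y0 <= py && py <= y1
    -- the coordinate this node splits on, and the box bounds on that axis
    (c, lo, hi) = case ax of
                    X -> (px, x0, x1)
                    Y -> (py, y0, y1)
    goLeft  = if lo <= c then rangeQuery box l else []
    goRight = if c <= hi then rangeQuery box r else []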

Hello,
Check out these threads:
http://groups.google.com/group/happs/browse_thread/thread/23d92e45c99f88b1
http://groups.google.com/group/happs/browse_thread/thread/0b0d0a9158c3ad73
There is nothing inherently wrong with keeping 28GB of real data in
memory. It depends largely on what you are trying to optimize for.
For personal hobby projects, the cost of hosting the project seems to
be the biggest expense that people are concerned with (which is
quite reasonable). As a result people are willing to increase
development time and sacrifice performance in order to lower hosting
costs. (Also reasonable.)
For commercial projects, the costs look different. For example, if
buying a server with 1TB of RAM (around $80,000 these days) means they
can hire one less developer (at $150,000/year when you look at total
cost of salary, benefits, etc), then that is a savings of $70,000.
Additionally, in-RAM storage provides much more predictable
performance. If you have ever attempted to use a site like reddit
after its memcached servers have been reset, it is clear how bad the
penalties for hitting disk are. RAM is significantly faster than
disk and has no seek time; it can achieve 100-1000x lower latency and
higher throughput.
Facebook reportedly keeps as much as 80-90% of their working dataset
in memcached. In 2008 they had 28TB of memcached servers. No idea what
they are up to now
(http://www.facebook.com/note.php?note_id=39391378919).
Of course, there are datasets which are simply too big to be stored in
RAM - for example, Google's search index. I am not that familiar with
their approach, but I believe they go to the opposite extreme: they
likely assume that every query is going to hit the disk, and optimize
the system so that it can provide acceptable response times even if
nothing is cached. (I am totally guessing here.)
The focus of the new IxSet internals is to help bring the indexing
overhead ratio down (and increase speed at the same time).
There are also some ideas, but not code, for how to build an
IxSet-like structure which stores the keys in RAM but can store the
values on disk. That would, hopefully, give you the ease of use of
IxSet but with lower memory requirements, if your keys are
significantly smaller than the total value payload. The trade-off is
that you get back into the ugly work of disk seeks :)
There was also some experimental work done where the values were
stored in RAM, but in their serialized byte format, under the
assumption that the serialized format is a lot more compact than the
normal representation. The trade-off is that the values must be
deserialized every time they are used, which requires more CPU. A more
complex version could be imagined, where the deserialized version is
cached for some period of time. Whether that is actually beneficial
can only be determined empirically, I think.
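A sketch of the "store serialized, decode on use" idea, using Data.Serialize (cereal) as the assumed interface; a cached variant would memoize the decoded value:

import qualified Data.ByteString as B
import Data.Serialize (Serialize, encode, decode)

-- Keep only the compact bytes in RAM; pay CPU on every access.
newtype Packed a = Packed B.ByteString

pack :: Serialize a => a -> Packed a
pack = Packed . encode

unpack :: Serialize a => Packed a -> a
unpack (Packed bs) = either error id (decode bs)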
So, clearly happstack-state is not the optimal choice for all Haskell
web applications, and that is why it is completely independent from
happstack-server. We do want to allow people to pick the best choice
for their application. Of course, if we want to provide off-the-shelf
components like a user account system, then eventually we have to
nail down some specifics, such as the persistence layer. But those
choices would only apply to code that wants to use those optional
components.
At the same time, happstack-state is a very interesting and
challenging library to work on. The recent rise in popularity of
things like redis seems to validate happstack-state somewhat.
- jeremy

Do you need to persistently store your Haskell data types in tables?
Or just persistently store them?
happstack-state provides the latter for you. Your data-types are just
plain-old Haskell data types and your queries and updates are just
functions in the Reader or State monad. It provides ACID properties
with write-ahead logging to the disk, S3, or other backends. It also
includes data-type versioning and automatic migration. Although it is
associated with the happstack web application framework, there is
nothing web specific about it. It does not even require that you build
or install any of the web stuff in Happstack.
To get a quick feel for how it works, I recommend this tutorial:
http://nhlab.blogspot.com/2008/07/extending-asterisk-with-happs.html
It is a little out of date -- but mostly you just need to change the
imports from HAppS to Happstack.
- jeremy
participants (10)

- Antoine Latter
- Chris Eidhof
- Christopher Done
- Felipe Lessa
- Jeremy Shaw
- Jonathan Geddes
- Michael Snoyman
- Ozgur Akgun
- Peter Robinson
- Thomas M. DuBuisson