
Cafe,

HaskellDB takes a database schema and produces Haskell data structures (plus some other query-related machinery for its EDSL query language).

What I'm looking for is the inverse of this functionality: I want to create tables from a Haskell data structure, following a few simple rules. For example: if a field is not of the form `Maybe a' then it cannot be nullable in the database; if a field is not a primitive (in the database) then it is stored in another table and a reference id is stored in this one; tables are produced recursively unless they already exist; and so on.

The HaskellDB approach is great for interfacing with existing tables, but in my case I already have the data structures and would like a quick way to create tables to persist them.

Does such a thing exist? If not, would you find it useful? I may take this up as a side project if it does not already exist and others would find it useful.

Thanks,
--Jonathan
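For concreteness, here is a hypothetical record and the tables those rules would produce (the names and DDL below are illustrative, not from any existing library):

data Address = Address
  { street :: String
  , city   :: String
  }

data Person = Person
  { name     :: String          -- not Maybe: NOT NULL column
  , nickname :: Maybe String    -- Maybe: nullable column
  , address  :: Address         -- non-primitive: lives in its own table,
                                -- referenced here by id
  }

-- Expected DDL, roughly:
--   CREATE TABLE address (id INTEGER PRIMARY KEY,
--                         street TEXT NOT NULL, city TEXT NOT NULL);
--   CREATE TABLE person  (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
--                         nickname TEXT, address_id INTEGER REFERENCES address);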

That sounds pretty awesome to me.
Have you given any thought as to how you want to approach versioning?
Maybe I'm asking a silly question - I have very little real-world experience
with relational databases and how to version schemas.
Antoine

Versioning is a tricky problem regardless of how you are creating
tables, and it isn't the one I was aiming to tackle. The problem I
have in mind is narrower than that: I have a record and now I need a
table to stick it in.
By the way, how does HaskellDB handle versioning?
--Jonathan

Hey Jonathan,

I've done some work on this. The hard part is defining relationships between datatypes: how do you model this in Haskell? I have some code on github that you might be interested in: http://github.com/chriseidhof/persist

-chris

How do you define relationships between data types?
Well, why is it any different from other fields? From one of your examples
[1], I'd expect you to have a list of questions in the Quiz data type, and
if necessary, a quiz field in the Question data type. This might be a bit
tricky but certainly achievable [2].
Something like the following:
data Quiz = Quiz
  { description :: String
  , subject     :: String
  , questions   :: [Question]
  } deriving (Show, Read)

data Question = Question
  { title   :: String
  , choiceA :: String
  , choiceB :: String
  , choiceC :: String
  , quiz    :: Quiz
  } deriving (Show, Read)
[1] http://github.com/chriseidhof/persist/blob/master/examples/Model.phs
[2] http://www.haskell.org/haskellwiki/Tying_the_Knot
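For illustration, one way to "tie the knot" [2] by hand with these types (the values are made up):

aQuiz :: Quiz
aQuiz = Quiz
  { description = "Geography"
  , subject     = "Capitals"
  , questions   = [q1]
  }
  where
    q1 = Question
      { title   = "Capital of France?"
      , choiceA = "Paris"
      , choiceB = "Lyon"
      , choiceC = "Nice"
      , quiz    = aQuiz   -- back-reference makes the structure cyclic
      }

(Note that show aQuiz would never terminate, since the value is cyclic.)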
-- Ozgur Akgun

On 28 sep 2010, at 17:33, Ozgur Akgun wrote:
How do you define relationships between data types?
Well, why is it any different from other fields? From one of your examples [1], I'd expect you to have a list of questions in the Quiz data type, and if necessary, a quiz field in the Question data type. This might be a bit tricky but certainly achievable [2].
This is really tricky. For example, consider storing a large tree in the database:
data Tree = Node Int Tree Tree | Leaf Int
This means you need to read the entire tree from the database. Or consider cyclic data structures (such as the example you gave): how do you store those? The only way to inspect such structures is with a library like data-reify [1]. I think the problem might be a bit harder than you suspect.

Another way to solve it is using Sebastiaan Visser's framework, described in his paper [2], but that's also rather complicated.

-chris

[1]: http://hackage.haskell.org/package/data-reify
[2]: http://github.com/downloads/sebastiaanvisser/msc-thesis/wgp10-genstorage.pdf
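data-reify [1] is a real library; as a sketch (not Chris's code), inspecting the Tree type with it might look like this:

{-# LANGUAGE TypeFamilies #-}
import Control.Applicative ((<$>), (<*>), pure)
import Data.Reify

data Tree = Node Int Tree Tree | Leaf Int

-- Pattern functor: Tree with the recursive positions abstracted out.
data TreeF r = NodeF Int r r | LeafF Int deriving Show

instance MuRef Tree where
  type DeRef Tree = TreeF
  mapDeRef f (Node n l r) = NodeF n <$> f l <*> f r
  mapDeRef _ (Leaf n)     = pure (LeafF n)

main :: IO ()
main = do
  let t = Node 1 t (Leaf 2)        -- cyclic: t is its own left child
  Graph nodes root <- reifyGraph t
  print (root, nodes)              -- e.g. (1,[(2,LeafF 2),(1,NodeF 1 1 2)])

reifyGraph observes the sharing in IO and returns a finite graph of numbered nodes, which is exactly the shape you would need in order to store such a structure as rows.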

OK, I am rephrasing it a bit then :)

I definitely don't think this would be trivial to implement. However, I'd
expect a decent solution to this problem not to have special combinators
for describing relations between data types, but to let the user model
their data using plain Haskell data types, and infer the associated table
structure just by looking at those types.

I'll give this more thought once I find the time. There is the huge
barrier of TH stopping me from playing with things like this.

Anyway, have fun! :)
-- Ozgur Akgun

I think this approach is not possible without involving some fairly
ugly unsafeInterleaveIO/unsafePerformIO calls. A simple example from
web programming: suppose I have a multi-user blog site, where each
user can have multiple entries. I would model this using standard
Haskell datatypes as:
data Entry = Entry { title :: String, content :: String }
data Blogger = Blogger { name :: String, entries :: [Entry] }
Obviously we'll need some kind of blogger loading function:
getBloggerByName :: String -> IO Blogger
Either this will load up all entries (a potentially incredibly costly
operation) or use unsafe IO down the road. Especially when using
database connections, this can be incredibly bad: the connection could
be closed, the SQL statement could be reused by another request, etc.
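For concreteness, here is a sketch of the unsafe variant being warned about; Connection and queryEntries are hypothetical stand-ins, not from any particular library:

import System.IO.Unsafe (unsafeInterleaveIO)

data Entry = Entry { title :: String, content :: String }
data Blogger = Blogger { name :: String, entries :: [Entry] }

data Connection = Connection                 -- hypothetical DB handle

queryEntries :: Connection -> String -> IO [Entry]
queryEntries = undefined                     -- stand-in for a real SQL query

getBloggerByName :: Connection -> String -> IO Blogger
getBloggerByName conn n = do
  -- The query does not run here: it runs whenever `entries` is first
  -- forced, possibly long after `conn` has been closed.
  es <- unsafeInterleaveIO (queryEntries conn n)
  return (Blogger n es)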
My persistent library follows a similar approach to what Chris
describes, though with a very different syntax. You can see a very
similar example on the documentation site[1].
Michael
[1] http://docs.yesodweb.com/book/persistent/#relations

It may be possible to tag those data fields that are not to be loaded on the spot. For example,
data Entry = Entry { title :: String, content :: String }

data Blogger db = Blogger { name :: String, entries :: OnDB db [Entry] }

class Monad db => Database db where
  data OnDB db :: * -> *
  fetch     :: OnDB db a -> db a
  fetchSome :: Criteria a -> OnDB db [a] -> db [a]

newtype InMemory a = InMemory a

instance Database InMemory where
  newtype OnDB InMemory a = OnDBMem a
  fetch (OnDBMem x) = return x
  fetchSome = ...

instance Database SQL where ...
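A possible use of the sketch above, fetching the deferred entries inside the database monad (showTitles is hypothetical):

showTitles :: Database db => Blogger db -> db [String]
showTitles b = do
  es <- fetch (entries b)
  return (map title es)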
Cheers, -- Felipe.

While we're on the topic of databases, I really wanted to try out querying/inserting from/to my database with records like this: http://hpaste.org/40240/db_library_approach

Define a record:

data Person f = Person
  { pid        :: f Integer
  , firstName  :: f String
  , middleName :: f (Maybe String)
  , lastName   :: f String
  , age        :: f Integer
  } deriving (Typeable, Data)

Then I'd query it like:

personById :: Integer -> Query Person
personById i = Person
  { pid        = constant i
  , firstName  = anything
  , middleName = anything
  , lastName   = anything
  , age        = anything
  }

Or, with a Data.Default instance:

personById :: Integer -> Query Person
personById i = def { pid = constant i }

But I have yet to figure out how to derive a Typeable instance for such a type. I don't want to write any instances of anything. Technically I can get the field names and values using a Data.Data.Data instance, but I don't know; maybe I should make a TypeablePolymorphicKinds class or something and try to derive for it. UHC's generic deriving would probably be good for something like this, but I want GHC.
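As an aside on the `Person f` trick: the same record can be read two ways by instantiating f differently. A minimal illustration, using Identity for concrete rows and Maybe as a stand-in for the constant/anything query combinators above:

import Data.Functor.Identity (Identity(..))

data Person f = Person
  { pid       :: f Integer
  , firstName :: f String
  }

-- A concrete row: every field is present.
john :: Person Identity
john = Person (Identity 1) (Identity "John")

-- A query: Just pins a field, Nothing matches anything.
personById :: Integer -> Person Maybe
personById i = Person (Just i) Nothing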

On Thu, Sep 30, 2010 at 4:35 PM, Felipe Lessa wrote:
I wasn't claiming my approach was the *only* approach, just stating that it doesn't seem feasible to use the "simple" Haskell data type declarations. Michael

On Wed, Sep 29, 2010 at 5:21 AM, Michael Snoyman wrote:

  getBloggerByName :: String -> IO Blogger
That is pretty close to how it would look using happstack-state. Here is a complete, runnable example which defines the types, a query, creates/initializes the database, performs the query, and prints the results.
{-# LANGUAGE DeriveDataTypeable, FlexibleContexts, MultiParamTypeClasses,
             TemplateHaskell, TypeSynonymInstances, TypeFamilies #-}
module Main where
import Control.Exception (bracket)
import Control.Monad.Reader (ask)
import Data.Data
import Happstack.Data
import Happstack.Data.IxSet
import Happstack.State
A simple type to identify a particular blogger:
newtype Blogger = Blogger { name :: String }
    deriving (Eq, Ord, Read, Show, Data, Typeable)
$(deriveSerialize ''Blogger)
instance Version Blogger
The deriveSerialize splice automatically creates the instances for serializing and deserializing to/from a binary representation, for storage, transmission, etc. The Version instance is used for migration when the data type changes. (Since there is no previous version of this type to migrate from, we don't have to specify anything.) We create a similar type for the title of the blog post:
newtype Title = Title { unTitle :: String }
    deriving (Eq, Ord, Read, Show, Data, Typeable)
$(deriveSerialize ''Title)
instance Version Title
And a simple record which actually contains a blog post:
data Entry = Entry
    { title   :: Title
    , blogger :: Blogger
    , content :: String
    }
    deriving (Eq, Ord, Read, Show, Data, Typeable)
$(deriveSerialize ''Entry)
instance Version Entry
Obviously, it could be expanded to support tags, posted date, whether or not it is published, etc. Next we create an IxSet which holds all the Entries that have been posted:
$(inferIxSet "Entries" ''Entry 'noCalcs [''Blogger, ''Title])
An IxSet is a bit like a normal Set, except it has indexes, which you can use for performing queries. In this case, we use Blogger and Title as indexes. Next we define a component that actually stores the Entries:
instance Component Entries where
    type Dependencies Entries = End
    initialValue =
        fromList [ Entry { title   = Title "10 Reasons you should use Happstack."
                         , blogger = Blogger "stepcut"
                         , content = "..."
                         }
                 , Entry { title   = Title "Persistence made easy!"
                         , blogger = Blogger "Jeremy Shaw"
                         , content = "..."
                         }
                 ]
This component is prepopulated with 2 entries. Now we want to define a query which retrieves all the entries by a particular Blogger:
getEntriesByBlogger :: Blogger -> Query Entries Entries
getEntriesByBlogger blogger =
    do e <- ask
       return (e @= blogger)
The Query monad is essentially a specialized version of the Reader monad. So we use 'ask' to get the Entries from the Entries component. (@=) is an IxSet function which selects all the Entries with the specified blogger. Next we 'register' all the functions we want to use as queries for the Entries Component:
$(mkMethods ''Entries ['getEntriesByBlogger])
And finally, here is a main function which initializes the transaction system, performs a query, prints the results, and shuts the transaction system down:
main :: IO ()
main =
    bracket (startSystemState (Proxy :: Proxy Entries)) shutdownSystem $ \_ ->
        do postsByStepcut <- query (GetEntriesByBlogger (Blogger "stepcut"))
           print postsByStepcut
Note that there is no outside or additional configuration which needs to be done. If you have the happstack-state libraries installed on your system, then you can simply run this program. You do not need to configure or initialize any external database system. The queries and updates are thread-safe ACID transactions.

You can use almost any Haskell datatype declared using the normal Haskell syntax. Basically, if you could write a pair of Read/Show instances for the type, then you can probably use it directly with happstack-state. That means the type cannot contain functions, existentials, and a few other things, but Trees, etc., are no problem.

The queries and updates are just straightforward functions in the Reader and State monads, so there is no special query language or DSL that you need to learn. You have the full, expressive power of Haskell at your disposal.

- jeremy

That is pretty close to how it would look using happstack-state. Here is a complete, runnable example which defines the types, a query, creates/initializes the database, performs the query, and prints the results. [snip]
How is data stored in Happstack.State? I see the "Component" instance uses "fromList" from happstack-ixset, but I can't find any information on the algorithm used or its efficiency (computationally or with respect to space).

If making this more concrete helps, here is a possible use:

== GPS Points ==
I have a GPS logger that logs every 10 seconds when I jog. Jogging for an hour a day for the past 180 days has resulted in 64k points. Pretending I hosted a site for joggers (and all points were in the same DB), I could easily end up with a billion points (< 20K users). Would happstack-ixset code in the form "points @< (Lon -120) @> (Lon -125) @> (Lat 45) @< (Lat 50)" perform reasonably?

Cheers,
Thomas
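As a sketch of how such a Point might be indexed, following the IxSet API used in Jeremy's example above (the types here are made up):

{-# LANGUAGE DeriveDataTypeable, TemplateHaskell #-}
import Data.Data
import Happstack.Data.IxSet

newtype Lon = Lon Double deriving (Eq, Ord, Show, Data, Typeable)
newtype Lat = Lat Double deriving (Eq, Ord, Show, Data, Typeable)

data Point = Point { lon :: Lon, lat :: Lat }
             deriving (Eq, Ord, Show, Data, Typeable)

$(inferIxSet "Points" ''Point 'noCalcs [''Lon, ''Lat])

inBox :: Points -> Points
inBox ps = ps @> Lon (-125) @< Lon (-120) @> Lat 45 @< Lat 50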

In the current version of IxSet, the performance of querying on just
the Lon would be essentially the same as if you just had a "Data.Map
Lon Point". But queries on the second index are currently not so
great. There is work in progress to rewrite the internals of IxSet to
be based on a kd-tree, in which case your query should be pretty
efficient.
So, that answer is pretty vague :) I am in the process of wrapping up
happstack 0.6 which has focused on fixing some performance issues with
happstack-server, and refactoring the code so that user API and
internals are more clearly separated and better documented.
happstack 0.7 is all about happstack-state. A key aspect will be
nailing down some solid performance benchmarks instead of vague hand
waving :)
The numbers you give are certainly within the scope of what we would
like 0.7 to be able to handle. Also, I should note that
happstack-state and happstack-ixset are independent from each other.
You can easily use something other than IxSet to store your points and
still use happstack-state.
- jeremy

Thanks Jeremy, I just wrote up my own little analysis (below) while you were responding. I'll look for the kd-tree work; if I see discussion (and am stupid enough to heap more work onto my plate) then I might get involved.

Oops, didn't send...

Cheers,
Thomas

-----
So another glance tells me there is a list of maps (one element for each index method) and it uses Data.Map under the hood. So I have O(m lg n), where m is the number of index methods and n is the number of elements.

Space-wise, I think Data.Map takes up 6 words per Bin constructor (1 for the constructor, 1 for the 'Size', and one pointer indirection for each additional field), so the space is "6 * n * m * w", where w is the word size. This means indexing 1M entries by 5 methods takes about 256MB; assuming 28B per entry, that's 28MB of data + 256MB of indexing ~ 282MB needed.

Indexing my imaginary 1B points by user, date, lat, and lon is 6 * 2^30 * 4 * 8, or about 192GB of indexing + 28GB of data, for 220GB total. Obviously I shouldn't be talking about keeping a live data set of 28GB in memory, let alone indexing it all, but I was just curious about the ratio (220MB for 1M points, which is just one heavy user).
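Spelled out as a quick check of the estimate above (w = 8-byte words):

indexBytes :: Integer -> Integer -> Integer
indexBytes n m = 6 * n * m * 8   -- 6 words per Bin, n elements, m indexes

-- indexBytes (2^20) 5 = 251658240    (~250MB: the "about 256MB" above)
-- indexBytes (2^30) 4 = 206158430208 (= 192 GiB of indexing)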

Hi, Thomas.
Thanks Jeremy, I just wrote up my own little analysis (below) while you were responding. I'll look for the kd-tree work; if I see discussion (and am stupid enough to heap more work onto my plate) then I might get involved.
You can find the repository for the dynamic kd-tree implementation here [1]. I'm currently rewriting large parts of the core algorithms (balancing and multi-key traversal), and the implementation is far from being complete/usable. Once I'm done with these changes it's time for some serious benchmarking.

The kd-tree implementation does seem to scale: the last working version outperformed Data.Map w.r.t. space and time when considering large data sets (~1000000 elements). For single-key queries on small data sets, however, IxSet is currently still faster, while memory consumption is about the same.

I think the main advantage of using a kd-tree vs. multiple Data.Maps is that a query involving multiple keys can still happen in O(log n) time, as the tree needs to be traversed only once. Also, when an element is modified, most of the m Data.Maps need to be rebuilt (i.e. O(m*n*log n)) because the indexing information might be out of date. (This might have been optimized in recent versions of happstack-ixset.) For the kd-tree we can get away with rebalancing a subtree of some size k, which takes O(k*log k) time.

Peter

[1] http://darcs.monoid.at/kdtree/
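To make the single-traversal point concrete, here is a toy 2-d tree range query (illustrative only, not Peter's implementation). Each node splits on one axis; a box is followed into a subtree only when it can overlap it, so both keys are filtered in one walk:

data Axis = X | Y

data KdTree = Tip | Split Axis (Double, Double) KdTree KdTree

rangeQuery :: ((Double, Double), (Double, Double)) -> KdTree -> [(Double, Double)]
rangeQuery _ Tip = []
rangeQuery box@((x0, y0), (x1, y1)) (Split ax p@(px, py) l r) =
    [p | inBox] ++ goLeft ++ goRight
  where
    inBox = x0 <= px && px <= x1 && y0 <= py && py <= y1
    -- the coordinate this node splits on, and the box bounds on that axis
    (c, lo, hi) = case ax of
                    X -> (px, x0, x1)
                    Y -> (py, y0, y1)
    goLeft  = if lo <= c then rangeQuery box l else []
    goRight = if c <= hi then rangeQuery box r else []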

Hello,
Check out these threads:
http://groups.google.com/group/happs/browse_thread/thread/23d92e45c99f88b1
http://groups.google.com/group/happs/browse_thread/thread/0b0d0a9158c3ad73
There is nothing inherently wrong with keeping 28GB of real data in
memory. It depends largely on what you are trying to optimize for.
For personal hobby projects, the cost of hosting the project seems to
be the biggest expense that people are concerned with (which is
quite reasonable). As a result people are willing to increase
development time and sacrifice performance in order to lower hosting
costs. (Also reasonable.)
For commercial projects, the costs look different. For example, if
buying a server with 1TB of RAM (around $80,000 these days) means they
can hire one less developer (at $150,000/year when you look at total
cost of salary, benefits, etc), then that is a savings of $70,000.
Additionally, in-RAM storage provides much more predictable
performance. If you have ever attempted to use a site like reddit
after its memcached servers have been reset, it is clear how bad the
penalties for hitting disk are. RAM is significantly faster than
disk and has no seek time; it can achieve 100-1000x lower latency and
higher throughput.
Facebook reportedly keeps as much as 80-90% of their working dataset
in memcached. In 2008 they had 28TB of memcached servers. No idea what
they are up to now
(http://www.facebook.com/note.php?note_id=39391378919).
Of course, there are datasets which are simply too big to be stored in
RAM - for example, Google's search index. I am not that familiar with
their approach, but I believe they go to the opposite extreme: they
likely assume that every query is going to hit the disk, and optimize
the system so that it can provide acceptable response times even if
nothing is cached. (I am totally guessing here.)
The focus of the new IxSet internals is to help bring the indexing
overhead ratio down (and increase speed at the same time).
There are also some ideas, but not code, for how to build an
IxSet-like structure which stores the keys in RAM but can store the
values on disk. That would, hopefully, give you the ease of use of
IxSet but with lower memory requirements, if your keys are
significantly smaller than the total value payload. The trade-off is
that you get back into the ugly work of disk seeks :)
There was also some experimental work done where the values were
stored in RAM, but in their serialized byte format, under the
assumption that the serialized format is a lot more compact than the
normal representation. The trade-off is that the values must be
deserialized every time they are used, which requires more CPU. A more
complex version could be imagined, where the deserialized version is
cached for some period of time. Whether that is actually beneficial
can only be determined empirically, I think.
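A sketch of the "store serialized, decode on use" idea, using Data.Serialize (cereal) as the assumed interface; a cached variant would memoize the decoded value:

import qualified Data.ByteString as B
import Data.Serialize (Serialize, encode, decode)

-- Keep only the compact bytes in RAM; pay CPU on every access.
newtype Packed a = Packed B.ByteString

pack :: Serialize a => a -> Packed a
pack = Packed . encode

unpack :: Serialize a => Packed a -> a
unpack (Packed bs) = either error id (decode bs)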
So, clearly happstack-state is not the optimal choice for all Haskell
web applications, and that is why it is completely independent from
happstack-server. We do want to allow people to pick the best choice
for their application. Of course, if we want to provide off-the-shelf
components like a user account system, then eventually we have to
nail down some specifics, such as the persistence layer. But those
choices would only apply to code that wants to use those optional
components.
At the same time, happstack-state is a very interesting and
challenging library to work on. The recent rise in popularity of
things like redis seems to validate happstack-state somewhat.
- jeremy

Do you need to persistently store your Haskell data types in tables?
Or just persistently store them?
happstack-state provides the latter for you. Your data-types are just
plain-old Haskell data types and your queries and updates are just
functions in the Reader or State monad. It provides ACID properties
with write-ahead logging to the disk, S3, or other backends. It also
includes data-type versioning and automatic migration. Although it is
associated with the happstack web application framework, there is
nothing web specific about it. It does not even require that you build
or install any of the web stuff in Happstack.
To get a quick feel for how it works, I recommend this tutorial:
http://nhlab.blogspot.com/2008/07/extending-asterisk-with-happs.html
It is a little out of date -- but mostly you just need to change the
imports from HAppS to Happstack.
- jeremy
participants (10)

- Antoine Latter
- Chris Eidhof
- Christopher Done
- Felipe Lessa
- Jeremy Shaw
- Jonathan Geddes
- Michael Snoyman
- Ozgur Akgun
- Peter Robinson
- Thomas M. DuBuisson