relational data representation in memory using haskell?

I'm kind of stuck that's why I'm posting here to ask wether this makes sense at all, maybe someone else has already done it? What I'd like to have: Some way representing relational data which is typically stored in databases such as Postgresql.. Rewriting something like Postgresql in haskell would take ages.. So I'd be satisfied with having in memory representation only (this would fit the HAppS state system very well .. :) Why ? * type safety * less conversions compared to SQL data * no need to switch processes, parse SQL etc so maybe it's even faster? (a small benchmark showed that inserting 20000 Ints into a list was 8 times faster than using MySQL parsing 20000 INSERT INTO x (1) statements ) I'd like to illustrate two different ideas using a small example: (A) data CD = CD { title :: String, tracks :: [ Track ] } data Track = Track { track :: String, cd :: CD } data PDB = PDB { cds :: Set CD, tracks :: Set Track } because it's not using foreign ids but kind of pointers I'll call this the pointer method using uniq ids it would look like this: (B) data CD = CD { id : Int, title :: String, tracks :: [Int ] } data Track = Track { trackId :: Int, track :: String, cd :: Int } data IDB = IDB { cds :: Map Int CD, tracks :: Map Int Track } I will call it I DB (I = using ids) PDB: pro : * less work when doing joins (no need to look foreign rows up) con : * you need uniq ids or such when serializing to disk * When updating a track you'll also have to update the pointer stored in cds. and if you had another table shelfs.. this had to be updated as well.. IDB: the other way round I find the idea not using any lookups when using joins appealing. Of course having a simple data Table = Table Map UniqId Rec isn't enough, sometimes you need more than one index or even a multi index: data Table = Table { byId :: Map Int Rec , byNameAndAge :: Map String (Map Int (Set Rec)) } Note that I've used Set here as well because this index does'nt have to be uniq! starting to write an insertTable :: Table -> Rec -> Table more than twice is getting tedious.. Of course you can start using some type hackery to insert a rec into all maps automatically.. but you'll get into trouble making the type system apply the best index not the first matching one. (I bet this could be done using HList etc somehow as well.. ) So my current attempt is defining the database using some data types and make template haskell derive those insertIntoTable and update functions. I've added the draft below. But before continuing spending much time on it I'd like to get your advice: Is there a chance that it will pay off? Some general considerations: haskell solution con: haskell can get close to C but in general it may be >10 times slower when not caring too much about design or writing low level (see recent thread about md5 or one where David Roundy has said something about a matrix thread: only 10 times slower?) Using a garbage collector on database data (some hundred MB) might not be the optimal way because I feel you can tell exactly when you no longer need a piece of allocated memory here? So some time might be wasted. projects tend to run longer as expected.. And if data no longer fits into memory .. :(... -> bad performance I think systems such as postgresql do scale much better if you have some gbs of data and only use the most recent X records frequently.. So maybe you'll have to spend time later which you've won by using a haskell relational data representation in memory only.. Another solution: use clusters - I don't have any experience. pro: much more safety (STM, type system ..) there are less possibilities making compared to C / PHP etc Do you also think (A) is more interesting because some load (looking up foreign keys) is moved on insert / delete and update operations taking less time in but are called more frequently thus maybe reducing peak load on queries? Of course some time would have to be spend on queries wich might look like this: let queryresult = $(query ( tables + constraints + relations ) ) db automatically generating the query function taking into account expected index cardinality etc.. Any comments, suggestions, links to existing solutions (except coddfish, haskelldb) ? Marc Weber draft ============= types represeting tables and db ======================== module RDMH.Types where import Language.Haskell.TH import Language.Haskell.TH.Syntax data Uniqueness = Uniq | NotUniq deriving (Show, Eq) data ModifyMode = InsertOnly | UpdateInsert | UpdateInsertDelete deriving (Show, Eq) type TypeS = String -- a name of a data type (data A = ..) data Index = I { uniqueness :: Uniqueness , key :: Exp -- a fuction rec -> key , subkeys :: [ Index ] , keyType :: TypeS -- the key type (this *should* be determined by the function.. But I don't know yet how to do it.. Probably using some GHC API magic -- , expectedCardinality :: Int -- how should this look like exactly? This will be used to select the best indexes } deriving (Show, Eq) -- Int is negative weight / cardinality. -- If you issue a query having two contstraints on types beeing indexed the one having biggest cardinality is taken first -- first type:` -- convinient function to get the type of a name - see TestCase.hs t = ConT . mkName -- type (either key or row) k = VarE . mkName -- gets the key -- app (VarE $ mkName "fst" ) -- a table representation data Table = Table { tableName :: String , row :: TypeS -- this row type must be an instance of Eq , indexes :: [ Index ] -- [ ( [ mkName "Int", mkName "String" ]] is represented as Map Int ( Map String rec ) -- , history :: Bool -- to be implemented. Each time a record is updated the original row should be saved somehow somewhere , modify :: ModifyMode -- , insertTimeStamp :: Bool -- , updateTimeStamp :: Bool } deriving (Eq, Show) data DB = DB { dbName :: String , tables :: [ Table ] , oneToN :: [ (Table, Table, String) ] -- String must be set if you want two relations to the same table (eg one ticket has foreign inbound and outbound flight) -- , nToM :: [ ( Table, Table ) ] } ============= example use case ======================================= {-# OPTIONS_GHC -XTemplateHaskell #-} module TestCase where import Language.Haskell.TH import RDMH.Types import RDMH.TH import Data.Time -- the data data RCD = CD { title :: String , artist :: String , year :: Int } data RTrack = Track { trackTitle :: String , recordingDate :: UTCTime } -- a track table with two indexes title and recordingDate trackT = Table "track" (t "RTrack") -- row type -- indexes: [ I NotUniq (k "trackTitle" ) [] (t "String") , I NotUniq (k "recordingDate" ) [] (t "UTCTime") ] UpdateInsertDelete -- a cd table with two indexes title and year cdT = Table "cd" (t "RCD" ) -- row type -- indexes: [ I Uniq (k "title" ) [] (t "String") , I NotUniq (k "year" ) [] (t "Int") ] UpdateInsertDelete -- the db: both tables and a simple relation db = DB "cdDB" [ cdT, trackT ] [ OneToN cdT trackT "" ) ] $(createDB db)

At Thu, 22 May 2008 01:04:24 +0200, Marc Weber wrote:
Some way representing relational data which is typically stored in databases such as Postgresql..
Rewriting something like Postgresql in haskell would take ages.. So I'd be satisfied with having in memory representation only (this would fit the HAppS state system very well .. :)
Are you familiar with the HAppS IxSet library? You would do something like: $( deriveAll [''Ord,''Eq,''Read,''Show,''Default] [d| data CD = CD AlbumTitle Artist [Track] newtype Artist = Artist String newtype AlbumTitle = AlbumTitle String data Track = Track TrackTitle TrackIndex newtype TrackIndex = TrackIndex Int newtype TrackTitle = TrackTitle String |]) $(inferIxSet "CDS" 'noCalcs [''AlbumTitle, ''TrackTitle, ''Artist]) This creates a table with indexs on AlbumTitle, TrackTitle, and Artist. You can do a simple query like: mycds @= (Artist "Wesley Willis") to get all the tracks by Wesley Willis. You should be able to build joins, etc on top of that. j.

At Thu, 22 May 2008 01:04:24 +0200, Marc Weber wrote:
Some way representing relational data which is typically stored in databases such as Postgresql..
Rewriting something like Postgresql in haskell would take ages.. So I'd be satisfied with having in memory representation only (this would fit the HAppS state system very well .. :)
Are you familiar with the HAppS IxSet library? Yes - not with all that sybwith-class stuff though. There are some issues: its dynamic : doesn't this waste some CPU cycles? no multi indexes.. maybe some space leaks because the data type containing the Maps is build after each filter maybe leaving unevaluating chunks - Saizan has told me about it on HAppS.. And you can't extend it to the degree I'd
On Wed, May 21, 2008 at 05:05:21PM -0700, Jeremy Shaw wrote: like to (eg throw a query at it and let the system figure out which indexes to use) And last but not least: It does'nt support relations at all yet. So all the effort adding / checking foreign keys etc has to be done anyway. Thanks Marc

Consider SQLite [1], which is "a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine." It is embeddable, can reside completely in memory (including the data), and can be saved and restored to disk when needed. It neatly fills the niche between maps and a client/server database model. It has a C API which you can wrap as needed with the FFI, and you wouldn't need more than a dozen or so functions to start with (it understands SQL too). [1] http://www.sqlite.org/ Marc Weber wrote:
At Thu, 22 May 2008 01:04:24 +0200, Marc Weber wrote:
Some way representing relational data which is typically stored in databases such as Postgresql..
Rewriting something like Postgresql in haskell would take ages.. So I'd be satisfied with having in memory representation only (this would fit the HAppS state system very well .. :) Are you familiar with the HAppS IxSet library? Yes - not with all that sybwith-class stuff though. There are some issues: its dynamic : doesn't this waste some CPU cycles? no multi indexes.. maybe some space leaks because the data type containing the Maps is build after each filter maybe leaving unevaluating chunks - Saizan has told me about it on HAppS.. And you can't extend it to the degree I'd
On Wed, May 21, 2008 at 05:05:21PM -0700, Jeremy Shaw wrote: like to (eg throw a query at it and let the system figure out which indexes to use) And last but not least: It does'nt support relations at all yet. So all the effort adding / checking foreign keys etc has to be done anyway.
Thanks Marc _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Wed, May 21, 2008 at 06:07:15PM -0700, Dan Weston wrote:
Consider SQLite [1], which is "a software library that implements a [..] It has a C API which you can wrap as needed with the FFI, and you wouldn't need more than a dozen or so functions to start with (it understands SQL too). So it has kind of API enabling me inserting rows without using SQL? I still have to do some marshalling to / from C and synchronize db layout and haskell data types.
Marc Weber

2008/5/22 Marc Weber
I'd like to illustrate two different ideas using a small example: (A) data CD = CD { title :: String, tracks :: [ Track ] } data Track = Track { track :: String, cd :: CD } data PDB = PDB { cds :: Set CD, tracks :: Set Track }
because it's not using foreign ids but kind of pointers I'll call this the pointer method
This doesn't look like a relational structure at all in Haskell. Let's take the CD and Track relations. In a relational database you have something like: CD (1, 'Querying about you') Track (1, 'Inserting some love', 1) Track (2, 'Updating my feelings', 1) Track (3, 'Deleting my hopes', 1) In an imperative language you can do something similar in memory using objects (you can in haskell to with IORefs and so on, but let's stay on "data"). You get something like: 0x000 CD('Querying about you') 0x004 Track('Inserting some love, 0x004) ... In Haskell when you say:
data Track = Track { track :: String, cd :: CD }
You are not storing in Track a reference, a pointer or something similar to a CD, you are storing a *value* (low level you probably have a pointer, but you have not pointer semantics). As you noticed, you cannot "update" the CD title without changing each Track. That's a way to store information, and a good way too, but it's not a relational structure by any extent. If you want to use this structure for your relational data you need two things: 1) Something that will convert from a value-based representation of data to something relational (aka ORM in the OO world... a FRM? VRM?). 2) A relational storage (internal or external). If you want to use "normal" Haskell ADT, are you sure that a relational storage is what you want? Keeping that in memory doesn't give you some advantages of relational databases (e.g. uniform representation), and the impedance between the functional and the relational world is not easy to manage. Maybe I misunderstood what you are trying to accomplish, and you only want to do a generic data structure with fast lookups on the content of the items? Or do you really need relational semantics? Salvatore

On Thu, May 22, 2008 at 08:16:54AM +0200, Salvatore Insalaco wrote:
2008/5/22 Marc Weber
: I'd like to illustrate two different ideas using a small example: (A) data CD = CD { title :: String, tracks :: [ Track ] } data Track = Track { track :: String, cd :: CD } data PDB = PDB { cds :: Set CD, tracks :: Set Track }
because it's not using foreign ids but kind of pointers I'll call this the pointer method
This doesn't look like a relational structure at all in Haskell. Let's take the CD and Track relations. In a relational database you have something like: CD (1, 'Querying about you') Track (1, 'Inserting some love', 1) Track (2, 'Updating my feelings', 1) Track (3, 'Deleting my hopes', 1)
In an imperative language you can do something similar in memory using objects (you can in haskell to with IORefs and so on, but let's stay on "data"). You get something like:
0x000 CD('Querying about you') 0x004 Track('Inserting some love, 0x004) You are right. But ghc does represent those things as pointers.. :) So it is indeed unless you ask ghc to not use (eg by using strict fields etc), correct? IORefs are not that good because you can't read them within STM. But you are right: Using IORefs was my first idea.
similar to a CD, you are storing a *value* (low level you probably have a pointer, but you have not pointer semantics). As you noticed, you cannot "update" the CD title without changing each Track. That's a way to store information, and a good way too, but it's not a relational structure by any extent. I agree it's not in general and higly implementation dependant Maybe I'm totally wrong. I imagine ghc having some internal representation of data types which come close to
struct CD { Track * tracks; title ** char; } cd; struct Track { cd * CD; title ** char; } track; (does'nt compile but you get the idea. My C knowldge is good enough for reading only). So if you start seeing the whole database as list of connected structs vio pointers adding / deleting/ inserting is quite the same as adding some nodes to a Data.Map. You replace the nodes you have to replace and finally get a poiter pointing to te new database state. As long as you don't loose the original "pointer" you can easily rollpack. Consider let x = Cd ... forkIO $ ( do something with x } -- (1) print x -- (2) How can ghc know when running line (2) that (1) hasen't changed the record? I see two solutions: a) give the forked process a copy (Then my design will collapse) but this is expensive to copy data without knowing you ned to b) use pointers and replace x ony on updating. Thus if (1) changes the title a new struct wil be created poiting to the old list but a new title String. line (2) doesn't have to care at all. I'm guessing that analyzing a) when values have to be copied is complicated - thus b) is implemented ? quicksilver has told me so as well - But I'm not sure that's why I'm asking.
If you want to use "normal" Haskell ADT, are you sure that a relational storage is what you want? Keeping that in memory doesn't give you some advantages of relational databases (e.g. uniform representation), and the impedance between the functional and the relational world is not easy to manage.
Maybe I misunderstood what you are trying to accomplish, and you only want to do a generic data structure with fast lookups on the content of the items? Exactly.. Of course I only want a generic data structure with fast lookups and content of items.. But I don't want to write the insert delete and update functions for each table again and again.. Does this already exist?
Marc Weber

Consider
let x = Cd ... forkIO $ ( do something with x } -- (1) print x -- (2)
How can ghc know when running line (2) that (1) hasen't changed the record? I see two solutions: a) give the forked process a copy (Then my design will collapse) but this is expensive to copy data without knowing you ned to b) use pointers and replace x ony on updating. Thus if (1) changes the title a new struct wil be created poiting to the old list but a new title String. line (2) doesn't have to care at all.
GHC knows that because in Haskell isn't possible to "update" x. x is not a variable, it's a binding. To put it simply: with IORefs (and STRefs, MVars, ...) you have references to values that you can change (inside their respective monads), much like variables, but data declarations are values, not references to values (even if GHC stores them as pointers you cannot treat them as such), so you cannot update it. So, in your example, you have more or less a relation (CD) where all the "columns" are part of primary key (and so they are not mutable). Salvatore

Hi Salvatore On Thu, May 22, 2008 at 11:01:01AM +0200, Salvatore Insalaco wrote:
Consider
let x = Cd ... forkIO $ ( do something with x } -- (1) print x -- (2)
How can ghc know when running line (2) that (1) hasen't changed the record? I see two solutions: a) give the forked process a copy (Then my design will collapse) but this is expensive to copy data without knowing you ned to b) use pointers and replace x ony on updating. Thus if (1) changes the title a new struct wil be created poiting to the old list but a new title String. line (2) doesn't have to care at all.
GHC knows that because in Haskell isn't possible to "update" x. x is not a variable, it's a binding. To put it simply: with IORefs (and STRefs, MVars, ...) you have references to values that you can change (inside their respective monads), much like variables, but data declarations are values, not references to values (even if GHC stores them as pointers you cannot treat them as such), so you cannot update it. Sorry - maybe I'm unable to express using the correct terminology.. So I'll just give a small example how I think it could magically work?
data CD = CD { title :: String, tracks :: [ Track ] } data Track = Track { track :: String, cd :: CD } data PDB = PDB { cds :: Set CD, tracks :: Set Track } Let's fill the database with 1 track and a cd: 0x3 = pointer to DB rec 0x1: adress of CD 0x5: adress of Track 0x4, 0x9, 0x9: start adress of linked list connected by pointers.. In the final solution should use finger trees or such to speed up deletion / replacing elements 0x3 database: 0x8 cds : tuple1 0x1 : (0x6 "My song") (0x4 [ 0x5, ... ]) ^ pointer to str ^ pointer to track list, 0x5 = pointer to track 0x9 tracks: tuple1 0x5 : ( 0x7 "track 1") 0x1 ^ reference to cd Now I query the track, and "update" it (replacing the title).. It's a little bit tricky, because when updating the track I need to update the cd as well (circular referency). All new pointers are starting from 0x20 So in haskell it would look like this: let updatedCd = 0x22 CD (0x6 "My song") (0x20 ( 0x23 : ...) updatedTrack = 0x23 Track ( 0x21 "updated track title" ) 0x22 in (0x27) DB (0x24 (updatedCd:otherCds)) (0x25 (updatedTrack:otherTracks)) Now my new address to access the database is 0x25. So pretty every adress has been changed but 0x6, ..., otherCds and otherTracks A query running using db 0x3 will not notice any change on its snapshot. Are these actions called rebinding? Of course if you have a lot of relations writing this let in will become tedious and error prone.. That's why I'd like to use template haskell to automatically derive it. Thanks for listening Marc Weber

2008/5/22 Marc Weber
So in haskell it would look like this: let updatedCd = 0x22 CD (0x6 "My song") (0x20 ( 0x23 : ...) updatedTrack = 0x23 Track ( 0x21 "updated track title" ) 0x22 in (0x27) DB (0x24 (updatedCd:otherCds)) (0x25 (updatedTrack:otherTracks))
Mmmm I don't think that this is a good way to go. Let me do a counter-example: data A = A String data B = B String [A] data C = C String [B] data D = D String [C] Suppose to have some As, Bs, Cs, Ds in your database. Now you want to "update" the String of A. As you cannot "update" stuff in Haskell mantaining the same pointer, you've got a "new A". So you must find all Bs that had this A in their list, and update that. Unfortunately lists are not mutable too, so you are creating a new list; so you need to create new containing Bs too. But then you must change Cs... and so on. A little change like changing the String in A requires updating the whole "DB". Salvatore

On Thu, May 22, 2008 at 12:48:42PM +0200, Salvatore Insalaco wrote:
2008/5/22 Marc Weber
: So in haskell it would look like this: let updatedCd = 0x22 CD (0x6 "My song") (0x20 ( 0x23 : ...) updatedTrack = 0x23 Track ( 0x21 "updated track title" ) 0x22 in (0x27) DB (0x24 (updatedCd:otherCds)) (0x25 (updatedTrack:otherTracks))
Mmmm I don't think that this is a good way to go. Let me do a counter-example:
data A = A String data B = B String [A] data C = C String [B] data D = D String [C]
A little change like changing the String in A requires updating the whole "DB". You're right. Very bad idea unless you only insert once a year and only have queries the whole day. The only way to fix this is by separating relational data from record data. data A = Map RecordDataA RelationalDataA data B = Map RecordDataB RelationalDataB
So when changing a field in RecordDataA only the relational data B would have to be updated.. but I see that that's not that good either. Fine. Then the only way to go is using uniq ids as keys the way it's already done everywhere Thanks Marc W

"Salvatore Insalaco"
This doesn't look like a relational structure at all in Haskell.
I believe you are abusing terminology here. 'Relation' refers to a table (since it represents a subset of AxBxC.., i.e. a relation), not to references between tables. Mutability and mutability of references is of course important in most relational databases, but I'm not convinced an immutable database wouldn't be interesting and useful in a functional programming language. I've always (well not that I use them often) been annoyed at RDBMS lack of discriminated unions. The TH based approach by HAppS looks cool, but I think simply a slightly more general Data.Map (supporting multiple indices, search by named field and so on) could be a useful thing. -k -- If I haven't seen further, it is by standing in the footprints of giants

"Salvatore Insalaco"
writes: This doesn't look like a relational structure at all in Haskell.
I believe you are abusing terminology here. 'Relation' refers to a Yes. Sorry. I thought the relational in relational databases refers to references between tables. But you a right a relation is a set of rows
On Thu, May 22, 2008 at 10:56:03AM +0200, Ketil Malde wrote: forming a table. Thanks for clarifying. Thanks Marc

to whoever in this thread hasn't realized it: Map String (Map Int Foo) == Map (String,Int) Foo (at least to an approximation) -Isaac

On Thu, May 22, 2008 at 09:11:28AM -0400, Isaac Dupree wrote:
to whoever in this thread hasn't realized it: Map String (Map Int Foo) == Map (String,Int) Foo (at least to an approximation)
That's note quite right.. if you care only about the Int -> String -> [ Rec ] lookup you are right. But When using Map String (Map Int Rec) you can still use the first index also you'll have to spend some time joining all sets found in the second one.. Marc

On Thu, May 22, 2008 at 03:34:36PM +0200, Marc Weber wrote:
On Thu, May 22, 2008 at 09:11:28AM -0400, Isaac Dupree wrote:
to whoever in this thread hasn't realized it: Map String (Map Int Foo) == Map (String,Int) Foo (at least to an approximation) There is another difference if you want to query <,<=,>,>= say String = city and Int = age. Now take Map (Int, String) rec and use this index to filter all tuples having an age >= 80 and beeing "city"
Marc

Marc Weber wrote:
On Thu, May 22, 2008 at 03:34:36PM +0200, Marc Weber wrote:
On Thu, May 22, 2008 at 09:11:28AM -0400, Isaac Dupree wrote:
to whoever in this thread hasn't realized it: Map String (Map Int Foo) == Map (String,Int) Foo (at least to an approximation) There is another difference if you want to query <,<=,>,>= say String = city and Int = age. Now take Map (Int, String) rec and use this index to filter all tuples having an age >= 80 and beeing "city"
there are two problems: Data.Map doesn't have a very good API for that (splitLookup is about the best you can get for ranges) Whether tupled or not, the order of the two indices matters (Int,String) vs. (String,Int) for what you can look up efficiently. It's essentially a binary tree either way (Map x (Map y rec)) or (Map (x,y) rec), sorted in the same order. (tuples sort by lexicographical order) -Isaac
participants (6)
-
Dan Weston
-
Isaac Dupree
-
Jeremy Shaw
-
Ketil Malde
-
Marc Weber
-
Salvatore Insalaco