[Haskell-cafe] relational data representation in memory using haskell?

21 May 2008

      I'm kind of stuck that's why I'm posting here to ask wether this makes
sense at all, maybe someone else has already done it?

What I'd like to have:

Some way representing relational data which is typically stored in
databases such as Postgresql..

Rewriting something like Postgresql in haskell would take ages..
So I'd be satisfied with having in memory representation only (this
would fit the HAppS state system very well .. :)
Why ?
* type safety
* less conversions compared to SQL data
* no need to switch processes, parse SQL etc so maybe it's even faster?
  (a small benchmark showed that inserting 20000 Ints into a list was 8
   times faster than using MySQL parsing 20000 INSERT INTO x (1)
   statements )

I'd like to illustrate two different ideas using a small example:
(A)
        data CD = CD { title :: String, tracks :: [ Track ] }
        data Track = Track { track :: String, cd :: CD }
        data PDB = PDB { cds :: Set CD, tracks :: Set Track }

because it's not using foreign ids but kind of pointers I'll call this
the pointer method

using uniq ids it would look like this:
(B)
        data CD = CD { id : Int,  title :: String, tracks :: [Int ] }
        data Track = Track { trackId :: Int,  track :: String, cd :: Int }
        data IDB = IDB { cds :: Map Int CD, tracks :: Map Int Track }
I will call it I DB (I = using ids)

PDB: pro : * less work when doing joins (no need to look foreign rows up)
     con : * you need uniq ids or such when serializing to disk
           * When updating a track you'll also have to update the pointer
             stored in cds. and if you had another table shelfs.. this
             had to be updated as well..

IDB: the other way round

I find the idea not using any lookups when using joins appealing.

Of course having a simple
        data Table = Table Map UniqId Rec 

isn't enough, sometimes you need more than one index or even a multi index:
        data Table = Table { byId :: Map Int Rec
                     , byNameAndAge :: Map String (Map Int (Set Rec)) }

Note that I've used Set here as well because this index does'nt have to
be uniq! starting to write an
insertTable :: Table -> Rec -> Table
more than twice is getting tedious..

Of course you can start using some type hackery to insert a rec
into all maps automatically.. but you'll get into trouble making
the type system apply the best index not the first matching one.
(I bet this could be done using HList etc somehow as well.. )
So my current attempt is defining the database using some data types and
make template haskell derive those insertIntoTable and update functions.

I've added the draft below. But before continuing spending much time on
it I'd like to get your advice: Is there a chance that it will pay off?

Some general considerations:
haskell solution con:
        haskell can get close to C but in general it may be >10 times slower when
        not caring too much about design or writing low level (see recent thread
        about md5 or one where David Roundy has said something about a matrix thread:
        only 10 times slower?)

        Using a garbage collector on database data (some hundred MB)
        might not be the optimal way because I feel you can tell exactly
        when you no longer need a piece of allocated memory here?
        So some time might be wasted.

        projects tend to run longer as expected.. And if data no longer
        fits into memory .. :(... -> bad performance
        I think systems such as postgresql do scale much better if you
        have some gbs of data and only use the most recent X records
        frequently.. So maybe you'll have to spend time later which
        you've won by using a haskell relational data representation in
        memory only.. Another solution: use clusters - I don't have any
        experience.

pro:
        much more safety (STM, type system ..) there are less
        possibilities making compared to C / PHP etc

Do you also think (A) is more interesting because some load (looking up
foreign keys) is moved on insert / delete and update operations taking
less time in but are called more frequently thus maybe reducing peak
load on queries?

Of course some time would have to be spend on queries wich might
look like this:
let queryresult = $(query ( tables + constraints + relations ) ) db
automatically generating the query function taking into account expected
index cardinality etc..

Any comments, suggestions, links to existing solutions (except coddfish,
haskelldb) ?

Marc Weber

draft
============= types represeting tables and db ========================
module RDMH.Types where
import Language.Haskell.TH
import Language.Haskell.TH.Syntax

data Uniqueness = Uniq | NotUniq deriving (Show, Eq)
data ModifyMode = InsertOnly | UpdateInsert | UpdateInsertDelete deriving (Show, Eq)

type TypeS = String -- a name of a data type (data A = ..)

data Index = I {
    uniqueness :: Uniqueness
    , key :: Exp  -- a fuction rec -> key
    , subkeys ::  [ Index ]
    , keyType ::  TypeS  -- the key type (this *should* be determined by the function.. But I don't know yet how to do it.. Probably using some GHC API magic 
    -- , expectedCardinality :: Int -- how should this look like exactly? This will be used to select the best indexes 
  } deriving (Show, Eq)

  -- Int is negative weight / cardinality. 
  -- If you issue a query having two contstraints on types beeing indexed the one having biggest cardinality is taken first
  -- first type:`

-- convinient function to get the type of a name - see TestCase.hs 
t = ConT . mkName -- type (either key or row) 
k = VarE . mkName -- gets the key 
-- app (VarE $ mkName "fst" )

-- a table representation 
data Table = Table {
      tableName :: String
    , row :: TypeS -- this row type must be an instance of Eq 
    , indexes :: [ Index ] --  [ ( [  mkName "Int", mkName "String" ]] is represented as Map Int ( Map String rec )
    -- , history :: Bool -- to be implemented. Each time a record is updated the original row should be saved somehow somewhere 
    , modify :: ModifyMode
    -- , insertTimeStamp :: Bool
    -- , updateTimeStamp :: Bool
  }
  deriving (Eq, Show)

data DB = DB {
    dbName :: String
  , tables :: [ Table ]
  , oneToN :: [ (Table, Table, String)  ] 
  -- String must be set if you want two relations to the same table (eg one ticket has foreign inbound and outbound flight)
  -- , nToM :: [ ( Table, Table ) ]
}
============= example use case =======================================

{-# OPTIONS_GHC -XTemplateHaskell #-}
module TestCase where
import Language.Haskell.TH
import RDMH.Types
import RDMH.TH
import Data.Time

-- the data 
data RCD = CD { title :: String
             , artist :: String
             , year :: Int
             }
data RTrack = Track {
               trackTitle :: String
             , recordingDate :: UTCTime 
             }

-- a track table with two indexes title and recordingDate 
trackT = Table "track" 
                 (t "RTrack") -- row type
                 -- indexes: 
                 [ I NotUniq (k "trackTitle" ) [] (t "String")
                 , I NotUniq (k "recordingDate" ) [] (t "UTCTime")
                 ]
                 UpdateInsertDelete
-- a cd table with two indexes title and year 
cdT = Table "cd" 
                 (t "RCD" ) -- row type 
                 -- indexes: 
                 [ I Uniq (k "title" ) [] (t "String")
                 , I NotUniq (k "year" ) [] (t "Int")
                 ]
                 UpdateInsertDelete

-- the db: both tables and a simple relation 
db = DB "cdDB"
        [ cdT, trackT ]
        [ OneToN cdT trackT "" ) ]

$(createDB db)