New subject: Re[2]: implementing python-style dictionary in Haskell

18 Nov 2008

      Dear all,

I am trying to implement the python-style dictionary in Haskell.

Python dictionary is a data structure that maps one key to one value.
For instance, a python dictionary
d = {'a':1, 'b':2 }
maps key 'a' to 1, 'b' to 2.
Python dictionary allows for update. e.g. the statement
d['a'] = 3
changes the value pointed by 'a' from 1 to 3.

Internally, python dictionary is implemented using hash table.

My first attempt is to use Data.HashTable. However it was immediately
abandoned, as I realize the memory usage is unreasonably huge.

== SECOND ATTEMPT ==
My second attempt is to use Data.Map

{-# OPTIONS_GHC -fglasgow-exts #-}
module Main where

import qualified Data.HashTable as HT
import qualified Data.IntMap as IM
import qualified Data.Map as DM
import qualified Data.ByteString.Char8 as S
import Data.Char

-- the Dict type class
class Dict d k v | d -> k v where
    empty :: d
    insert :: k -> v -> d -> d
    lookup :: k -> d -> Maybe v
    update :: k -> v -> d -> d

-- Let's use string as key
type Key = String

-- insert key-value pairs into a dictionary
fromList :: Dict d k a => [(k,a)] -> d
fromList l =
    foldl (\d (key,val) -> insert key val d) empty l

instance Dict (DM.Map S.ByteString a) Key a where
    empty = DM.empty
    insert key val dm =
    let packed_key = S.pack key
    in DM.insert packed_key val dm
    lookup key dm =
    let packed_key = S.pack key
        in DM.lookup packed_key dm
    update key val dm =
    let packed_key = S.pack key
        in DM.update (\x -> Just val) packed_key dm

Which kinda works, however since Map is implemented using a balanced tree,
therefore,
when as the dictionary grows, it takes a long time to insert new key-value
pair.

== THIRD ATTEMPT ==
My third attempt is to use Data.IntMap

-- an implementation of Dict using IntMap
instance Dict (IM.IntMap a) Key a where
    empty = IM.empty
    insert key val im =
    let int_key = fromIntegral (HT.hashString key)
    in IM.insert int_key val im
    lookup key im =
    let int_key = fromIntegral (HT.hashString key)
    in IM.lookup int_key im
    update key val im =
    let int_key = fromIntegral (HT.hashString key)
        in IM.update (\x -> Just val) int_key im

This implementation is faster than the Map approach, however this
implementation
can't handle collision well, two keys which are hashed into the same integer
will overwrite each other.

== FOURTH ATTEMPT ==

My fourth implementation is to use Trie. The idea is to split a string (a
key) into
a list of 4-character chunks. Each chunk can be mapped into a 32-bit integer
without collision.
We then insert the value with this list of chunks into the Trie.

-- an implementation of Dict using Trie
instance Dict (Trie a) Key a where
    empty = emptyTrie
    insert key val trie =
    let key_chain = chain key
    in insertTrie key_chain val trie
    lookup key trie =
    let key_chain = chain key
        in lookupTrie key_chain trie
    update key val trie =
    let key_chain = chain key
        in updateTrie key_chain val trie

-- an auxillary function that "splits" string into small pieces,
-- 4 characters per piece, 4 chars = 32 bit
chain :: Key -> [Key]
chain k | length k > 4 = let (k',ks) = splitAt 4 k
             in (k':chain ks)
    | otherwise    = [k]

-- a collision-free hash function which turns four chars into Int32
safehash :: [Char] -> Int
safehash cs | length cs > 4 = error "safehash failed."
        | otherwise =
        sum [ (ord c)*(256^i)   | (c,i) <- zip cs [0..3] ]

-- a trie datatype
data Trie a = Trie [a] (IM.IntMap (Trie a))

-- the empty trie
emptyTrie = Trie [] (IM.empty)

-- insert value into the trie
insertTrie :: [String] -> a -> Trie a -> Trie a
insertTrie [] i (Trie is maps) = Trie (i:is) maps
insertTrie (word:words) i (Trie is maps) =
    let key = safehash word
    in case IM.lookup key maps of
     { Just trie -> let trie' = insertTrie words i trie
                maps' = IM.update (\x -> Just trie') key maps
            in Trie is maps'
     ; Nothing -> let trie = emptyTrie
              trie' = insertTrie words i trie
              maps' = IM.insert key trie' maps
              in Trie is maps'
     }

-- lookup value from the trie
lookupTrie :: [String] -> Trie a -> Maybe a
lookupTrie [] (Trie vs _) =
    case vs of
      [] -> Nothing
      (x:_) -> Just x
lookupTrie (word:words) (Trie is maps) =
    let key = safehash word
    in case IM.lookup key maps of
       Just trie -> lookupTrie words trie
       Nothing   -> Nothing

-- update the trie with the given value.
updateTrie :: [String] -> a -> Trie a -> Trie a
-- we only update the first value and leave the rest unchanged.
updateTrie [] y (Trie (x:xs) maps) = Trie (y:xs) maps
updateTrie (word:words) v  (Trie is maps) =
    let key = safehash word
    in case IM.lookup key maps of
       Just trie -> let trie' = updateTrie words v trie
                maps'  = IM.update (\x -> Just trie') key maps
            in Trie is maps'
       Nothing   -> Trie is maps

== BENCH MARK ==

I have a main function which builds a dictionary from a text file.
Each line of the file is a key-value pair separated by a space.

e.g.

key1 1
key2 2
...

main :: IO ()
main = do { content <- readFile "in.txt"
      ; let -- change this following type annotation
            -- to change different type of the dictionary
            -- dict :: DM.Map S.ByteString Int
            -- dict :: IM.IntMap Int
            dict :: Trie Int
            dict = fromList (map parse_a_line  (lines content))
      ; case Main.lookup "key256" dict of
        { Just v -> putStrLn (show v)
        ; Nothing -> putStrLn "Not found"
        }
        -- read a line here so that we can pause the program
            -- and look at the memory usage.
      ; v <- readLn
      ; putStrLn v
      }
    where  parse_a_line :: String -> (Key,Int)
           parse_a_line line = case words line of
                  [key,val] -> (key,read val)
                  _ -> error " parse error.  "

I tested all three implementations by building a dictionary of size 1000000.
The result shows that the Map and the Trie approaches handle collision well,
but
the IntMap approach does not.

Here is a comparison of memory usage

Map     : 345 MB
IntMap : 146 MB
Trie     : 282 MB
Python : 94 MB

Here is a comparison of execution time (on an intel dual core 2.0G)

Map: 26 sec
IntMap: 9 sec
Trie: 12 sec
Python: 2.24 sec

The above number shows that my implementations of python style dictionary
are space/time in-efficient as compared to python.

Can some one point out what's wrong with my implementations?

I've attached my code in the tgz file.

Cheers,
Kenny

implementing python-style dictionary in Haskell

kenny lu

Jason Dusek

kenny lu

Bulat Ziganshin

kenny lu

Bulat Ziganshin

Claus Reinke

Tillmann Rendel

Bulat Ziganshin

Jan-Willem Maessen

Ketil Malde

Claus Reinke

Krzysztof Skrzętnicki

Don Stewart

kenny lu

Don Stewart

David Menendez

Luke Palmer

David Menendez

Ryan Ingram

Luke Palmer

Janis Voigtlaender

Ryan Ingram

Jason Dusek

Ryan Ingram

Janis Voigtlaender

Don Stewart

David Menendez

Don Stewart

tags

participants (13)