Dear all,
I am trying to implement the python-style dictionary in Haskell.
Python dictionary is a data structure that maps one key to one value.
For instance, a python dictionary
d = {'a':1, 'b':2 }
maps key 'a' to 1, 'b' to 2.
Python dictionary allows for update. e.g. the statement
d['a'] = 3
changes the value pointed by 'a' from 1 to 3.
Internally, python dictionary is implemented using hash table.
My first attempt is to use Data.HashTable. However it was immediately abandoned, as I realize the memory usage is unreasonably huge.
== SECOND ATTEMPT ==
My second attempt is to use Data.Map
{-# OPTIONS_GHC -fglasgow-exts #-}
module Main where
import qualified Data.HashTable as HT
import qualified Data.IntMap as IM
import qualified Data.Map as DM
import qualified Data.ByteString.Char8 as S
import Data.Char
-- the Dict type class
class Dict d k v | d -> k v where
empty :: d
insert :: k -> v -> d -> d
lookup :: k -> d -> Maybe v
update :: k -> v -> d -> d
-- Let's use string as key
type Key = String
-- insert key-value pairs into a dictionary
fromList :: Dict d k a => [(k,a)] -> d
fromList l =
foldl (\d (key,val) -> insert key val d) empty l
instance Dict (DM.Map S.ByteString a) Key a where
empty = DM.empty
insert key val dm =
let packed_key = S.pack key
in DM.insert packed_key val dm
lookup key dm =
let packed_key = S.pack key
in DM.lookup packed_key dm
update key val dm =
let packed_key = S.pack key
in DM.update (\x -> Just val) packed_key dm
Which kinda works, however since Map is implemented using a balanced tree, therefore,
when as the dictionary grows, it takes a long time to insert new key-value pair.
== THIRD ATTEMPT ==
My third attempt is to use Data.IntMap
-- an implementation of Dict using IntMap
instance Dict (IM.IntMap a) Key a where
empty = IM.empty
insert key val im =
let int_key = fromIntegral (HT.hashString key)
in IM.insert int_key val im
lookup key im =
let int_key = fromIntegral (HT.hashString key)
in IM.lookup int_key im
update key val im =
let int_key = fromIntegral (HT.hashString key)
in IM.update (\x -> Just val) int_key im
This implementation is faster than the Map approach, however this implementation
can't handle collision well, two keys which are hashed into the same integer will overwrite each other.
== FOURTH ATTEMPT ==
My fourth implementation is to use Trie. The idea is to split a string (a key) into
a list of 4-character chunks. Each chunk can be mapped into a 32-bit integer without collision.
We then insert the value with this list of chunks into the Trie.
-- an implementation of Dict using Trie
instance Dict (Trie a) Key a where
empty = emptyTrie
insert key val trie =
let key_chain = chain key
in insertTrie key_chain val trie
lookup key trie =
let key_chain = chain key
in lookupTrie key_chain trie
update key val trie =
let key_chain = chain key
in updateTrie key_chain val trie
-- an auxillary function that "splits" string into small pieces,
-- 4 characters per piece, 4 chars = 32 bit
chain :: Key -> [Key]
chain k | length k > 4 = let (k',ks) = splitAt 4 k
in (k':chain ks)
| otherwise = [k]
-- a collision-free hash function which turns four chars into Int32
safehash :: [Char] -> Int
safehash cs | length cs > 4 = error "safehash failed."
| otherwise =
sum [ (ord c)*(256^i) | (c,i) <- zip cs [0..3] ]
-- a trie datatype
data Trie a = Trie [a] (IM.IntMap (Trie a))
-- the empty trie
emptyTrie = Trie [] (IM.empty)
-- insert value into the trie
insertTrie :: [String] -> a -> Trie a -> Trie a
insertTrie [] i (Trie is maps) = Trie (i:is) maps
insertTrie (word:words) i (Trie is maps) =
let key = safehash word
in case IM.lookup key maps of
{ Just trie -> let trie' = insertTrie words i trie
maps' = IM.update (\x -> Just trie') key maps
in Trie is maps'
; Nothing -> let trie = emptyTrie
trie' = insertTrie words i trie
maps' = IM.insert key trie' maps
in Trie is maps'
}
-- lookup value from the trie
lookupTrie :: [String] -> Trie a -> Maybe a
lookupTrie [] (Trie vs _) =
case vs of
[] -> Nothing
(x:_) -> Just x
lookupTrie (word:words) (Trie is maps) =
let key = safehash word
in case IM.lookup key maps of
Just trie -> lookupTrie words trie
Nothing -> Nothing
-- update the trie with the given value.
updateTrie :: [String] -> a -> Trie a -> Trie a
-- we only update the first value and leave the rest unchanged.
updateTrie [] y (Trie (x:xs) maps) = Trie (y:xs) maps
updateTrie (word:words) v (Trie is maps) =
let key = safehash word
in case IM.lookup key maps of
Just trie -> let trie' = updateTrie words v trie
maps' = IM.update (\x -> Just trie') key maps
in Trie is maps'
Nothing -> Trie is maps
== BENCH MARK ==
I have a main function which builds a dictionary from a text file.
Each line of the file is a key-value pair separated by a space.
e.g.
key1 1
key2 2
...
main :: IO ()
main = do { content <- readFile "in.txt"
; let -- change this following type annotation
-- to change different type of the dictionary
-- dict :: DM.Map S.ByteString Int
-- dict :: IM.IntMap Int
dict :: Trie Int
dict = fromList (map parse_a_line (lines content))
; case Main.lookup "key256" dict of
{ Just v -> putStrLn (show v)
; Nothing -> putStrLn "Not found"
}
-- read a line here so that we can pause the program
-- and look at the memory usage.
; v <- readLn
; putStrLn v
}
where parse_a_line :: String -> (Key,Int)
parse_a_line line = case words line of
[key,val] -> (key,read val)
_ -> error " parse error. "
I tested all three implementations by building a dictionary of size 1000000.
The result shows that the Map and the Trie approaches handle collision well, but
the IntMap approach does not.
Here is a comparison of memory usage
Map : 345 MB
IntMap : 146 MB
Trie : 282 MB
Python : 94 MB
Here is a comparison of execution time (on an intel dual core 2.0G)
Map: 26 sec
IntMap: 9 sec
Trie: 12 sec
Python: 2.24 sec
The above number shows that my implementations of python style dictionary are space/time in-efficient as compared to python.
Can some one point out what's wrong with my implementations?
I've attached my code in the tgz file.
Cheers,
Kenny