
Which version of GHC and which version of the Data.ByteString library? There was an inlining bug related to Data.Map /Data.IntMap performance fixed between the 6.8.x release and the current bytestring release. In testing, Data.Map with strict bytestring keys matched the python (C implemented) dictionary, after I fixed the inlining for word lookups. You'll need to be using bytestring 0.9.1.x though. -- Don haskellmail:
Dear all,
I am trying to implement the python-style dictionary in Haskell.
Python dictionary is a data structure that maps one key to one value. For instance, a python dictionary d = {'a':1, 'b':2 } maps key 'a' to 1, 'b' to 2. Python dictionary allows for update. e.g. the statement d['a'] = 3 changes the value pointed by 'a' from 1 to 3.
Internally, python dictionary is implemented using hash table.
My first attempt is to use Data.HashTable. However it was immediately abandoned, as I realize the memory usage is unreasonably huge.
== SECOND ATTEMPT == My second attempt is to use Data.Map
{-# OPTIONS_GHC -fglasgow-exts #-} module Main where
import qualified Data.HashTable as HT import qualified Data.IntMap as IM import qualified Data.Map as DM import qualified Data.ByteString.Char8 as S import Data.Char
-- the Dict type class class Dict d k v | d -> k v where empty :: d insert :: k -> v -> d -> d lookup :: k -> d -> Maybe v update :: k -> v -> d -> d
-- Let's use string as key type Key = String
-- insert key-value pairs into a dictionary fromList :: Dict d k a => [(k,a)] -> d fromList l = foldl (\d (key,val) -> insert key val d) empty l
instance Dict (DM.Map S.ByteString a) Key a where empty = DM.empty insert key val dm = let packed_key = S.pack key in DM.insert packed_key val dm lookup key dm = let packed_key = S.pack key in DM.lookup packed_key dm update key val dm = let packed_key = S.pack key in DM.update (\x -> Just val) packed_key dm
Which kinda works, however since Map is implemented using a balanced tree, therefore, when as the dictionary grows, it takes a long time to insert new key-value pair.
== THIRD ATTEMPT == My third attempt is to use Data.IntMap
-- an implementation of Dict using IntMap instance Dict (IM.IntMap a) Key a where empty = IM.empty insert key val im = let int_key = fromIntegral (HT.hashString key) in IM.insert int_key val im lookup key im = let int_key = fromIntegral (HT.hashString key) in IM.lookup int_key im update key val im = let int_key = fromIntegral (HT.hashString key) in IM.update (\x -> Just val) int_key im
This implementation is faster than the Map approach, however this implementation can't handle collision well, two keys which are hashed into the same integer will overwrite each other.
== FOURTH ATTEMPT ==
My fourth implementation is to use Trie. The idea is to split a string (a key) into a list of 4-character chunks. Each chunk can be mapped into a 32-bit integer without collision. We then insert the value with this list of chunks into the Trie.
-- an implementation of Dict using Trie instance Dict (Trie a) Key a where empty = emptyTrie insert key val trie = let key_chain = chain key in insertTrie key_chain val trie lookup key trie = let key_chain = chain key in lookupTrie key_chain trie update key val trie = let key_chain = chain key in updateTrie key_chain val trie
-- an auxillary function that "splits" string into small pieces, -- 4 characters per piece, 4 chars = 32 bit chain :: Key -> [Key] chain k | length k > 4 = let (k',ks) = splitAt 4 k in (k':chain ks) | otherwise = [k]
-- a collision-free hash function which turns four chars into Int32 safehash :: [Char] -> Int safehash cs | length cs > 4 = error "safehash failed." | otherwise = sum [ (ord c)*(256^i) | (c,i) <- zip cs [0..3] ]
-- a trie datatype data Trie a = Trie [a] (IM.IntMap (Trie a))
-- the empty trie emptyTrie = Trie [] (IM.empty)
-- insert value into the trie insertTrie :: [String] -> a -> Trie a -> Trie a insertTrie [] i (Trie is maps) = Trie (i:is) maps insertTrie (word:words) i (Trie is maps) = let key = safehash word in case IM.lookup key maps of { Just trie -> let trie' = insertTrie words i trie maps' = IM.update (\x -> Just trie') key maps in Trie is maps' ; Nothing -> let trie = emptyTrie trie' = insertTrie words i trie maps' = IM.insert key trie' maps in Trie is maps' }
-- lookup value from the trie lookupTrie :: [String] -> Trie a -> Maybe a lookupTrie [] (Trie vs _) = case vs of [] -> Nothing (x:_) -> Just x lookupTrie (word:words) (Trie is maps) = let key = safehash word in case IM.lookup key maps of Just trie -> lookupTrie words trie Nothing -> Nothing
-- update the trie with the given value. updateTrie :: [String] -> a -> Trie a -> Trie a -- we only update the first value and leave the rest unchanged. updateTrie [] y (Trie (x:xs) maps) = Trie (y:xs) maps updateTrie (word:words) v (Trie is maps) = let key = safehash word in case IM.lookup key maps of Just trie -> let trie' = updateTrie words v trie maps' = IM.update (\x -> Just trie') key maps in Trie is maps' Nothing -> Trie is maps
== BENCH MARK ==
I have a main function which builds a dictionary from a text file. Each line of the file is a key-value pair separated by a space.
e.g.
key1 1 key2 2 ...
main :: IO () main = do { content <- readFile "in.txt" ; let -- change this following type annotation -- to change different type of the dictionary -- dict :: DM.Map S.ByteString Int -- dict :: IM.IntMap Int dict :: Trie Int dict = fromList (map parse_a_line (lines content)) ; case Main.lookup "key256" dict of { Just v -> putStrLn (show v) ; Nothing -> putStrLn "Not found" } -- read a line here so that we can pause the program -- and look at the memory usage. ; v <- readLn ; putStrLn v } where parse_a_line :: String -> (Key,Int) parse_a_line line = case words line of [key,val] -> (key,read val) _ -> error " parse error. "
I tested all three implementations by building a dictionary of size 1000000. The result shows that the Map and the Trie approaches handle collision well, but the IntMap approach does not.
Here is a comparison of memory usage
Map : 345 MB IntMap : 146 MB Trie : 282 MB Python : 94 MB
Here is a comparison of execution time (on an intel dual core 2.0G)
Map: 26 sec IntMap: 9 sec Trie: 12 sec Python: 2.24 sec
The above number shows that my implementations of python style dictionary are space/time in-efficient as compared to python.
Can some one point out what's wrong with my implementations?
I've attached my code in the tgz file.
Cheers, Kenny
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe