
2008/11/18 kenny lu
Dear all,
I am trying to implement the python-style dictionary in Haskell.
Python dictionary is a data structure that maps one key to one value. For instance, a python dictionary d = {'a':1, 'b':2 } maps key 'a' to 1, 'b' to 2. Python dictionary allows for update. e.g. the statement d['a'] = 3 changes the value pointed by 'a' from 1 to 3.
Internally, python dictionary is implemented using hash table.
My first attempt is to use Data.HashTable. However it was immediately abandoned, as I realize the memory usage is unreasonably huge.
...
I tested all three implementations by building a dictionary of size 1000000. The result shows that the Map and the Trie approaches handle collision well, but the IntMap approach does not.
Here is a comparison of memory usage
Map : 345 MB IntMap : 146 MB Trie : 282 MB Python : 94 MB
Here is a comparison of execution time (on an intel dual core 2.0G)
Map: 26 sec IntMap: 9 sec Trie: 12 sec Python: 2.24 sec
The above number shows that my implementations of python style dictionary are space/time in-efficient as compared to python.
Can some one point out what's wrong with my implementations?
First of all, you use Strings. That's a very bad thing when you care about memory restrictions. Fire up ghci type something like this:
let aas = replicate (1024*1024*10) 'a' -- 22 Mb memory usage length aas 10485760 -- 270 Mb memory usage 10 Mb string caused as much as 250 Mb increase in ghci's memory consumption.
My guess? Use hashtables with ByteStrings. I rewrote part of your code. Results are quite promising. Haskell: 121 Mb total memory in use INIT time 0.02s ( 0.00s elapsed) MUT time 0.84s ( 1.00s elapsed) GC time 1.97s ( 2.02s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 2.83s ( 3.02s elapsed) %GC time 69.6% (66.8% elapsed) Python: $ time python dict.py 256 real 0m2.278s user 0m2.233s sys 0m0.078s memory: 101 Mb (as reported by Windows' Task Manager). The code: --- cut --- import qualified Data.ByteString.Lazy.Char8 as BS import Data.Int import Data.Bits (...) parse_a_line_BS :: BS.ByteString -> (BS.ByteString,Int) parse_a_line_BS line = case BS.words line of [key,val] -> (key,(read . BS.unpack) val) _ -> error " parse error. " main :: IO () main = do dict <- HT.new (==) hashByteString indata <- (map parse_a_line_BS `fmap` BS.lines `fmap` BS.readFile "in.txt") mapM_ (\ (k,v) -> HT.insert dict k v) indata HT.lookup dict (BS.pack "key256") >>= \v -> case v of Just vv -> putStrLn (show vv) Nothing -> putStrLn ("Not found") -- derived from Data.HashTable.hashString hashByteString :: BS.ByteString -> Int32 hashByteString = BS.foldl' f golden where f m c = fromIntegral (ord c) * magic + hashInt32 m magic = 0xdeadbeef hashInt32 :: Int32 -> Int32 hashInt32 x = mulHi x golden + x mulHi :: Int32 -> Int32 -> Int32 mulHi a b = fromIntegral (r `shiftR` 32) where r :: Int64 r = fromIntegral a * fromIntegral b golden :: Int32 golden = 1013904242 -- = round ((sqrt 5 - 1) * 2^32) :: Int32 --- cut --- I had to rewrite hashString to work for ByteStrings - basically it's just using different foldl'. All best Christopher Skrzętnicki