Re: [Haskell-cafe] implementing python-style dictionary in Haskell

18 Nov 2008

      2008/11/18 kenny lu :
...
Dear all,
I am trying to implement the python-style dictionary in Haskell.
Python dictionary is a data structure that maps one key to one value.
For instance, a python dictionary
d = {'a':1, 'b':2 }
maps key 'a' to 1, 'b' to 2.
Python dictionary allows for update. e.g. the statement
d['a'] = 3
changes the value pointed by 'a' from 1 to 3.
Internally, python dictionary is implemented using hash table.
My first attempt is to use Data.HashTable. However it was immediately
abandoned, as I realize the memory usage is unreasonably huge.
...
I tested all three implementations by building a dictionary of size 1000000.
The result shows that the Map and the Trie approaches handle collision well,
but
the IntMap approach does not.
Here is a comparison of memory usage
Map     : 345 MB
IntMap : 146 MB
Trie     : 282 MB
Python : 94 MB
Here is a comparison of execution time (on an intel dual core 2.0G)
Map: 26 sec
IntMap: 9 sec
Trie: 12 sec
Python: 2.24 sec
The above number shows that my implementations of python style dictionary
are space/time in-efficient as compared to python.
Can some one point out what's wrong with my implementations?
First of all, you use Strings. That's a very bad thing when you care
about memory restrictions. Fire up ghci type something like this:
...
let aas = replicate (1024*1024*10) 'a'
-- 22 Mb memory usage
length aas
10485760
-- 270 Mb memory usage
10 Mb string caused as much as 250 Mb increase in ghci's memory consumption.
My guess? Use hashtables with ByteStrings.
I rewrote part of your code. Results are quite promising.

Haskell:
121 Mb total memory in use

 INIT  time    0.02s  (  0.00s elapsed)
 MUT   time    0.84s  (  1.00s elapsed)
 GC    time    1.97s  (  2.02s elapsed)
 EXIT  time    0.00s  (  0.00s elapsed)
 Total time    2.83s  (  3.02s elapsed)

 %GC time      69.6%  (66.8% elapsed)
Python:
$ time python dict.py
256

real    0m2.278s
user    0m2.233s
sys     0m0.078s

memory: 101 Mb (as reported by Windows' Task Manager).

The code:

--- cut ---

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Int
import Data.Bits

(...)

parse_a_line_BS :: BS.ByteString -> (BS.ByteString,Int)
parse_a_line_BS line = case BS.words line of
		      [key,val] -> (key,(read . BS.unpack) val)
		      _ -> error " parse error.  "

main :: IO ()
main = do
  dict <- HT.new (==) hashByteString
  indata <- (map parse_a_line_BS `fmap` BS.lines `fmap` BS.readFile "in.txt")
  mapM_ (\ (k,v) -> HT.insert dict k v) indata
  HT.lookup dict (BS.pack "key256") >>= \v -> case v of
                                                Just vv -> putStrLn (show vv)
                                                Nothing -> putStrLn
("Not found")

-- derived from Data.HashTable.hashString
hashByteString :: BS.ByteString -> Int32
hashByteString = BS.foldl' f golden
   where f m c = fromIntegral (ord c) * magic + hashInt32 m
         magic = 0xdeadbeef

hashInt32 :: Int32 -> Int32
hashInt32 x = mulHi x golden + x

mulHi :: Int32 -> Int32 -> Int32
mulHi a b = fromIntegral (r `shiftR` 32)
   where r :: Int64
         r = fromIntegral a * fromIntegral b

golden :: Int32
golden = 1013904242 -- = round ((sqrt 5 - 1) * 2^32) :: Int32

--- cut ---

I had to rewrite hashString to work for ByteStrings - basically it's
just using different foldl'.

All best

Christopher Skrzętnicki

Re: [Haskell-cafe] implementing python-style dictionary in Haskell

Krzysztof Skrzętnicki