
dons:
wren:
> Neil Mitchell wrote:
>> 2) The storage for String seems to be raw strings, which is nice. Would I
>> get a substantial speedup by moving to bytestrings instead of strings? If I
>> hashed the strings and stored common ones in a hash table, is it likely to
>> be a big win?
>
> Bytestrings should help. The big wins in this application are likely to be
> cache issues, though the improved memory/GC overhead is nice too.
Here's a quick demo using Data.Binary directly.
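The encoding half of the demo isn't reproduced here; a minimal sketch of what it might have looked like, assuming we count duplicate lines in a word list, build a Map ByteString Int, and write it with encodeFile to the "dict" file used in the timings below:

    import Data.Binary
    import Data.List
    import qualified Data.ByteString.Char8 as B
    import System.Environment
    import qualified Data.Map as M

    -- Count how often each line occurs and serialise the resulting Map
    -- to the file "dict" (file name taken from the timings below).
    main = do
        [f] <- getArgs
        s   <- B.readFile f
        let m = M.fromList [ (head n, length n) | n <- (group . B.lines $ s) ]
        encodeFile "dict" (m :: M.Map B.ByteString Int)
        print "done"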
Now, let's read it back in and decode it to a Map:
    main = do
        [f] <- getArgs
        m   <- decodeFile f
        print (M.size (m :: M.Map B.ByteString Int))
        print "done"
Easy enough:
$ time ./A dict +RTS -K20M
52848
"done"
./A dict +RTS -K20M  1.51s user 0.06s system 99% cpu 1.582 total
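The step that produces the compressed dictionary isn't shown; a plausible sketch, assuming the same Map is run through Codec.Compression.GZip's compress on its way out to the "dict.gz" file read below:

    import Data.Binary
    import Data.List
    import qualified Data.ByteString.Char8 as B
    import qualified Data.ByteString.Lazy as L
    import System.Environment
    import qualified Data.Map as M
    import Codec.Compression.GZip

    -- Encode the Map to a lazy ByteString, gzip it, and write it out.
    main = do
        [f] <- getArgs
        s   <- B.readFile f
        let m = M.fromList [ (head n, length n) | n <- (group . B.lines $ s) ]
        L.writeFile "dict.gz" . compress . encode $ (m :: M.Map B.ByteString Int)
        print "done"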
The compressed dictionary is much smaller. Let's load it back in and unpickle it:
    main = do
        [f] <- getArgs
        m <- (decode . decompress) `fmap` L.readFile f
        print (M.size (m :: M.Map B.ByteString Int))
        print "done"
Also cute. But how does it run:
$ time ./A dict.gz
52848
"done"
./A dict.gz  0.28s user 0.03s system 98% cpu 0.310 total
Interesting. So extracting the Map from a compressed bytestring in memory is a fair bit faster than loading it directly, uncompressed from disk.
Note that the difference, as Duncan and Bulat pointed out, is a bit surprising. Perhaps the Map instance is a bit weird? We already know that bytestring IO is fine. Just serialising straight lists of pairs,

    import Data.Binary
    import Data.List
    import qualified Data.ByteString.Char8 as B
    import qualified Data.ByteString.Lazy as L
    import System.Environment
    import qualified Data.Map as M
    import Codec.Compression.GZip

    main = do
        [f] <- getArgs
        s   <- B.readFile f
        let m = [ (head n, length n) | n <- (group . B.lines $ s) ]
        L.writeFile "dict" . encode $ m
        print "done"

$ time ./binary /usr/share/dict/cracklib-small
"done"
./binary /usr/share/dict/cracklib-small  0.13s user 0.04s system 99% cpu 0.173 total

$ du -hs dict
1.3M    dict

And reading them back in,

    main = do
        [f] <- getArgs
        m <- decode `fmap` L.readFile f
        print (length (m :: [(B.ByteString,Int)]))
        print "done"

$ time ./binary dict
52848
"done"
./binary dict  0.04s user 0.01s system 99% cpu 0.047 total

That's fast. So there's some complication in the Map serialisation. Adding in zlib, to check,

    main = do
        [f] <- getArgs
        s   <- B.readFile f
        let m = [ (head n, length n) | n <- (group . B.lines $ s) ]
        L.writeFile "dict.gz" . compress . encode $ m
        print "done"

$ time ./binary /usr/share/dict/cracklib-small
"done"
./binary /usr/share/dict/cracklib-small  0.25s user 0.03s system 100% cpu 0.277 total

Compression takes longer, as expected, and reading it back in,

    main = do
        [f] <- getArgs
        m <- (decode . decompress) `fmap` L.readFile f
        print (length (m :: [(B.ByteString,Int)]))
        print "done"

$ time ./binary dict.gz
52848
"done"
./binary dict.gz  0.03s user 0.01s system 98% cpu 0.040 total

That's about the same. Looks like the Map serialisation via association lists could do with further work. Anyone want to dig around in the Map instance? (There are also some patches for an alternative lazy Map serialisation, if people are keen to load maps lazily -- happstack devs?)
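For anyone who does want to dig in: the instance goes through the association list, and a rough sketch of that shape is below, written as standalone putMap/getMap helpers for illustration. This is a reconstruction under assumptions, not the exact code in the binary package.

    import Data.Binary
    import qualified Data.Map as M

    -- Write the size, then each key/value pair in ascending key order.
    -- This layout is wire-compatible with the Binary instance for [(k, e)].
    putMap :: (Binary k, Binary e) => M.Map k e -> Put
    putMap m = put (M.size m) >> mapM_ put (M.toAscList m)

    -- Read the pairs back as a list and rebuild the Map. The rebuild via
    -- fromDistinctAscList, and how strictly it gets forced, is the obvious
    -- place to look for the cost over decoding a plain list.
    getMap :: (Binary k, Binary e) => Get (M.Map k e)
    getMap = M.fromDistinctAscList `fmap` get

Note also that the Map version above needed +RTS -K20M to run, which suggests the decode is building a deep unevaluated structure before forcing it.

-- Don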