
wren:
> Neil Mitchell wrote:
> > 2) The storage for String seems to be raw strings, which is nice.
> > Would I get a substantial speedup by moving to bytestrings instead of
> > strings? If I hashed the strings and stored common ones in a hash
> > table is it likely to be a big win?
>
> Bytestrings should help. The big wins in this application are likely to
> be cache issues, though the improved memory/GC overhead is nice too.
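(As an aside on the hash-table idea in the quote above: one cheap way to share storage for repeated words is to intern each ByteString through a Map, so duplicates point at a single copy. This is only an illustrative sketch, not code from the thread, and the intern/internAll names are made up.)

    import qualified Data.ByteString.Char8 as B
    import qualified Data.Map as M

    -- Hypothetical sketch: keep one canonical copy of each distinct word,
    -- so repeated occurrences share a single ByteString in memory.
    intern :: B.ByteString -> M.Map B.ByteString B.ByteString
           -> (B.ByteString, M.Map B.ByteString B.ByteString)
    intern w tbl = case M.lookup w tbl of
        Just w' -> (w', tbl)              -- seen before: reuse the stored copy
        Nothing -> (w, M.insert w w tbl)  -- first occurrence: remember it

    -- Intern every word in a list, threading the table through.
    internAll :: [B.ByteString] -> [B.ByteString]
    internAll = go M.empty
      where
        go _   []     = []
        go tbl (w:ws) = let (w', tbl') = intern w tbl
                        in  w' : go tbl' ws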
Here's a quick demo using Data.Binary directly.

First, let's read in the dictionary file and build a big, worst-case finite map of words to their occurrence count (always 1):

    import Data.Binary
    import Data.List
    import qualified Data.ByteString.Char8 as B
    import System.Environment
    import qualified Data.Map as M

    main = do
        [f] <- getArgs
        s   <- B.readFile f
        let m = M.fromList [ (head n, length n) | n <- group . B.lines $ s ]
        encodeFile "dict" m
        print "done"

So that writes a "dict" file with a binary encoded Map ByteString Int. Using ghc -O2 --make for everything.

    $ time ./A /usr/share/dict/cracklib-small
    "done"
    ./A /usr/share/dict/cracklib-small  0.28s user 0.03s system 94% cpu 0.331 total

Yields a dictionary file holding the Map:

    $ du -hs dict
    1.3M    dict

Now, let's read it back in and decode it to a Map:

    -- reuses the imports from the encoder above
    main = do
        [f] <- getArgs
        m   <- decodeFile f
        print (M.size (m :: M.Map B.ByteString Int))
        print "done"

Easy enough:

    $ time ./A dict +RTS -K20M
    52848
    "done"
    ./A dict +RTS -K20M  1.51s user 0.06s system 99% cpu 1.582 total

Ok. So 1.5s to decode a 1.3M Map. There may be better ways to build the Map, since we know the input will be sorted, but the Data.Binary instance can't exploit that (see the sketch at the end for one alternative on the encoding side).

Since decode and encode are nice pure functions on lazy bytestrings, we can try the trick of compressing/decompressing the dictionary in memory. Compressing the dictionary:

    import Data.Binary
    import Data.List
    import qualified Data.ByteString.Char8 as B
    import qualified Data.ByteString.Lazy as L
    import System.Environment
    import qualified Data.Map as M
    import Codec.Compression.GZip

    main = do
        [f] <- getArgs
        s   <- B.readFile f
        let m = M.fromList [ (head n, length n) | n <- group . B.lines $ s ]
        L.writeFile "dict.gz" . compress . encode $ m
        print "done"

Pretty cool, imo, is "compress . encode":

    $ time ./A /usr/share/dict/cracklib-small
    "done"
    ./A /usr/share/dict/cracklib-small  0.38s user 0.02s system 85% cpu 0.470 total

Ok. So building a compressed dictionary takes only a bit longer than an uncompressed one (zlib is fast):

    $ du -hs dict.gz
    216K    dict.gz

The compressed dictionary is much smaller. Let's load it back in and unpickle it:

    -- reuses the imports from the compressing encoder above
    main = do
        [f] <- getArgs
        m   <- (decode . decompress) `fmap` L.readFile f
        print (M.size (m :: M.Map B.ByteString Int))
        print "done"

Also cute. But how does it run:

    $ time ./A dict.gz
    52848
    "done"
    ./A dict.gz  0.28s user 0.03s system 98% cpu 0.310 total

Interesting. So extracting the Map from a compressed bytestring in memory is a fair bit faster than loading it directly, uncompressed, from disk.

Neil, does that give you some ballpark figures to work with?

-- Don