
I was hoping someone could direct me to material on how I might go about optimizing a small parsing program I wrote. It is used for calculating differences between two files that each store a hash of hashes: every line is one hash, made up of "key=value" pairs separated by commas. The bottleneck seems to be getting the data into memory (in a usable structure) quickly, as I'm working with extremely large files (>250,000 lines and 100MB+ is normal, some are considerably larger, and I often have to process a large number of them).
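For concreteness, each line looks something like the following (the values and most of the field names here are invented for illustration; the only field names the code below actually relies on are Type, LogOffset, and LogID):

Type=Data,LogOffset=1016,LogID=4F2A,Field1=value1,Field2=value2,...,FieldN=valueN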
Extracting just the parsing elements of my current Python implementation:

FILE: compare.py

import sys

def parse(lines):
    print "Parsing", len(lines), "lines..."
    failed = 0
    hashes = {}
    for line in lines:
        try:
            hash = {}
            parts = line.split(",")
            for part in parts:
                k, v = part.split("=")
                hash[k] = v
            hashes[ hash["LogOffset"] ] = hash
        except:
            if line != "":
                failed += 1
    if failed > 0:
        print "[ERROR] Failed to parse:", failed, "lines"
    print "...Parsing resulted in", len(hashes), "unique hashes"
    return hashes
def normalize(lines):
    lineset = set()
    for line in lines:
        if "Type=Heartbeat" == line[0:14]:
            pass
        elif line == "":
            pass
        else:
            lineset.add(line)  # use set to get only uniques
    return lineset
def main():
    hashes = parse( normalize( open(sys.argv[1]).readlines() ) )
if __name__ == '__main__':
    main()
$ time python compare.py 38807lineFile
Removed 52 bad lines
Parsing 38807 lines...
...Parsing resulted in 38807 unique hashes

real    0m3.807s
user    0m3.330s
sys     0m0.470s

$ time python compare.py 255919lineFile
Removed 0 bad lines
Parsing 255919 lines...
...Parsing resulted in 255868 unique hashes

real    0m30.889s
user    0m23.970s
sys     0m2.900s

*note: profiling shows over 7.1 million calls to the split() function, which should give a good idea of how many pairs the files contain.

Once you factor in the much larger file sizes, actually performing the analysis, and running it on a few dozen files, my tests started to become quite time consuming (not to mention it takes 1GB of memory just for the 250K-line files, although execution time is still much more important at the moment, RAM being cheap). So I figured I'd rewrite it in C, but first I wanted to give Haskell a shot, if only to see how it compared to Python (hoping maybe I could convince my boss to let me use it more often if the results were good).

The first thing I wanted to check was whether Parsec was a viable option. Without even putting the data into lists/maps, I found it too slow.
FILE: compare_parsec.hs

{-# OPTIONS_GHC -O2 #-}
module Main where

import System.Environment (getArgs)
import Text.ParserCombinators.Parsec
csv = do
    x <- record `sepEndBy` many1 (oneOf "\n")
    eof
    return x

record = field `sepBy` char ','

field = many (anyToken)
main = do
    ~[filename] <- getArgs
    putStrLn "Parsing log..."
    res <- parseFromFile csv filename
    case res of
        Left err -> print err
        Right xs -> putStrLn "...Success"
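(A side note on the grammar itself: I suspect `field = many anyToken` is too greedy, since anyToken also matches commas and newlines, so even when parsing "succeeds" it probably produces one record containing one giant field. Something like the following is likely closer to what I meant, though the timings below are for the code as written above:)

csv = do
    xs <- record `sepEndBy` newline
    eof
    return xs

record = field `sepBy` char ','

field = many (noneOf ",\n")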
$ time ./compare_parsec 38807lineFile
Parsing log...
...Success

real    0m13.809s
user    0m11.560s
sys     0m2.180s

$ time ./compare_parsec 255919lineFile
Parsing log...
...Success

real    1m28.983s
user    1m8.480s
sys     0m9.530s

This, sadly, is significantly worse than the Python code above. Perhaps someone here can offer advice on more efficient use of Parsec? Unfortunately I don't have profiling libraries for Parsec available on this machine, nor have I had any luck finding material on the web.
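(For reference, if I do manage to get the profiling libraries installed, my plan was to rebuild with cost-centre profiling and see where the time and allocation actually go, roughly along these lines:

$ ghc -O2 -prof -auto-all --make compare_parsec.hs -o compare_parsec
$ ./compare_parsec 38807lineFile +RTS -p -RTS     # time/alloc profile in compare_parsec.prof
$ ./compare_parsec 38807lineFile +RTS -hc -RTS    # heap profile, rendered with: hp2ps -c compare_parsec.hp

If there is a better way to go about this, I'd be glad to hear it.)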
After this, I tried doing my own parsing, since the format is strict and regular. I tried pushing the data into lists (fast, but not very usable) and into maps (much easier for the analysis stage, and what the Python code does, but much worse speed-wise).

FILE: compare_lists.hs

{-# OPTIONS_GHC -O2 -fglasgow-exts #-}
module Main where

import System.Environment (getArgs)

type Field  = (String,String)
type Record = [Field]
type Log    = [Record]
main = do
    ~[filename1] <- getArgs
    file1 <- readFile filename1
    putStrLn "Parsing file1..."
    let data1 = parseLog file1
    print data1
    putStrLn "...Done"
-- parse file
parseLog :: String -> Log
parseLog log = foldr f [] (lines log)
    where f ""   a = a
          f "\n" a = a
          f x    a = (parseRecord x):a

-- parse record
parseRecord :: String -> Record
parseRecord record = foldr (\x a -> (parseField x):a) [] (split ',' record)

-- parse field
-- no error detection/handling now
parseField :: String -> Field
parseField s = (takeWhile isntCharEq s, tail $ dropWhile isntCharEq s)
isntCharEq :: Char -> Bool
isntCharEq '=' = False
isntCharEq _   = True
split :: Eq a => a -> [a] -> [[a]]
split delim = foldr f [[]]
    where f x rest@(r:rs)
              | x == delim = [] : rest
              | otherwise  = (x:r) : rs
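(Aside: one variant I've been meaning to try, but haven't benchmarked yet, is doing the same splitting over lazy ByteStrings instead of String, roughly like the sketch below. I may well be missing something about strictness here; the timings that follow are for the String version above.)

{-# OPTIONS_GHC -O2 #-}
module Main where

import qualified Data.ByteString.Lazy.Char8 as B
import System.Environment (getArgs)

type Field  = (B.ByteString, B.ByteString)
type Record = [Field]
type Log    = [Record]

-- same structure as the String version, but over lazy ByteStrings
parseLog :: B.ByteString -> Log
parseLog = map parseRecord . filter (not . B.null) . B.lines

parseRecord :: B.ByteString -> Record
parseRecord = map parseField . B.split ','

-- break each "key=value" chunk at the first '='
parseField :: B.ByteString -> Field
parseField s = let (k, v) = B.break (== '=') s
               in (k, B.drop 1 v)

main :: IO ()
main = do
    ~[filename] <- getArgs
    contents <- B.readFile filename
    print (length (parseLog contents))   -- force the list spine instead of printing everything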
I wasn't sure of the best way to force evaluation on this, so I opted to print it and direct stdout to /dev/null (and noted how long a program that just read and printed the same file to /dev/null took as well). Any better suggestions are welcome.

$ time ./compare_lists 38807lineFile > /dev/null

real    0m4.958s   (0m0.445s for the putStrLn $ readFile program)
user    0m4.830s
sys     0m0.110s

$ time ./compare_lists 255919lineFile > /dev/null

real    0m31.047s  (0m2.620s for the putStrLn $ readFile program)
user    0m30.310s
sys     0m0.690s

These speeds are roughly the same as Python's, but the data is easily usable by comparison, so finally I implemented the useful version:
FILE: compare_maps.hs

{-# OPTIONS_GHC -O2 -fglasgow-exts #-}
module Main where

import qualified Data.Map as Map
import System.Environment (getArgs)
import Control.Monad
type Field  = (String,String)
type Record = Map.Map String String
type Log    = Map.Map String Record
main = do
    ~[filename] <- getArgs
    d <- liftM parseLog $ readFile filename
    putStr $ strMaps d

-- parse file
parseLog :: String -> Log
parseLog logstr = foldr f Map.empty (lines logstr)
    where f ""   a = a
          f "\n" a = a
          f x    a = let (k,v) = parseRecord1 x in Map.insert k v a

-- parse record line
parseRecord1 :: String -> (String,Record)
parseRecord1 recordstr = do
    let r = parseRecord recordstr
    (idRecord r, r)
-- extract unique id used for comparison amongst line hashes in other files
idRecord :: Record -> String
idRecord r = case (Map.lookup "LogID" r) of
    Nothing -> "ERROR"
    Just id -> id
parseRecord :: String -> Record
parseRecord recordstr = foldr f Map.empty (split ',' recordstr)
    where f x a = case parseField x of
                      Nothing    -> a
                      Just (k,v) -> Map.insert k v a

-- parse k=v fields
parseField :: String -> Maybe Field
parseField "" = Nothing
parseField s  = Just (takeWhile isntCharEq s, tail $ dropWhile isntCharEq s)
isntCharEq :: Char -> Bool
isntCharEq '=' = False
isntCharEq _   = True

-- map functions
strMaps = Map.fold (\x a -> (strMap x) ++ "\n" ++ a) ""
strMap  = unlines . map (\(k,v) -> k ++ ":" ++ v) . Map.toAscList

-- list functions
split :: Eq a => a -> [a] -> [[a]]
split delim = foldr f [[]]
    where f x rest@(r:rs)
              | x == delim = [] : rest
              | otherwise  = (x:r) : rs
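(Aside: one simplification I've considered, though I don't know whether it actually changes the performance picture, is building the maps with Map.fromList instead of repeated inserts, i.e. replacing parseLog, parseRecord1, and parseRecord above with something like the following. It reuses parseField, idRecord, and split from above and needs an extra `import Data.Maybe (mapMaybe)`; the timings below are for the version as written above.)

-- needs: import Data.Maybe (mapMaybe)

parseLog :: String -> Log
parseLog = Map.fromList . map parseRecord1 . filter (not . null) . lines

parseRecord1 :: String -> (String, Record)
parseRecord1 s = let r = parseRecord s in (idRecord r, r)

-- build each record map in one go from the parsed fields
parseRecord :: String -> Record
parseRecord = Map.fromList . mapMaybe parseField . split ','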
$ time ./compare_maps 38807lineFile > /dev/null

real    0m24.207s  (0m0.445s for the putStrLn $ readFile program)
user    0m22.430s
sys     0m1.680s

$ time ./compare_maps 255919lineFile > /dev/null

real    5m24.927s  (0m2.620s for the putStrLn $ readFile program)
user    5m5.240s
sys     0m12.770s

Until this point I had only been throwing on -O2. I believe -fvia-C is included in -O currently, and I'm not well versed enough to know what other flags could help me out, so I left it at that.

Next I tried checking what the garbage collector was doing (I assume the footprint is smaller than Python's, but it would still be quite large). Turning on -Sstderr I saw that only about a third of the time was spent doing real work (67.9% of it in GC).

For the 38K file:

  2,794,465,788 bytes allocated in the heap
  1,956,668,360 bytes copied during GC (scavenged)
     21,884,948 bytes copied during GC (not scavenged)
    185,541,764 bytes maximum residency (16 sample(s))

   5329 collections in generation 0 (  9.03s)
     16 collections in generation 1 (  5.99s)

    523 Mb total memory in use

  MUT   time    7.10s  (  7.80s elapsed)
  GC    time   15.02s  ( 16.68s elapsed)
  Total time   22.12s  ( 24.48s elapsed)

  %GC time      67.9%  (68.1% elapsed)

For the 250K file:

 17,509,820,952 bytes allocated in the heap
 12,270,870,276 bytes copied during GC (scavenged)
    136,831,020 bytes copied during GC (not scavenged)
  1,307,822,788 bytes maximum residency (21 sample(s))

  33390 collections in generation 0 (121.65s)
     21 collections in generation 1 ( 42.37s)

   3244 Mb total memory in use

  MUT   time  146.25s  (151.51s elapsed)
  GC    time  164.02s  (172.73s elapsed)
  Total time  310.27s  (324.24s elapsed)

  %GC time      52.9%  (53.3% elapsed)

Not being particularly knowledgeable in this area, I tried some options that seemed reasonable after reading the GHC users guide section on RTS options for garbage collection
(http://www.haskell.org/ghc/docs/latest/html/users_guide/runtime-control.html#rts-options-gc).

The following are with the small (38K) file. Running with a larger heap (-H) didn't seem to help:

  256m  35s
  512m  37s
  768m  37s
  1g    29s

Playing with the allocation area size (-A) didn't help either:

  128k  29s
  256k  25s (default)
  512k  23s
  768k  25s
  1m    23s
  2m    32s
  5m    32s
  10m   32s

The 250K file shows some improvement from the little bit I've played around with:

  defaults    -> 5m24s total time, 33.3K collections, 53% GC time
  -H1g        -> 4m53s total time, 16.8K collections, 56% GC time
  -H3g -A5m   -> 4m16s total time,  3500 collections, 52% GC time
  -H4g -A50m  -> 4m26s total time,   350 collections, 57% GC time

so I'll try to run some more tests with different options for it, but even the best case is still around 52% GC time, and this isn't even remotely close to Python's speed, so I assume improving the code would prove more fruitful.

At this point I'm out of ideas, so I was hoping someone could identify something stupid I've done (I'm still a novice at FP in general, let alone high-performance FP) or direct me to a guide, website, paper, library, or some other form of help.

For reference, the tests above were done on an 8-processor (possibly 4 real but hyperthreaded?) 2.8GHz Xeon workstation with 16GB of memory running an SMP-enabled 2.4.21-47 Linux kernel and GHC 6.6.1.

Thanks
-- Joseph Re