
I was hoping someone could direct me to material on how I might go about optimizing a small parsing program I wrote. It is used for calculating differences between two files that each store a hash of hashes (each line is a hash, made up of "key=value" pairs separated by commas). The bottleneck seems to be getting the data into memory, in a usable structure, quickly, as I'm working with extremely large files (>250,000 lines and 100MB+ is normal, some are considerably larger, and I often have to process a large number of them). Extracting just the parsing elements of my current Python implementation:

FILE: compare.py
import sys

def parse(lines):
    print "Parsing", len(lines), "lines..."
    failed = 0
    hashes = {}
    for line in lines:
        try:
            hash = {}
            parts = line.split(",")
            for part in parts:
                k, v = part.split("=")
                hash[k] = v
            hashes[ hash["LogOffset"] ] = hash
        except:
            if line != "":
                failed += 1
    if failed > 0:
        print "[ERROR] Failed to parse:", failed, "lines"
    print "...Parsing resulted in", len(hashes), "unique hashes"
    return hashes

def normalize(lines):
    lineset = set()
    for line in lines:
        if "Type=Heartbeat" == line[0:14]:
            pass
        elif line == "":
            pass
        else:
            lineset.add(line)  # use a set to keep only unique lines
    return lineset

def main():
    hashes = parse(normalize(open(sys.argv[1]).readlines()))

if __name__ == '__main__':
    main()
$ time python compare.py 38807lineFile
Removed 52 bad lines
Parsing 38807 lines...
...Parsing resulted in 38807 unique hashes

real    0m3.807s
user    0m3.330s
sys     0m0.470s

$ time python compare.py 255919lineFile
Removed 0 bad lines
Parsing 255919 lines...
...Parsing resulted in 255868 unique hashes

real    0m30.889s
user    0m23.970s
sys     0m2.900s

*note: profiling shows over 7.1 million calls to the split() function, to give you a good idea of how many pairs the files contain.

Once you factor in the much larger file sizes, actually performing the analysis, and running it on a few dozen files, my tests started to become quite time consuming (not to mention that it takes 1GB of memory just for the 250K-line files, although execution time is still much more important at the moment, since RAM is cheap). Thus I figured I'd rewrite it in C, but first I wanted to give Haskell a shot, if only to see how it compared to Python (hoping maybe I could convince my boss to let me use it more often if the results were good).

The first thing I wanted to check was whether Parsec was a viable option. Without even putting the data into lists/maps, I found it too slow.

FILE: compare_parsec.hs
{-# OPTIONS_GHC -O2 #-}
module Main where

import System.Environment (getArgs)
import Text.ParserCombinators.Parsec

csv = do
    x <- record `sepEndBy` many1 (oneOf "\n")
    eof
    return x

record = field `sepBy` char ','

field = many anyToken

main = do
    ~[filename] <- getArgs
    putStrLn "Parsing log..."
    res <- parseFromFile csv filename
    case res of
        Left err -> print err
        Right xs -> putStrLn "...Success"
$ time ./compare_parsec 38807lineFile
Parsing log...
...Success

real    0m13.809s
user    0m11.560s
sys     0m2.180s

$ time ./compare_parsec 255919lineFile
Parsing log...
...Success

real    1m28.983s
user    1m8.480s
sys     0m9.530s

This, sadly, is significantly worse than the Python code above. Perhaps someone here can offer advice on more efficient use of Parsec? Unfortunately I don't have profiling libraries for Parsec available on this machine, nor have I had any luck finding material on the web.
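For what it's worth, one untested sketch I've been toying with gives each combinator a delimiter to stop at, so field can't wander past ',' and '\n' the way many anyToken does. This assumes every line really is just comma-separated key=value pairs, with no '=' or ',' inside keys or values, and I haven't measured it against the files above:

{-# OPTIONS_GHC -O2 #-}
module Main where

import System.Environment (getArgs)
import Text.ParserCombinators.Parsec

-- a log is records separated (and optionally terminated) by newlines
logFile :: Parser [[(String, String)]]
logFile = do
    rs <- record `sepEndBy` many1 newline
    eof
    return rs

-- a record is one or more key=value fields separated by commas
record :: Parser [(String, String)]
record = field `sepBy1` char ','

-- a field reads up to the '=', then up to the next ',' or newline
field :: Parser (String, String)
field = do
    k <- many1 (noneOf "=,\n")
    _ <- char '='
    v <- many (noneOf ",\n")
    return (k, v)

main :: IO ()
main = do
    ~[filename] <- getArgs
    res <- parseFromFile logFile filename
    case res of
        Left err -> print err
        Right xs -> putStrLn ("...parsed " ++ show (length xs) ++ " records")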
After this, I tried doing my own parsing, since the format is strict and regular. I tried pushing the data into lists (fast, but not very usable) and into maps (much easier for the analysis stage, and what the Python code does, but much worse speed-wise).

FILE: compare_lists.hs

{-# OPTIONS_GHC -O2 -fglasgow-exts #-}
module Main where

import System.Environment (getArgs)

type Field  = (String,String)
type Record = [Field]
type Log    = [Record]

main = do
    ~[filename1] <- getArgs
    file1 <- readFile filename1
    putStrLn "Parsing file1..."
    let data1 = parseLog file1
    print data1
    putStrLn "...Done"

-- parse file
parseLog :: String -> Log
parseLog log = foldr f [] (lines log)
  where
    f ""   a = a
    f "\n" a = a
    f x    a = parseRecord x : a

-- parse record
parseRecord :: String -> Record
parseRecord record = foldr (\x a -> parseField x : a) [] (split ',' record)

-- parse field
-- no error detection/handling for now
parseField :: String -> Field
parseField s = (takeWhile isntCharEq s, tail $ dropWhile isntCharEq s)

isntCharEq :: Char -> Bool
isntCharEq '=' = False
isntCharEq _   = True

split :: Eq a => a -> [a] -> [[a]]
split delim = foldr f [[]]
  where
    f x rest@(r:rs)
        | x == delim = [] : rest
        | otherwise  = (x:r) : rs
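(As a side note, split could also be written in terms of break from the Prelude instead of an explicit foldr; as far as I can tell this behaves identically, though I haven't timed it separately:)

split :: Eq a => a -> [a] -> [[a]]
split delim xs = case break (== delim) xs of
    -- no delimiter left: the remainder is the last chunk
    (chunk, [])       -> [chunk]
    -- drop the delimiter and keep splitting the rest
    (chunk, _ : rest) -> chunk : split delim rest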
I wasn't sure of the best way to force evaluation here, so I opted to print the result and redirect stdout to /dev/null (and I noted how long a program that just reads the file and prints it to /dev/null takes as well). Any better suggestions are welcome.

$ time ./compare_lists 38807lineFile > /dev/null
real    0m4.958s   (0m0.445s for the putStrLn $ readFile program)
user    0m4.830s
sys     0m0.110s

$ time ./compare_lists 255919lineFile > /dev/null
real    0m31.047s  (0m2.620s for the putStrLn $ readFile program)
user    0m30.310s
sys     0m0.690s
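One alternative I considered but haven't timed: replace main in compare_lists.hs with a version that collapses the structure to a single Int and prints only that, so evaluation can't be skipped but nothing large gets written out (I'm not certain this forces every last thunk):

-- forceLog collapses the parsed structure to one Int, so printing it walks
-- the spine of every record and field (Field/Record/Log are the types from
-- compare_lists.hs above)
forceLog :: Log -> Int
forceLog = sum . map (sum . map (\(k, v) -> length k + length v))

main = do
    ~[filename1] <- getArgs
    file1 <- readFile filename1
    putStrLn "Parsing file1..."
    print (forceLog (parseLog file1))
    putStrLn "...Done"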
These speeds are roughly the same as Python's, but Python's data structure is far more usable by comparison, so finally I implemented the useful version:

FILE: compare_maps.hs

{-# OPTIONS_GHC -O2 -fglasgow-exts #-}
module Main where

import qualified Data.Map as Map
import System.Environment (getArgs)
import Control.Monad

type Field  = (String,String)
type Record = Map.Map String String
type Log    = Map.Map String Record

main = do
    ~[filename] <- getArgs
    d <- liftM parseLog $ readFile filename
    putStr $ strMaps d

-- parse file
parseLog :: String -> Log
parseLog logstr = foldr f Map.empty (lines logstr)
  where
    f ""   a = a
    f "\n" a = a
    f x    a = let (k,v) = parseRecord1 x in Map.insert k v a

-- parse a record line
parseRecord1 :: String -> (String,Record)
parseRecord1 recordstr = do
    let r = parseRecord recordstr
    (idRecord r, r)

-- extract the unique id used for comparison amongst line hashes in other files
idRecord :: Record -> String
idRecord r = case Map.lookup "LogID" r of
    Nothing -> "ERROR"
    Just id -> id

parseRecord :: String -> Record
parseRecord recordstr = foldr f Map.empty (split ',' recordstr)
  where
    f x a = case parseField x of
        Nothing    -> a
        Just (k,v) -> Map.insert k v a

-- parse k=v fields
parseField :: String -> Maybe Field
parseField "" = Nothing
parseField s  = Just (takeWhile isntCharEq s, tail $ dropWhile isntCharEq s)

isntCharEq :: Char -> Bool
isntCharEq '=' = False
isntCharEq _   = True

-- map functions
strMaps = Map.fold (\x a -> strMap x ++ "\n" ++ a) ""
strMap  = unlines . map (\(k,v) -> k ++ ":" ++ v) . Map.toAscList

-- list functions
split :: Eq a => a -> [a] -> [[a]]
split delim = foldr f [[]]
  where
    f x rest@(r:rs)
        | x == delim = [] : rest
        | otherwise  = (x:r) : rs
$ time ./compare_maps 38807lineFile > /dev/null
real    0m24.207s  (0m0.445s for the putStrLn $ readFile program)
user    0m22.430s
sys     0m1.680s

$ time ./compare_maps 255919lineFile > /dev/null
real    5m24.927s  (0m2.620s for the putStrLn $ readFile program)
user    5m5.240s
sys     0m12.770s

Until this point I had only been throwing on -O2. I believe -fvia-C is included with -O currently, and I'm not well versed enough to know which other flags could help, so I left it at that. Next I tried checking what the garbage collector was doing (I assume the footprint is smaller than Python's, but it is still quite large). Flipping on -Sstderr, I saw that only about a third of the time was spent doing real work (67.9% in GC).

For the 38K file:
   2,794,465,788 bytes allocated in the heap
   1,956,668,360 bytes copied during GC (scavenged)
      21,884,948 bytes copied during GC (not scavenged)
     185,541,764 bytes maximum residency (16 sample(s))
    5329 collections in generation 0 (  9.03s)
      16 collections in generation 1 (  5.99s)
     523 Mb total memory in use
   MUT   time    7.10s (  7.80s elapsed)
   GC    time   15.02s ( 16.68s elapsed)
   Total time   22.12s ( 24.48s elapsed)
   %GC time     67.9%  (68.1% elapsed)

For the 250K file:
  17,509,820,952 bytes allocated in the heap
  12,270,870,276 bytes copied during GC (scavenged)
     136,831,020 bytes copied during GC (not scavenged)
   1,307,822,788 bytes maximum residency (21 sample(s))
   33390 collections in generation 0 (121.65s)
      21 collections in generation 1 ( 42.37s)
    3244 Mb total memory in use
   MUT   time  146.25s (151.51s elapsed)
   GC    time  164.02s (172.73s elapsed)
   Total time  310.27s (324.24s elapsed)
   %GC time     52.9%  (53.3% elapsed)

Not being particularly knowledgeable in this area, I tried some options that seemed reasonable after reading the GHC users' guide section on RTS options (http://www.haskell.org/ghc/docs/latest/html/users_guide/runtime-control.html#rts-options-gc). The following are with the small (38K) file.

Running with a larger heap (-H) didn't seem to help:
  256m  35s
  512m  37s
  768m  37s
  1g    29s

Playing with the allocation area size (-A) didn't help either:
  128k  29s
  256k  25s (default)
  512k  23s
  768k  25s
  1m    23s
  2m    32s
  5m    32s
  10m   32s

The 250K file shows some improvement from the little bit I've played around with:
  defaults    -> 5m24s, 33.3K collections, 53% GC time
  -H1g        -> 4m53s, 16.8K collections, 56% GC time
  -H3g -A5m   -> 4m16s,  3500 collections, 52% GC time
  -H4g -A50m  -> 4m26s,   350 collections, 57% GC time

I'll try to run some more tests with different options, but this still isn't even remotely close to Python's speed, so I assume improving the code would prove more fruitful.

At this point I'm out of ideas, so I was hoping someone could identify something stupid I've done (I'm still a novice at FP in general, let alone for high performance) or direct me to a guide, website, paper, library, or some other form of help.

For reference, the tests above were done on an 8-processor (possibly 4 real but hyperthreaded?) 2.8GHz Xeon workstation with 16GB of memory, running an SMP-enabled 2.4.21-47 Linux kernel and GHC 6.6.1.

Thanks

-- Joseph Re

Re, Joseph (IT) wrote:
At this point I'm out of ideas, so I was hoping someone could identify something stupid I've done (I'm still a novice at FP in general, let alone for high performance) or direct me to a guide, website, paper, library, or some other form of help.
Two ideas about your approaches:

(1) Try to avoid explicit recursion by using some standard library functions instead. It's easier (once you've learned the library) and may be faster (since the library may be written in an easy-to-optimize style).

(2) Try lazy ByteStrings; they should be faster. http://www.cse.unsw.edu.au/~dons/fps.html

As an example, here is a sort of the individual lines of a CSV-style file by key. csv parses the format, uncsv produces it. These functions can't handle '=' in the key or ',' in the key or value. treesort sorts by inserting everything into a map and reading it back out in ascending order:
import System.Environment
import qualified Data.ByteString.Lazy.Char8 as B
import qualified Data.Map as Map
import Control.Arrow (second)

csv =   (map $ map $ second B.tail . B.break (== '='))
      . (map $ B.split ',')
      . (B.split '\n')

uncsv =   (B.join $ B.pack "\n")
        . (map $ B.join $ B.pack ",")
        . (map $ map $ \(key, val) -> B.concat [key, B.pack "=", val])
treesort = Map.toAscList . Map.fromList
main = B.interact $ uncsv . map treesort . csv
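Untested, but the same ByteString operations could also be used to build the map of maps from your compare_maps.hs directly. A minimal sketch, assuming the "LogID" key from your code and the same restriction of no '=' or ',' inside keys or values:

module Main where

import System.Environment (getArgs)
import qualified Data.ByteString.Lazy.Char8 as B
import qualified Data.Map as Map

type Record = Map.Map B.ByteString B.ByteString
type Log    = Map.Map B.ByteString Record

-- "key=value" -> (key, value); B.drop 1 removes the '=' and simply leaves
-- an empty value if no '=' was present
parseField :: B.ByteString -> (B.ByteString, B.ByteString)
parseField f = let (k, v) = B.break (== '=') f in (k, B.drop 1 v)

parseRecord :: B.ByteString -> Record
parseRecord = Map.fromList . map parseField . B.split ','

-- index each record by its LogID field, skipping empty lines
parseLog :: B.ByteString -> Log
parseLog = Map.fromList . map keyed . filter (not . B.null) . B.lines
  where
    keyed line = let r = parseRecord line
                 in (Map.findWithDefault (B.pack "ERROR") (B.pack "LogID") r, r)

main :: IO ()
main = do
    ~[filename] <- getArgs
    contents <- B.readFile filename
    putStrLn ("parsed " ++ show (Map.size (parseLog contents)) ++ " records")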
Tillmann

2007/7/20, Tillmann Rendel
Re, Joseph (IT) wrote:
At this point I'm out of ideas, so I was hoping someone could identify something stupid I've done (I'm still a novice at FP in general, let alone for high performance) or direct me to a guide, website, paper, library, or some other form of help.
I think that your problem is simply an excess of laziness :). foldr is too lazy: you pass huge structures to the fold that you end up evaluating anyway. Try importing Data.List and using foldl' (not foldl), which is strict. Being a left fold, you will probably need to rework your functions a bit (I tried to simply flip them, but then the result of the program came out flipped too), but the time (and memory) difference is unbelievable (with foldl' the memory use stays constant).

Salvatore
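A minimal sketch of that rework against the compare_maps.hs code above (only the fold changes; the extra seq on the key is a guess at what helps, and this hasn't been measured against the original data):

import Data.List (foldl')
import qualified Data.Map as Map

-- same Log and parseRecord1 as in compare_maps.hs; only the fold changes
parseLog :: String -> Log
parseLog logstr = foldl' f Map.empty (lines logstr)
  where
    -- foldl' forces the accumulated Map to weak head normal form at every
    -- step, so the tree is built incrementally instead of as a long chain
    -- of Map.insert thunks that only collapses when the result is printed
    f a "" = a
    f a x  = let (k, r) = parseRecord1 x
             in k `seq` Map.insert k r a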

Oh, I didn't realize how much of a difference it would make. Thanks a lot!
-- Joseph Re
-----Original Message-----
From: Salvatore Insalaco [mailto:kirby81@gmail.com]
Sent: Friday, July 20, 2007 12:21 PM
To: Tillmann Rendel
Cc: Re, Joseph (IT); haskell-cafe@haskell.org
Subject: Re: [Haskell-cafe] Speedy parsing

Ah, I thought I might have to resort to one of the ByteStrings modules. I've heard of them but was more or less avoiding them due to some complexities with installing extra libraries in my current dev environment. I'll try to work that out with the sysadmins and try it out. Thanks for the advice.

-----Original Message-----
From: Tillmann Rendel [mailto:rendel@rbg.informatik.tu-darmstadt.de]
Sent: Thursday, July 19, 2007 8:48 PM
To: Re, Joseph (IT)
Cc: haskell-cafe@haskell.org
Subject: Re: [Haskell-cafe] Speedy parsing

Re, Joseph (IT) wrote:
Ah, I thought I might have to resort to one of the ByteStrings modules. I've heard of them but was more or less avoiding them due to some complexities with installing extra libraries in my current dev environment. I'll try to work that out with the sysadmins and try it out.
Maybe it is easier to just get ghc-6.6.1, which ships Data.ByteString & Co in the base library. Also, for cabal-ized libs (which is quickly becoming the rule), you can install with the --user flag and give it a directory where you have write permission.

On a different note: I've been wondering how difficult it would be to rewrite Parsec so that it doesn't act on a /list/ of tokens but on a Sequence (defined in Data.Collections from the collections package), because this would mean one could use String as well as ByteString.

Cheers
Ben
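A rough sketch of the kind of abstraction that would require, purely as an illustration of the idea (the class and instances here are made up for the example; they are not the actual Data.Collections or Parsec API):

{-# LANGUAGE FlexibleInstances #-}
module Stream where

import qualified Data.ByteString.Lazy.Char8 as B

-- the one thing a Parsec-style parser really needs from its input:
-- a way to take the next token off the front, or learn that the input is empty
class Stream s where
    uncons :: s -> Maybe (Char, s)

instance Stream [Char] where
    uncons []       = Nothing
    uncons (c : cs) = Just (c, cs)

instance Stream B.ByteString where
    uncons = B.uncons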

On 22 jul 2007, at 23.46, Benjamin Franksen wrote:
On a different note: I've been wondering how difficult it would be to re-write Parsec so that it doesn't act on a /list/ of tokens but on a Sequence (defined in Data.Collections from collections package). Because this would mean one could use String as well as ByteString.
That's (Summer of Code) work in progress: http://code.google.com/soc/2007/haskell/appinfo.html?csaid=B97EF4562EF3B244

/ Thomas

Thomas Schilling wrote:
On 22 jul 2007, at 23.46, Benjamin Franksen wrote:
On a different note: I've been wondering how difficult it would be to re-write Parsec so that it doesn't act on a /list/ of tokens but on a Sequence (defined in Data.Collections from collections package). Because this would mean one could use String as well as ByteString.
That's (Summer of Code) work in progress:
http://code.google.com/soc/2007/haskell/appinfo.html?csaid=B97EF4562EF3B244
Thanks, this is excellent news. Ben
participants (5):
- Benjamin Franksen
- Re, Joseph (IT)
- Salvatore Insalaco
- Thomas Schilling
- Tillmann Rendel