
I was hoping someone could direct me to material on how I might go about optimizing a small parsing program I wrote. It is used for calculating differences between two files that each store a hash of hashes (each line is a hash, made up of "key=value" pairs separated by commas). The bottleneck seems to be getting the data into memory, in a usable structure, quickly, as I'm working with extremely large files (>250,000 lines and 100MB+ is normal, some are considerably larger, and I often have to process a large number of them). Extracting just the parsing elements of my current Python implementation:

FILE: compare.py
import sys

def parse(lines):
    print "Parsing", len(lines), "lines..."
    failed = 0
    hashes = {}
    for line in lines:
        try:
            hash = {}
            parts = line.split(",")
            for part in parts:
                k, v = part.split("=")
                hash[k] = v
            hashes[ hash["LogOffset"] ] = hash
        except:
            if line != "":
                failed += 1
    if failed > 0:
        print "[ERROR] Failed to parse:", failed, "lines"
    print "...Parsing resulted in", len(hashes), "unique hashes"
    return hashes

def normalize(lines):
    lineset = set()
    for line in lines:
        if "Type=Heartbeat" == line[0:14]:
            pass
        elif line == "":
            pass
        else:
            lineset.add(line)  # use a set to keep only unique lines
    return lineset

def main():
    hashes = parse(normalize(open(sys.argv[1]).readlines()))

if __name__ == '__main__':
    main()
$ time python compare.py 38807lineFile
Removed 52 bad lines
Parsing 38807 lines...
...Parsing resulted in 38807 unique hashes

real    0m3.807s
user    0m3.330s
sys     0m0.470s

$ time python compare.py 255919lineFile
Removed 0 bad lines
Parsing 255919 lines...
...Parsing resulted in 255868 unique hashes

real    0m30.889s
user    0m23.970s
sys     0m2.900s

*note: profiling shows over 7.1 million calls to the split() function, to give you a good idea of how many pairs the files contain.

Once you factor in the much larger file sizes, actually performing the analysis, and running it on a few dozen files, my tests started to become quite time consuming (not to mention that it takes 1GB of memory just for the 250K-line files, although execution time is still much more important at the moment, since RAM is cheap). Thus I figured I'd rewrite it in C, but first I wanted to give Haskell a shot, if only to see how it compared to Python (hoping maybe I could convince my boss to let me use it more often if the results were good).

The first thing I wanted to check was whether Parsec was a viable option. Without even putting the data into lists/maps, I found it too slow.

FILE: compare_parsec.hs
{-# OPTIONS_GHC -O2 #-}
module Main where

import System.Environment (getArgs)
import Text.ParserCombinators.Parsec

csv = do
    x <- record `sepEndBy` many1 (oneOf "\n")
    eof
    return x

record = field `sepBy` char ','

field = many anyToken

main = do
    ~[filename] <- getArgs
    putStrLn "Parsing log..."
    res <- parseFromFile csv filename
    case res of
        Left err -> print err
        Right xs -> putStrLn "...Success"
$ time ./compare_parsec 38807lineFile
Parsing log...
...Success

real    0m13.809s
user    0m11.560s
sys     0m2.180s

$ time ./compare_parsec 255919lineFile
Parsing log...
...Success

real    1m28.983s
user    1m8.480s
sys     0m9.530s

This, sadly, is significantly worse than the Python code above. Perhaps someone here can offer advice on more efficient use of Parsec? Unfortunately I don't have profiling libraries for Parsec available on this machine, nor have I had any luck finding material on the web.
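For what it's worth, one untested sketch I've been toying with gives each combinator a delimiter to stop at, so field can't wander past ',' and '\n' the way many anyToken does. This assumes every line really is just comma-separated key=value pairs, with no '=' or ',' inside keys or values, and I haven't measured it against the files above:

{-# OPTIONS_GHC -O2 #-}
module Main where

import System.Environment (getArgs)
import Text.ParserCombinators.Parsec

-- a log is records separated (and optionally terminated) by newlines
logFile :: Parser [[(String, String)]]
logFile = do
    rs <- record `sepEndBy` many1 newline
    eof
    return rs

-- a record is one or more key=value fields separated by commas
record :: Parser [(String, String)]
record = field `sepBy1` char ','

-- a field reads up to the '=', then up to the next ',' or newline
field :: Parser (String, String)
field = do
    k <- many1 (noneOf "=,\n")
    _ <- char '='
    v <- many (noneOf ",\n")
    return (k, v)

main :: IO ()
main = do
    ~[filename] <- getArgs
    res <- parseFromFile logFile filename
    case res of
        Left err -> print err
        Right xs -> putStrLn ("...parsed " ++ show (length xs) ++ " records")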
After this, I tried doing my own parsing, since the format is strict and regular. I tried pushing the data into lists (fast, but not very usable) and into maps (much easier for the analysis stage, and what the Python code does, but much worse speed-wise).

FILE: compare_lists.hs

{-# OPTIONS_GHC -O2 -fglasgow-exts #-}
module Main where

import System.Environment (getArgs)

type Field  = (String,String)
type Record = [Field]
type Log    = [Record]

main = do
    ~[filename1] <- getArgs
    file1 <- readFile filename1
    putStrLn "Parsing file1..."
    let data1 = parseLog file1
    print data1
    putStrLn "...Done"

-- parse file
parseLog :: String -> Log
parseLog log = foldr f [] (lines log)
  where
    f ""   a = a
    f "\n" a = a
    f x    a = parseRecord x : a

-- parse record
parseRecord :: String -> Record
parseRecord record = foldr (\x a -> parseField x : a) [] (split ',' record)

-- parse field
-- no error detection/handling for now
parseField :: String -> Field
parseField s = (takeWhile isntCharEq s, tail $ dropWhile isntCharEq s)

isntCharEq :: Char -> Bool
isntCharEq '=' = False
isntCharEq _   = True

split :: Eq a => a -> [a] -> [[a]]
split delim = foldr f [[]]
  where
    f x rest@(r:rs)
        | x == delim = [] : rest
        | otherwise  = (x:r) : rs
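(As a side note, split could also be written in terms of break from the Prelude instead of an explicit foldr; as far as I can tell this behaves identically, though I haven't timed it separately:)

split :: Eq a => a -> [a] -> [[a]]
split delim xs = case break (== delim) xs of
    -- no delimiter left: the remainder is the last chunk
    (chunk, [])       -> [chunk]
    -- drop the delimiter and keep splitting the rest
    (chunk, _ : rest) -> chunk : split delim rest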
I wasn't sure of the best way to force evaluation here, so I opted to print the result and redirect stdout to /dev/null (and I noted how long a program that just reads the file and prints it to /dev/null takes as well). Any better suggestions are welcome.

$ time ./compare_lists 38807lineFile > /dev/null
real    0m4.958s   (0m0.445s for the putStrLn $ readFile program)
user    0m4.830s
sys     0m0.110s

$ time ./compare_lists 255919lineFile > /dev/null
real    0m31.047s  (0m2.620s for the putStrLn $ readFile program)
user    0m30.310s
sys     0m0.690s
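One alternative I considered but haven't timed: replace main in compare_lists.hs with a version that collapses the structure to a single Int and prints only that, so evaluation can't be skipped but nothing large gets written out (I'm not certain this forces every last thunk):

-- forceLog collapses the parsed structure to one Int, so printing it walks
-- the spine of every record and field (Field/Record/Log are the types from
-- compare_lists.hs above)
forceLog :: Log -> Int
forceLog = sum . map (sum . map (\(k, v) -> length k + length v))

main = do
    ~[filename1] <- getArgs
    file1 <- readFile filename1
    putStrLn "Parsing file1..."
    print (forceLog (parseLog file1))
    putStrLn "...Done"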
These speeds are roughly the same as Python's, but Python's data structure is far more usable by comparison, so finally I implemented the useful version:

FILE: compare_maps.hs

{-# OPTIONS_GHC -O2 -fglasgow-exts #-}
module Main where

import qualified Data.Map as Map
import System.Environment (getArgs)
import Control.Monad

type Field  = (String,String)
type Record = Map.Map String String
type Log    = Map.Map String Record

main = do
    ~[filename] <- getArgs
    d <- liftM parseLog $ readFile filename
    putStr $ strMaps d

-- parse file
parseLog :: String -> Log
parseLog logstr = foldr f Map.empty (lines logstr)
  where
    f ""   a = a
    f "\n" a = a
    f x    a = let (k,v) = parseRecord1 x in Map.insert k v a

-- parse a record line
parseRecord1 :: String -> (String,Record)
parseRecord1 recordstr = do
    let r = parseRecord recordstr
    (idRecord r, r)

-- extract the unique id used for comparison amongst line hashes in other files
idRecord :: Record -> String
idRecord r = case Map.lookup "LogID" r of
    Nothing -> "ERROR"
    Just id -> id

parseRecord :: String -> Record
parseRecord recordstr = foldr f Map.empty (split ',' recordstr)
  where
    f x a = case parseField x of
        Nothing    -> a
        Just (k,v) -> Map.insert k v a

-- parse k=v fields
parseField :: String -> Maybe Field
parseField "" = Nothing
parseField s  = Just (takeWhile isntCharEq s, tail $ dropWhile isntCharEq s)

isntCharEq :: Char -> Bool
isntCharEq '=' = False
isntCharEq _   = True

-- map functions
strMaps = Map.fold (\x a -> strMap x ++ "\n" ++ a) ""
strMap  = unlines . map (\(k,v) -> k ++ ":" ++ v) . Map.toAscList

-- list functions
split :: Eq a => a -> [a] -> [[a]]
split delim = foldr f [[]]
  where
    f x rest@(r:rs)
        | x == delim = [] : rest
        | otherwise  = (x:r) : rs
$ time ./compare_maps 38807lineFile > /dev/null
real    0m24.207s  (0m0.445s for the putStrLn $ readFile program)
user    0m22.430s
sys     0m1.680s

$ time ./compare_maps 255919lineFile > /dev/null
real    5m24.927s  (0m2.620s for the putStrLn $ readFile program)
user    5m5.240s
sys     0m12.770s

Until this point I had only been throwing on -O2. I believe -fvia-C is included with -O currently, and I'm not well versed enough to know which other flags could help, so I left it at that. Next I tried checking what the garbage collector was doing (I assume the footprint is smaller than Python's, but it is still quite large). Flipping on -Sstderr, I saw that only about a third of the time was spent doing real work (67.9% in GC).

For the 38K file:
   2,794,465,788 bytes allocated in the heap
   1,956,668,360 bytes copied during GC (scavenged)
      21,884,948 bytes copied during GC (not scavenged)
     185,541,764 bytes maximum residency (16 sample(s))
    5329 collections in generation 0 (  9.03s)
      16 collections in generation 1 (  5.99s)
     523 Mb total memory in use
   MUT   time    7.10s (  7.80s elapsed)
   GC    time   15.02s ( 16.68s elapsed)
   Total time   22.12s ( 24.48s elapsed)
   %GC time     67.9%  (68.1% elapsed)

For the 250K file:
  17,509,820,952 bytes allocated in the heap
  12,270,870,276 bytes copied during GC (scavenged)
     136,831,020 bytes copied during GC (not scavenged)
   1,307,822,788 bytes maximum residency (21 sample(s))
   33390 collections in generation 0 (121.65s)
      21 collections in generation 1 ( 42.37s)
    3244 Mb total memory in use
   MUT   time  146.25s (151.51s elapsed)
   GC    time  164.02s (172.73s elapsed)
   Total time  310.27s (324.24s elapsed)
   %GC time     52.9%  (53.3% elapsed)

Not being particularly knowledgeable in this area, I tried some options that seemed reasonable after reading the GHC users' guide section on RTS options (http://www.haskell.org/ghc/docs/latest/html/users_guide/runtime-control.html#rts-options-gc). The following are with the small (38K) file.

Running with a larger heap (-H) didn't seem to help:
  256m  35s
  512m  37s
  768m  37s
  1g    29s

Playing with the allocation area size (-A) didn't help either:
  128k  29s
  256k  25s (default)
  512k  23s
  768k  25s
  1m    23s
  2m    32s
  5m    32s
  10m   32s

The 250K file shows some improvement from the little bit I've played around with:
  defaults    -> 5m24s, 33.3K collections, 53% GC time
  -H1g        -> 4m53s, 16.8K collections, 56% GC time
  -H3g -A5m   -> 4m16s,  3500 collections, 52% GC time
  -H4g -A50m  -> 4m26s,   350 collections, 57% GC time

I'll try to run some more tests with different options, but this still isn't even remotely close to Python's speed, so I assume improving the code would prove more fruitful.

At this point I'm out of ideas, so I was hoping someone could identify something stupid I've done (I'm still a novice at FP in general, let alone for high performance) or direct me to a guide, website, paper, library, or some other form of help.

For reference, the tests above were done on an 8-processor (possibly 4 real but hyperthreaded?) 2.8GHz Xeon workstation with 16GB of memory, running an SMP-enabled 2.4.21-47 Linux kernel and GHC 6.6.1.

Thanks

-- Joseph Re

Re, Joseph (IT) wrote:
At this point I'm out of ideas, so I was hoping someone could identify something stupid I've done (I'm still a novice at FP in general, let alone for high performance) or direct me to a guide, website, paper, library, or some other form of help.
Two ideas about your approaches:

(1) Try to avoid explicit recursion by using some standard library functions instead. It's easier (once you've learned the library) and may be faster (since the library may be written in an easy-to-optimize style).

(2) Try lazy ByteStrings; they should be faster. http://www.cse.unsw.edu.au/~dons/fps.html

As an example, here is a sort of the individual lines of a CSV-style file by key. csv parses the format, uncsv produces it. These functions can't handle '=' in the key or ',' in the key or value. treesort sorts by inserting everything into a map and reading it back out in ascending order:
import System.Environment
import qualified Data.ByteString.Lazy.Char8 as B
import qualified Data.Map as Map
import Control.Arrow (second)

csv =   (map $ map $ second B.tail . B.break (== '='))
      . (map $ B.split ',')
      . (B.split '\n')

uncsv =   (B.join $ B.pack "\n")
        . (map $ B.join $ B.pack ",")
        . (map $ map $ \(key, val) -> B.concat [key, B.pack "=", val])
treesort = Map.toAscList . Map.fromList
main = B.interact $ uncsv . map treesort . csv
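Untested, but the same ByteString operations could also be used to build the map of maps from your compare_maps.hs directly. A minimal sketch, assuming the "LogID" key from your code and the same restriction of no '=' or ',' inside keys or values:

module Main where

import System.Environment (getArgs)
import qualified Data.ByteString.Lazy.Char8 as B
import qualified Data.Map as Map

type Record = Map.Map B.ByteString B.ByteString
type Log    = Map.Map B.ByteString Record

-- "key=value" -> (key, value); B.drop 1 removes the '=' and simply leaves
-- an empty value if no '=' was present
parseField :: B.ByteString -> (B.ByteString, B.ByteString)
parseField f = let (k, v) = B.break (== '=') f in (k, B.drop 1 v)

parseRecord :: B.ByteString -> Record
parseRecord = Map.fromList . map parseField . B.split ','

-- index each record by its LogID field, skipping empty lines
parseLog :: B.ByteString -> Log
parseLog = Map.fromList . map keyed . filter (not . B.null) . B.lines
  where
    keyed line = let r = parseRecord line
                 in (Map.findWithDefault (B.pack "ERROR") (B.pack "LogID") r, r)

main :: IO ()
main = do
    ~[filename] <- getArgs
    contents <- B.readFile filename
    putStrLn ("parsed " ++ show (Map.size (parseLog contents)) ++ " records")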
Tillmann

2007/7/20, Tillmann Rendel
Re, Joseph (IT) wrote:
At this point I'm out of ideas, so I was hoping someone could identify something stupid I've done (I'm still a novice at FP in general, let alone for high performance) or direct me to a guide, website, paper, library, or some other form of help.
I think that your problem is simply an excess of laziness :). foldr is too lazy: you pass huge structures to the fold that you end up evaluating anyway. Try importing Data.List and using foldl' (not foldl), which is strict. Being a left fold, you will probably need to rework your functions a bit (I tried to simply flip them, but then the result of the program came out flipped too), but the time (and memory) difference is unbelievable (with foldl' the memory use stays constant).

Salvatore
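A minimal sketch of that rework against the compare_maps.hs code above (only the fold changes; the extra seq on the key is a guess at what helps, and this hasn't been measured against the original data):

import Data.List (foldl')
import qualified Data.Map as Map

-- same Log and parseRecord1 as in compare_maps.hs; only the fold changes
parseLog :: String -> Log
parseLog logstr = foldl' f Map.empty (lines logstr)
  where
    -- foldl' forces the accumulated Map to weak head normal form at every
    -- step, so the tree is built incrementally instead of as a long chain
    -- of Map.insert thunks that only collapses when the result is printed
    f a "" = a
    f a x  = let (k, r) = parseRecord1 x
             in k `seq` Map.insert k r a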

Oh, I didn't realize how much of a difference it would make. Thanks a lot!
-- Joseph Re
-----Original Message-----
From: Salvatore Insalaco [mailto:kirby81@gmail.com]
Sent: Friday, July 20, 2007 12:21 PM
To: Tillmann Rendel
Cc: Re, Joseph (IT); haskell-cafe@haskell.org
Subject: Re: [Haskell-cafe] Speedy parsing

Ah, I thought I might have to resort to one of the ByteStrings modules. I've heard of them but was more or less avoiding them due to some complexities with installing extra libraries in my current dev environment. I'll try to work that out with the sysadmins and try it out. Thanks for the advice.

-----Original Message-----
From: Tillmann Rendel [mailto:rendel@rbg.informatik.tu-darmstadt.de]
Sent: Thursday, July 19, 2007 8:48 PM
To: Re, Joseph (IT)
Cc: haskell-cafe@haskell.org
Subject: Re: [Haskell-cafe] Speedy parsing

Re, Joseph (IT) wrote:
Ah, I thought I might have to resort to one of the ByteStrings modules. I've heard of them but was more or less avoiding them due to some complexities with installing extra libraries in my current dev environment. I'll try to work that out with the sysadmins and try it out.
Maybe it is easier to just get ghc-6.6.1, which ships Data.ByteString & Co in the base library. Also, for cabal-ized libs (which is quickly becoming the rule), you can install with the --user flag and give it a directory where you have write permission.

On a different note: I've been wondering how difficult it would be to rewrite Parsec so that it doesn't act on a /list/ of tokens but on a Sequence (defined in Data.Collections from the collections package), because this would mean one could use String as well as ByteString.

Cheers
Ben
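A rough sketch of the kind of abstraction that would require, purely as an illustration of the idea (the class and instances here are made up for the example; they are not the actual Data.Collections or Parsec API):

{-# LANGUAGE FlexibleInstances #-}
module Stream where

import qualified Data.ByteString.Lazy.Char8 as B

-- the one thing a Parsec-style parser really needs from its input:
-- a way to take the next token off the front, or learn that the input is empty
class Stream s where
    uncons :: s -> Maybe (Char, s)

instance Stream [Char] where
    uncons []       = Nothing
    uncons (c : cs) = Just (c, cs)

instance Stream B.ByteString where
    uncons = B.uncons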

On 22 jul 2007, at 23.46, Benjamin Franksen wrote:
On a different note: I've been wondering how difficult it would be to re-write Parsec so that it doesn't act on a /list/ of tokens but on a Sequence (defined in Data.Collections from collections package). Because this would mean one could use String as well as ByteString.
That's (Summer of Code) work in progress: http://code.google.com/soc/2007/haskell/appinfo.html?csaid=B97EF4562EF3B244

/ Thomas

Thomas Schilling wrote:
On 22 jul 2007, at 23.46, Benjamin Franksen wrote:
On a different note: I've been wondering how difficult it would be to re-write Parsec so that it doesn't act on a /list/ of tokens but on a Sequence (defined in Data.Collections from collections package). Because this would mean one could use String as well as ByteString.
That's (Summer of Code) work in progress:
http://code.google.com/soc/2007/haskell/appinfo.html?csaid=B97EF4562EF3B244
Thanks, this is excellent news. Ben
participants (5):
- Benjamin Franksen
- Re, Joseph (IT)
- Salvatore Insalaco
- Thomas Schilling
- Tillmann Rendel