help with Haskell performance

Hi all, The task I'm trying to accomplish: Given a log file containing several lines of white space delimited entries like this: [Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist [Sat Oct 24 08:12:37 2009] [error] GET /url2 HTTP/1.0]: Requested URI does not exist [Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist [Sat Oct 24 12:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist filter lines that match the string " 08:", extract the 6th, 7th and 8th words from that line, group all lines that have the the same resulting string, do a count on them and sort the result in descending order of counts and print it out. So in the example above we'd end up with an output like this: ("GET /url1 HTTP/1.1]:", 2) ("GET /url2 HTTP/1.0]:", 1) Seems pretty straightforward, so I wrote a simple perl script to achieve this task (see the bottom of this email). The input file is 335 MB in size and contains about 2 million log line entires in it. The perl script does a pretty decent job and finishes in about 3 seconds. Now the interesting part. I decided to implememt this in Haskell (being my favorite language and all) and ended up with the following code: --- begin haskell code --- import Text.Regex.Posix ( (=~) ) import qualified Data.List as List import qualified Data.Map as Map import qualified Data.ByteString.Lazy.Char8 as LB main = do contents <- LB.readFile "log_file" putStr . unlines . map ( show . (\(x, y) -> ((LB.unpack x), y)) ) . -- create a Map grouping & counting matching tokens and sort based on the counts List.sortBy (\(_, x) (_, y) -> y `compare` x) . Map.toList . Map.fromListWith (+) . filtertokens . LB.lines $ contents where filtertokens = foldr (\x acc -> if (f x) then ((g x) : acc) else acc) [] -- filter lines starting with " 08:" where f = (=~ " 08:") . LB.unpack -- extract tokens 6, 7 & 8 and create an association list like so ("GET /url2 HTTP/1.0]:", 1) g line = flip (,) 1 . LB.unwords . map (xs !!) $ [6, 7, 8] where xs = LB.words line --- end haskell code --- This haskell implementation takes a whopping 27 seconds to complete! About 9 times slower than the perl version! I'm using ghc 6.10.4, compiling with -O2 and even went to the extent of fusing an adjacent map and filter using a foldr like so: map f (filter g) => foldr ( if g x then f x ... ), fusing adjacents maps etc. Still the same result. I really hope I'm missing some obvious optimization that's making it so slow compared to the perl version, hence this email soliciting feedback. Thanks in advance. P.S. For reference, here's my corresponding perl implementation: --- start perl code --- #!/usr/bin/perl use strict; use warnings FATAL => 'all'; my %urls; open(FILE, '<', $ARGV[0]); while(<FILE>) { if (/ 08:/) { my @words = split; my $key = join(" ", ($words[6], $words[7], $words[8])); if (exists $urls{$key}) { $urls{$key}++ } else { $urls{$key} = 1 } } } for (sort { $urls{$b} <=> $urls{$a} } keys %urls) { print "($_, $urls{$_})\n" } --- end perl code ---

2009/11/8 Gokul P. Nair
Hi all,
The task I'm trying to accomplish:
Given a log file containing several lines of white space delimited entries like this:
[Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist
[Sat Oct 24 08:12:37 2009] [error] GET /url2 HTTP/1.0]: Requested URI does not exist
[Sat Oct 24 08:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist
[Sat Oct 24 12:12:37 2009] [error] GET /url1 HTTP/1.1]: Requested URI does not exist
filter lines that match the string " 08:", extract the 6th, 7th and 8th words from that line, group all lines that have the same resulting string, do a count on them and sort the result in descending order of counts and print it out. So in the example above we'd end up with an output like this:
("GET /url1 HTTP/1.1]:", 2) ("GET /url2 HTTP/1.0]:", 1)
Seems pretty straightforward, so I wrote a simple perl script to achieve this task (see the bottom of this email).
The input file is 335 MB in size and contains about 2 million log line entries in it. The perl script does a pretty decent job and finishes in about 3 seconds.
Now the interesting part. I decided to implement this in Haskell (being my favorite language and all) and ended up with the following code:
--- begin haskell code ---
import Text.Regex.Posix ( (=~) )
First, you are using Text.Regex.Posix, which is dog slow. Use regex-tdfa or regex-pcre instead. Second, you are using it on a String! Third, you are unpacking a lazy ByteString just to do that! This alone is more than enough to make things ridiculously slow.
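For a fixed literal like " 08:", in fact, no regex is needed at all: a substring search on the ByteString avoids both the unpack and the regex engine. A minimal sketch, assuming the input is read as a strict ByteString:

import qualified Data.ByteString.Char8 as B

-- regex-free test for the fixed pattern: a substring search directly
-- on the ByteString, with no unpacking involved
matchesHour :: B.ByteString -> Bool
matchesHour = B.isInfixOf (B.pack " 08:")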
import qualified Data.List as List
import qualified Data.Map as Map
import qualified Data.ByteString.Lazy.Char8 as LB
main = do
  contents <- LB.readFile "log_file"
  putStr . unlines . map ( show . (\(x, y) -> ((LB.unpack x), y)) ) .
This piece of code also does unpacking. If the output is small, that's OK; otherwise, assemble the output in the form of a ByteString and print it with B.putStr.
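For example, a sketch of that approach (render is a hypothetical name; the format mirrors the show output above):

import qualified Data.ByteString.Lazy.Char8 as LB

-- build the whole report as one lazy ByteString and emit it with a
-- single LB.putStr, so the keys are never unpacked
render :: [(LB.ByteString, Int)] -> LB.ByteString
render = LB.unlines . map pair
  where pair (k, n) = LB.concat [LB.pack "(", k, LB.pack ", ", LB.pack (show n), LB.pack ")"]

-- usage: LB.putStr (render sortedCounts)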
-- create a Map grouping & counting matching tokens and sort based on the counts
List.sortBy (\(_, x) (_, y) -> y `compare` x) . Map.toList . Map.fromListWith (+) . filtertokens .

The lambda can be replaced by "flip (comparing snd)".
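With the import it needs, that is (a sketch):

import Data.Ord (comparing)
import qualified Data.List as List

-- descending order on the counts; flip reverses the comparison
sortDesc :: [(k, Int)] -> [(k, Int)]
sortDesc = List.sortBy (flip (comparing snd))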
LB.lines $ contents

Here, you should not use Map.fromListWith (+) because Map is not strict in its entries and you end up having big fat thunks there. You should use Map.fromListWith plus where x `plus` y = x `seq` y `seq` x + y.
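Spelled out as a self-contained fragment (though see Felipe's follow-up below, which refines this):

-- the strict combining function: forces both arguments whenever
-- the combined value is demanded
plus :: Int -> Int -> Int
x `plus` y = x `seq` y `seq` x + y

-- then: Map.fromListWith plus   instead of   Map.fromListWith (+)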
where filtertokens = foldr (\x acc -> if (f x) then ((g x) : acc) else acc) []
        -- filter lines matching " 08:"
        where f = (=~ " 08:") . LB.unpack
              -- extract tokens 6, 7 & 8 and create an association pair like ("GET /url2 HTTP/1.0]:", 1)
              g line = flip (,) 1 . LB.unwords . map (xs !!) $ [6, 7, 8]
                where xs = LB.words line

You are using random access on a list three times in a row. map (xs !!) [6, 7, 8] is much faster when implemented as take 3 (drop 6 xs).
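With that change, g becomes a single pass down the token list (a sketch, keeping the LB alias from the code above):

import qualified Data.ByteString.Lazy.Char8 as LB

-- g without random access: drop to token 6, take three, rejoin;
-- one traversal instead of three (!!) lookups
g :: LB.ByteString -> (LB.ByteString, Int)
g line = flip (,) 1 . LB.unwords . take 3 . drop 6 . LB.words $ line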
So... well, try these suggestions, show the resulting performance and then we'll see.
--- end haskell code ---
This haskell implementation takes a whopping 27 seconds to complete! About 9 times slower than the perl version! I'm using ghc 6.10.4, compiling with -O2, and even went to the extent of fusing an adjacent map and filter using a foldr, i.e. map f (filter g xs) => foldr (\x acc -> if g x then f x : acc else acc) [] xs, fusing adjacent maps, etc. Still the same result.
I really hope I'm missing some obvious optimization that's making it so slow compared to the perl version, hence this email soliciting feedback.
Thanks in advance.
P.S. For reference, here's my corresponding perl implementation:
--- start perl code ---
#!/usr/bin/perl
use strict;
use warnings FATAL => 'all';
my %urls;
open(FILE, '<', $ARGV[0]);
while (<FILE>) {
  if (/ 08:/) {
    my @words = split;
    my $key = join(" ", ($words[6], $words[7], $words[8]));
    if (exists $urls{$key}) { $urls{$key}++ } else { $urls{$key} = 1 }
  }
}
for (sort { $urls{$b} <=> $urls{$a} } keys %urls) {
  print "($_, $urls{$_})\n"
}
--- end perl code ---
-- Eugene Kirpichov
Web IR developer, market.yandex.ru

On Sun, Nov 08, 2009 at 12:15:57AM +0300, Eugene Kirpichov wrote:
Here, you should not use Map.fromListWith (+) because Map is not strict in its entries and you end up having big fat thunks there. You should use Map.fromListWith plus where x `plus` y = x `seq` y `seq` x+y.
fromListWith is implemented with insertWithKey; is defining plus like above enough? I would guess that we need fromListWith' implemented with insertWithKey'.

The problem is that it will build a thunk (x `plus` y) and your seq's only say that whenever you evaluate that thunk you'll evaluate the arguments as well. If the thunk itself isn't needed, it won't be evaluated.

-- Felipe.

Ah, you're right. Then we need a foldl' insertWith with a strict plus.
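In code, the shape being described might look like this (a sketch; countTokens is a hypothetical name, and insertWith' is the strict insert from the containers of the day, spelled Data.Map.Strict.insertWith in later versions):

import qualified Data.List as List
import qualified Data.Map as Map

-- strict left fold over the keys; insertWith' forces the combined
-- count before inserting, so no (+) thunks pile up in the map
countTokens :: Ord k => [k] -> Map.Map k Int
countTokens = List.foldl' (\m k -> Map.insertWith' (+) k 1 m) Map.empty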
2009/11/8 Felipe Lessa
On Sun, Nov 08, 2009 at 12:15:57AM +0300, Eugene Kirpichov wrote:
Here, you should not use Map.fromListWith (+) because Map is not strict in its entries and you end up having big fat thunks there. You should use Map.fromListWith plus where x `plus` y = x `seq` y `seq` x+y.
fromListWith is implemented with insertWithKey; is defining plus like above enough? I would guess that we need fromListWith' implemented with insertWithKey'.
The problem is that it will build a thunk (x `plus` y) and your seq's only say that whenever you evaluate that thunk you'll evaluate the arguments as well. If the thunk itself isn't needed, it won't be evaluated.
-- Felipe.
-- Eugene Kirpichov
Web IR developer, market.yandex.ru

gpnair78:
I really hope I'm missing some obvious optimization that's making it so slow compared to the perl version, hence this email soliciting feedback.
Here's my first attempt. 1.5s on a 2M line log file in the format you give.

General notes:

* unpack is almost always wrong.
* list indexing with !! is almost always wrong.
* words/lines are often wrong for parsing large files (they build large list structures).
* toList/fromList probably aren't the best strategy
* sortBy (comparing snd)
* use insertWith'

Specifically, avoid constructing intermediate lists when you can process the entire file in a single pass. Use O(1) bytestring substring operations like take and drop.

Compiling:

    $ ghc -O2 /tmp/B.hs --make

Running:

    $ time /tmp/B
    ("GET /url1 HTTP/1.1]",1000000)
    ("GET /url2 HTTP/1.0]",500000)
    /tmp/B  1.38s user 0.21s system 99% cpu 1.595 total

And the code:

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Char8 as L
import qualified Data.Map as M

main = do
    l <- L.readFile "/tmp/x"
    mapM_ print . M.toList $ go l M.empty
  where
    go !s !acc
        | L.null s                 = acc
        | " 08:" `L.isPrefixOf` s1 = go s4 acc'
        | otherwise                = go s4 acc
      where
        s1 = L.drop 11 s  -- drop prefix to timestamp
        -- now extract the key
        (_, s2) = L.breakSubstring "GET" s1
        (k, s3) = L.break ((==) ':') s2
        -- drop the rest of the line
        s4 = L.tail (L.dropWhile ((/=) '\n') s3)
        acc' = M.insertWith' (+) k 1 acc

--- On Sat, 11/7/09, Don Stewart
General notes:
* unpack is almost always wrong.
* list indexing with !! is almost always wrong.
* words/lines are often wrong for parsing large files (they build large list structures).
* toList/fromList probably aren't the best strategy
* sortBy (comparing snd)
* use insertWith'

Specifically, avoid constructing intermediate lists when you can process the entire file in a single pass. Use O(1) bytestring substring operations like take and drop.

Thanks all for the valuable feedback. Switching from Regex.Posix to Regex.PCRE alone reduced the running time to about 6 secs and a few other optimizations suggested on this thread brought it down to about 5 secs ;)

I then set out to profile the code out of curiosity to see where the bulk of the time was being spent, and sure enough the culprit turned out to be "unpack". My question therefore is, given a list L1 of type [(ByteString, Int)], how do I print it out so as to eliminate the "chunk, empty" markers associated with a bytestring? The suggestions posted here are along the lines of "mapM_ print L1" but that's far from desirable especially because the generated output is for perusal by non-technical users etc.

Thanks.

On Tue, Nov 10, 2009 at 8:20 PM, Gokul P. Nair
Thanks all for the valuable feedback. Switching from Regex.Posix to Regex.PCRE alone reduced the running time to about 6 secs and a few other optimizations suggested on this thread brought it down to about 5 secs ;)
I then set out to profile the code out of curiosity to see where the bulk of the time was being spent and sure enough the culprit turned out to be "unpack". My question therefore is, given a list L1 of type [(ByteString, Int)], how do I print it out so as to eliminate the "chunk, empty" markers associated with a bytestring? The suggestions posted here are along the lines of "mapM_ print L1" but that's far from desirable especially because the generated output is for perusal by non-technical users etc.
Thanks.
Take a look at Data.ByteString.Lazy.Char8.putStrLn. That prints a lazy ByteString without unpacking it, and without the internal markers.

Sincerely,
Brad
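For example, a sketch for the [(ByteString, Int)] pairs in question (printPair is a hypothetical name):

import qualified Data.ByteString.Lazy.Char8 as LB

-- one pair per line via LB.putStrLn: no unpack, no Chunk/Empty noise
printPair :: (LB.ByteString, Int) -> IO ()
printPair (k, n) =
    LB.putStrLn (LB.concat [LB.pack "(", k, LB.pack ", ", LB.pack (show n), LB.pack ")"])

-- usage: mapM_ printPair l1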
participants (6)
- Brad Larsen
- Chaddaï Fouché
- Don Stewart
- Eugene Kirpichov
- Felipe Lessa
- Gokul P. Nair