More fun with micro-benchmarks and optimizations. (GHC vs Perl)

I have the need to regularly write tiny programs that analyze output logs. The output logs don't have consistent formatting, so I typically choose Perl for these tasks.

The latest instance of having to write such a program was simple enough that I figured I'd try my hand at using Haskell instead. The short story is that I could quickly write a Haskell program that achieved the goal. Yay! However, the performance was ~8x slower than a comparable Perl implementation. With a lot of effort I got the Haskell version to only 2x slower. A lot of the optimization was done with guesses that the performance difference was due to how each line was being read from the file. I couldn't determine much using GHC's profiler.

I still have two questions after all this:
- Can I get a Haskell implementation as fast as the Perl?
- What do I need to do to get GHC's profiler to provide me usable information? Telling me that 98% of the time was in "main" is not very enlightening ;-)

All the code and data for this little experiment I've placed here: http://tothepowerofdisco.com/repo/sum_optimization/ My first, and shortest, version is SumTiny.hs. The optimized version is SumFast.hs.

The long version for the curious:

The (cleaned up) data was a 78k-line file consisting of lines like the following:

framerate (prev == no pts): 15
framerate (delta): 25
framerate (invalid timebase): 12.5
... and so on.

The need was for a program that calculated 1.0 / framerate for each line and produced the sum. Easy, no?

My straightforward Haskell solution was:

-------------------------------------------------------------------
import Text.Regex.Posix

main = do
    f_lines <- readFile "test.out" >>= return . lines
    let duration = foldl add_line 0.0 f_lines
        add_line sum line =
            let [[_,m]]   = line =~ "([0-9.]+)"
                framerate = read m
                delta     = 1.0 / framerate
            in sum + delta
    putStrLn $ "Duration (sec): " ++ show duration
-------------------------------------------------------------------

Just for fun I decided to compare it to a Perl implementation:

-------------------------------------------------------------------
#!/usr/bin/perl
my @f_lines = split(/\n/,`cat test.out`);
my $sum = 0.0;
foreach(@f_lines)
{
    /([0-9.]+)/;
    my $delta = 1.0 / $1;
    $sum += $delta;
}
print("Duration (sec): ", $sum, "\n");
-------------------------------------------------------------------

(I'm sure there are other ways to write this same program in both languages.)

I was pretty happy with how the Haskell implementation's code compared to the Perl implementation's just in terms of looks, though I think the Perl implementation is easier to understand at the point where the regex match is extracted.

For fun I compared the performance of the two (64-bit Linux running on a 2.4GHz Core2Duo):

$ time perl ./Sum.pl
Duration (sec): 3155.62666666438

real    0m0.121s
user    0m0.103s
sys     0m0.016s

$ time ./SumTiny
Duration (sec): 3155.626666664377

real    0m1.099s
user    0m1.073s
sys     0m0.009s

Youch! ~1s is fast enough that I don't care, but I'm still curious why there is an 8x performance difference.

Profiling with manual cost center annotations (see SumTinyProf.hs) indicated the "do" expression that main is equal to was responsible for 90% of the time. Which isn't revealing at all!

Some experimenting led me to find a 2x-slower implementation (SumFast.hs). Quick notes on the changes I made:
- Used ByteStrings instead of String.
- Used hGetLine instead of reading the entire file lazily and splitting on "\n".
- Fused the foldl' with the loop that reads each line from the file.
- Used bang patterns to make the fused loop strict in the accumulator argument.

I think the largest performance gain was from changing how each line was read from the file. The lazy read and split seemed very slow compared to a loop that used hGetLine.

--
-Corey O'Connor
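SumFast.hs itself is not reproduced in the thread. As a rough illustration only, a fused, strict hGetLine loop of the kind described above might look something like the following sketch; the actual SumFast.hs at the URL above may differ:

-------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
-- Sketch only: strict ByteString, hGetLine loop, strict accumulator.
-- Assumes every line has the "name: value" shape described above.
import qualified Data.ByteString.Char8 as B
import System.IO

main :: IO ()
main = do
    h <- openFile "test.out" ReadMode
    let go !acc = do
            eof <- hIsEOF h
            if eof
                then return acc
                else do
                    line <- B.hGetLine h
                    -- everything after the ": " is the framerate
                    let numStr    = B.drop 2 (B.dropWhile (/= ':') line)
                        framerate = read (B.unpack numStr) :: Double
                    go (acc + 1.0 / framerate)
    duration <- go 0.0
    putStrLn $ "Duration (sec): " ++ show duration
-------------------------------------------------------------------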

coreyoconnor:
I have the need to regularly write tiny programs that analyze output logs. The output logs don't have a consistent formatting so I typically choose Perl for these tasks.
The latest instance of having to write such a program was simple enough I figured I'd try my hand at using Haskell instead. The short story is that I could quickly write a Haskell program that achieved the goal. Yay! However, the performance was ~8x slower than a comparable Perl implementation. With a lot of effort I got the Haskell version to only 2x slower. A lot of the optimization was done with guesses that the performance difference was due to how each line was being read from the file. I couldn't determine much using GHC's profiler.
Could you try using lazy and/or strict bytestrings? For *any* IO performance issue, that should be your first stop.

See, e.g., this sum-file example,

http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=ghc&id=6

which outperforms C on the benchmark.

The second point would be to avoid Text.Regex.Posix, which is algorithmically much worse than the PCRE regexes used by Perl. Instead, try the pcre-light or regex-pcre packages, which have both better complexity and operate on efficient bytestrings.

Finally, there are some small details about writing efficient loops: give type declarations for atomic types like Int and Double, and use strict folds or explicit recursion, as in this post,

http://cgi.cse.unsw.edu.au/~dons/blog/2008/05/16#fast

For your task here, with those simple changes, it should be relatively easy to produce competitive-with-C performance.

-- Don
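As a rough sketch of how those three suggestions might fit together (strict bytestrings, pcre-light in place of Text.Regex.Posix, and a strict fold) — this is not code from the thread, and the pcre-light calls shown (compile, match) are quoted from memory:

-------------------------------------------------------------------
-- Sketch only: strict ByteString input, pcre-light regex, strict fold.
import qualified Data.ByteString.Char8 as B
import Text.Regex.PCRE.Light (compile, match)
import Data.List (foldl')

main :: IO ()
main = do
    contents <- B.readFile "test.out"
    let regex = compile (B.pack "([0-9.]+)") []
        -- foldl' forces the accumulator at every step
        addLine :: Double -> B.ByteString -> Double
        addLine acc line = case match regex line [] of
            Just (_whole : m : _) -> acc + 1.0 / read (B.unpack m)
            _                     -> acc
    print (foldl' addLine 0.0 (B.lines contents))
-------------------------------------------------------------------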

Sounds great! Thanks for the advice. :-)
-Corey
-- -Corey O'Connor

coreyoconnor:
I have the need to regularly write tiny programs that analyze output logs. The output logs don't have a consistent formatting so I typically choose Perl for these tasks.
The latest instance of having to write such a program was simple enough I figured I'd try my hand at using Haskell instead. The short story is that I could quickly write a Haskell program that achieved the goal. Yay! However, the performance was ~8x slower than a comparable Perl implementation. With a lot of effort I got the Haskell version to only 2x slower. A lot of the optimization was done with guesses that the performance difference was due to how each line was being read from the file. I couldn't determine much using GHC's profiler.
{-# OPTIONS -fbang-patterns #-}

import qualified Data.ByteString.Char8 as S
import Data.ByteString.Lex.Double
import Debug.Trace

main = print . go 0 =<< S.getContents
  where
    go !n !s = case readDouble str of
                 Nothing    -> n
                 Just (k,t) -> let delta = 1.0 / k in go (n+delta) t
      where
        (_, xs) = S.break ((==) ':') s
        str     = S.drop 2 xs

It uses the bytestring-lexing package on Hackage to read the Doubles out,

$ ghc --make Fast.hs -O2 -fvia-C -optc-O2
$ time ./Fast < test.out
3155.626666664377
./Fast < test.out  0.07s user 0.01s system 97% cpu 0.078 total

So that's twice as fast as the perl entry on my box,

$ time perl Sum.pl < test.out
Duration (sec): 3155.62666666438
perl Sum.pl < test.out  0.15s user 0.03s system 100% cpu 0.180 total

Note that the safe Double lexer uses a bit of copying, so we can in fact do better still with a non-copying Double parser, but that's only for the hardcore.

-- Don

And against gawk 3.1.5:
$ time awk -F: '{sum += 1 / $2} END{print sum}' test.out
3155.63
real 0m0.197s
user 0m0.184s
sys 0m0.004s
compared to Don's Haskell version:
$ time ./fastSum < test.out
3155.626666664377
real 0m0.072s
user 0m0.056s
sys 0m0.004s
compared to Corey's perl version:
$ time perl Sum.pl
Duration (sec): 3155.62666666438
real 0m0.181s
user 0m0.164s
sys 0m0.012s
Regards,
Brad Larsen

Another variant, using a non-copying readDouble (available in bytestring-lexing package 0.1.2),

{-# OPTIONS -fbang-patterns #-}

import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Unsafe as S
import Data.ByteString.Lex.Double

main = print . go 0 =<< S.getContents
  where
    go :: Double -> S.ByteString -> Double
    go !n !s = case unsafeReadDouble s' of
                 Nothing    -> n
                 Just (k,t) -> let delta = 1 / k in go (n+delta) t
      where
        s' = case S.elemIndex ':' s of
               Nothing -> S.empty
               Just i  -> S.unsafeDrop (i+2) s

Computes in:

$ time ./Fast < test.out
3155.626666664377
./Fast < test.out  0.02s user 0.00s system 80% cpu 0.025 total

So that's probably a good place to stop. Note the general rules:

* strict bytestrings
* tight tail-recursive loops with strict, atomic accumulators
* non-copying parsing.

-- Don

On Wed, Jul 23, 2008 at 11:10 AM, Corey O'Connor
I still have two questions after all this:
- Can I get a Haskell implementation as fast as the Perl?
- What do I need to do to get GHC's profiler to provide me usable information? Telling me that 98% of the time was in "main" is not very enlightening ;-)
Don did a great job answering the first question.

To get better profiling information, add "SCC" annotations to your code. That will tell you how much time is spent in a particular expression. For your main function, take a divide and conquer approach - add SCCs to determine what is taking the most time.

Take Don's advice first, though, before benchmarking - you probably won't care after switching to bytestring & a better regex engine.

Justin
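As a rough sketch only (not Corey's actual SumTinyProf.hs), SCC annotations on the original program might look like the following, assuming the program is built with profiling enabled (e.g. ghc -prof -auto-all) and run with +RTS -p:

-------------------------------------------------------------------
-- Sketch only: manually named cost centres on the hot expressions.
import Text.Regex.Posix

main :: IO ()
main = do
    f_lines <- readFile "test.out" >>= return . lines
    let add_line sum line =
            let [[_,m]]   = {-# SCC "regex" #-} (line =~ "([0-9.]+)")
                framerate = {-# SCC "read"  #-} (read m :: Double)
                delta     = {-# SCC "div"   #-} (1.0 / framerate)
            in sum + delta
        duration = {-# SCC "fold" #-} foldl add_line 0.0 f_lines
    putStrLn $ "Duration (sec): " ++ show duration
-------------------------------------------------------------------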

On Wed, Jul 23, 2008 at 2:23 PM, Justin Bailey
I did write a version of SumTiny.hs with SCC annotations: http://tothepowerofdisco.com/repo/sum_optimization/SumTinyProf.hs

Maybe I'm missing something, but the profiling still gives me poor results. Here are the results from the profiler (cut down and, no doubt, with poor post-email formatting):

-------------------------------------------------------------------
COST CENTRE      %time  %alloc

do                92.8    87.6
regex              2.8     4.8
add_line           2.2     0.4
div                1.1     0.3
readFile           0.6     6.2

                                individual       inherited
COST CENTRE                    %time  %alloc   %time  %alloc

MAIN CAF                         0.0     0.0   100.0   100.0
 main                            0.0     0.0   100.0   100.0
  do                            92.8    87.6   100.0   100.0
   show                          0.0     0.0     0.0     0.0
   foldl                         0.0     0.3     0.0     0.3
   add_line                      2.2     0.4     6.6     5.9
    regex                        2.8     4.8     2.8     4.8
    sum                          0.0     0.1     0.0     0.1
    div                          1.1     0.3     1.1     0.3
    read Double                  0.6     0.2     0.6     0.2
   readFile                      0.6     6.2     0.6     6.2
-------------------------------------------------------------------

Which tells me the "do" expression that main is equal to is responsible for 92.8% of the run time. Which really doesn't tell me anything.

--
-Corey O'Connor

On Wed, Jul 23, 2008 at 6:10 PM, Corey O'Connor
My straightforward Haskell solution was:

-------------------------------------------------------------------
import Text.Regex.Posix

main = do
    f_lines <- readFile "test.out" >>= return . lines
    let duration = foldl add_line 0.0 f_lines
        add_line sum line =
            let [[_,m]]   = line =~ "([0-9.]+)"
                framerate = read m
                delta     = 1.0 / framerate
            in sum + delta
    putStrLn $ "Duration (sec): " ++ show duration
-------------------------------------------------------------------
I'm sure the other suggestions have improved performance far more than this suggestion will, but it's still worth a shot.

I don't think I've _ever_ seen a case where using foldl was a good idea. foldl interacts poorly with laziness; use foldr when the function you are folding is non-strict, and use foldl' (from Data.List), a strict version of foldl, when the function you are folding is strict. This is, of course, only a rule of thumb.

Changing that foldl to foldl' will probably speed this up a decent amount.

Luke
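A minimal sketch of just that change applied to the original SumTiny.hs; nothing else is altered, though the explicit Double annotation is an extra assumption in line with Don's earlier advice:

-------------------------------------------------------------------
-- Sketch only: SumTiny.hs with the lazy foldl swapped for strict foldl'.
import Data.List (foldl')
import Text.Regex.Posix

main :: IO ()
main = do
    f_lines <- readFile "test.out" >>= return . lines
    let add_line sum line =
            let [[_,m]]   = line =~ "([0-9.]+)"
                framerate = read m :: Double
            in sum + 1.0 / framerate
        duration = foldl' add_line 0.0 f_lines   -- strict left fold
    putStrLn $ "Duration (sec): " ++ show duration
-------------------------------------------------------------------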
participants (5)
- Brad Larsen
- Corey O'Connor
- Don Stewart
- Justin Bailey
- Luke Palmer