I was writing a simple utility and I decided to use regexps to parse filenames. (I know, now I have two problems :-) )

I was surprised at how slowly it ran, so I did a profiling build. The profiling build runs reasonably quickly, about 7x faster than the normal build, which makes it hard to figure out where the slowdown is happening in the non-profiled code. I’m wondering if I’m doing something wrong, or if there’s a bug in regex-tdfa or in ghc.

I’ve pared my code down to just the following:

import Text.Regex.TDFA ((=~))

main :: IO ()
main = do
    entries <- map parseFilename . lines <$> getContents
    let check (Right (_, t)) = last t == 'Z'
        check _ = False
    print $ all check entries

parseFilename :: String -> Either String (String, String)
parseFilename fn = case (fn =~ pattern :: [[String]]) of
    [[_, full, _, time]] -> Right (full, time)
    _ -> Left fn
  where
    pattern =
        "^\\./duplicity-(full|inc|new)(-signatures)?\\.\
        \([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]T[0-9][0-9][0-9][0-9][0-9][0-9]Z)\\."

The relevant part of my .cabal file looks like this:

executable DuplicityAnalyzer
    main-is: DuplicityAnalyzer.hs
    build-depends:
        base >=4.6 && <4.11,
        regex-tdfa >= 1.0 && <1.3
    default-language: Haskell2010
    ghc-options: -Wall -rtsopts

To build and run with profiling, I do:

cabal clean
cabal configure --enable-profiling
cabal build
dist/build/DuplicityAnalyzer/DuplicityAnalyzer <names.in +RTS -sprofiling-summary.out -p

The MUT time in the non-profiling build is about 7x larger than in the profiling build, and the %GC time goes from 8% to 21%. I’ve put the actual output in a gist. I’ve also put my test input file there, in case anyone wants to try this themselves.

I’ve done my testing with NixOS (ghc 8.0.2) and Debian with the Haskell Platform (ghc 8.2.1) and the results are basically the same. I even tried using Docker containers with Debian Jessie and Debian Stretch, just to eliminate any OS influence, and the results are still the same. I’ve tried it on an i5-2500K, i5-3317U and Xeon E5-1620.

I also wrote a dummy implementation of =~ that ignores the regex pattern and does a hard-coded manual parse, and that produces times just slightly better than the profiled ones. So I don’t think there’s a problem in my outer code that uses =~.
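
For anyone who wants to reproduce that comparison, a hand-written stand-in could look roughly like the sketch below. It is an illustration only, not the exact dummy code behind the timings above, and the names parseFilenameManual, firstPrefix and isTimestamp are made up for this example. It has the same type as parseFilename and is meant to accept the same filenames as the regex, using only base.

```haskell
import Data.Char (isDigit)
import Data.List (stripPrefix)
import Data.Maybe (fromMaybe, listToMaybe)

-- Hypothetical regex-free stand-in for parseFilename (same type).
-- Returns (kind, timestamp) on success, or the untouched filename on failure.
parseFilenameManual :: String -> Either String (String, String)
parseFilenameManual fn = maybe (Left fn) Right $ do
    rest0 <- stripPrefix "./duplicity-" fn
    (kind, rest1) <- firstPrefix ["full", "inc", "new"] rest0
    -- "-signatures" is optional, so fall back to rest1 when it is absent.
    let rest2 = fromMaybe rest1 (stripPrefix "-signatures" rest1)
    rest3 <- stripPrefix "." rest2
    -- The timestamp is 8 digits, 'T', 6 digits, 'Z': 16 characters in total.
    let (stamp, rest4) = splitAt 16 rest3
    if isTimestamp stamp && take 1 rest4 == "."
        then Just (kind, stamp)
        else Nothing

-- Try each candidate prefix in order; return the first that matches,
-- together with the remainder of the input.
firstPrefix :: [String] -> String -> Maybe (String, String)
firstPrefix ps s = listToMaybe [(p, r) | p <- ps, Just r <- [stripPrefix p s]]

-- True only for strings shaped like 20170101T120000Z.
isTimestamp :: String -> Bool
isTimestamp s = case splitAt 8 s of
    (date, 'T' : rest) -> case splitAt 6 rest of
        (time, "Z") -> length date == 8 && all isDigit date
                    && length time == 6 && all isDigit time
        _ -> False
    _ -> False
```

Swapping parseFilenameManual in for parseFilename in main leaves the rest of the program unchanged, so any remaining difference in run time should come from the regex matching itself.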