
gpnair78:
I really hope I'm missing some obvious optimization that's making it so slow compared to the perl version, hence this email soliciting feedback.
Here's my first attempt: 1.5s on a 2M line log file in the format you give.

General notes:

* unpack is almost always wrong.
* list indexing with !! is almost always wrong.
* words/lines are often wrong for parsing large files (they build
  large intermediate list structures).
* toList/fromList probably aren't the best strategy.
* sortBy (comparing snd)
* use insertWith'

(Sketches of the last two points, and of the substring point below,
follow the code.)

Specifically, avoid constructing intermediate lists when you can
process the entire file in a single pass, using O(1) bytestring
substring operations like take and drop.

Compiling:

$ ghc -O2 /tmp/B.hs --make

Running:

$ time /tmp/B
("GET /url1 HTTP/1.1]",1000000)
("GET /url2 HTTP/1.0]",500000)
/tmp/B  1.38s user 0.21s system 99% cpu 1.595 total

And the code:

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Char8 as L
import qualified Data.Map as M

main = do
    l <- L.readFile "/tmp/x"
    mapM_ print . M.toList $ go l M.empty
  where
    go !s !acc
        | L.null s                 = acc
        | " 08:" `L.isPrefixOf` s1 = go s4 acc'
        | otherwise                = go s4 acc
      where
        -- drop the fixed-width date prefix, up to the timestamp
        s1      = L.drop 11 s
        -- extract the key: from "GET" up to the next ':'
        (_, s2) = L.breakSubstring "GET" s1
        (k, s3) = L.break (== ':') s2
        -- skip to the start of the next line (L.tail assumes every
        -- line, including the last, ends in '\n')
        s4      = L.tail (L.dropWhile (/= '\n') s3)
        acc'    = M.insertWith' (+) k 1 acc
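A couple of sketches to unpack those notes. The sortBy (comparing snd)
suggestion is about ordering the final tally by count rather than by
key (M.toList yields key order). A minimal sketch; printSorted is a
hypothetical helper name, not part of the program above:

import Data.List (sortBy)
import Data.Ord (comparing)
import qualified Data.ByteString.Char8 as L
import qualified Data.Map as M

-- Print the (key, count) pairs ordered by count instead of by key.
printSorted :: M.Map L.ByteString Int -> IO ()
printSorted = mapM_ print . sortBy (comparing snd) . M.toList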
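On insertWith': it's the strict variant, which forces each updated
count as it goes; with the lazy insertWith, a frequently-hit key
accumulates a long chain of (+1) thunks instead of a plain Int. (In
newer versions of containers the same operation is spelled
Data.Map.Strict.insertWith.) A small self-contained sketch of the
counting pattern, with a hypothetical count function:

import Data.List (foldl')
import qualified Data.Map as M

-- Strict counting: insertWith' evaluates the combined value on every
-- update, so a key hit a million times holds an Int, not a thunk chain.
count :: Ord k => [k] -> M.Map k Int
count = foldl' (\m k -> M.insertWith' (+) k 1 m) M.empty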
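And on the single-pass point: take and drop on strict bytestrings are
O(1) because they return slices of the shared buffer rather than
copying. For instance, assuming the 8-character HH:MM:SS field starts
at column 12 (consistent with the drop 11 and " 08:" test above;
timeOfDay is a hypothetical helper):

import qualified Data.ByteString.Char8 as L

-- O(1) slicing: take/drop just adjust an offset and length into the
-- same underlying buffer. Column 12 for the time field is an
-- assumption based on the " 08:" prefix check in the parser above.
timeOfDay :: L.ByteString -> L.ByteString
timeOfDay = L.take 8 . L.drop 12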