
Replying to all replies at once:
Malcolm Wallace At work, we have a strict version of Haskell :-) which proofs that it is worth thinking about it.
Ertugrul If you want to save the time to learn how to write efficient Haskell programs, you may want to have a look into the Disciple language. Yes - I will. Its on my TODO list for at least 12 month :( Not sure whether there are parser combinator libraries yet.
@ Herbert Valerio Riedel (suggesting aeson) I gave it yet another try - see below. It still fails. @ Felipe Almeida Lessa (suggesting conduits and atto parsec) I mentioned that I already tried it. Counting lines only was a lot slower than counting lines and parsing JSON using PHP. @ Chris Wong
flag is that strictness isn't just something you turn on and off willy-nilly You're right. But those features are not required for all tasks :) Eg have a look at Data.Map. Would a strict implementation look that different?
I came up with this now. Trying strict bytestrings and Aeson. note the LB.fromChunks usage below to turn a strict into a lazy bytestring. Result: PHP script doing the same runs in 33secs (using the AFindCheckoutsAndContacts branch) The haskell program below - I stopped it after 8 min. (At least it does no longer cause my system to swap .. You're right: I could start profiling. I could learn about how to optimize it.) But why? The conclusion is that if I ever use yesod - and if I ever want to analyse logs - I should call PHP from yesod as external process !? :-( Even if I had a core i7 (8 threads in parallel) I still would not be sure whether Haskell would be the right choice. I agree that I assume that all data fits into memory so that piecewise evaluation doesn't matter too much. Thanks for all your help - it proofs once again that the Haskell community is the most helpful I know about. Yet I think me being the programmer Haskell is the wrong choice for this task. Thanks Marc Weber my new attempt - now I should start profiling.. Looks like I haven't built all libs with profiling support .. import Data.Aeson.Types import Data.Aeson import Data.List import Control.Applicative import Debug.Trace import qualified Data.Map as M import Action import Data.Aeson.Parser as AP import qualified Data.ByteString.Lazy as LB import qualified Data.ByteString as BS import qualified Data.ByteString.Char8 as LBC8 data Action = ACountLine | AFindCheckoutsAndContacts -- lines look like this: -- {"id":"4ee535f01550c","r":"","ua":"Mozilla\/5.0 (compatible; bingbot\/2.0; +http:\/\/www.bing.com\/bingbot.htm)","t":1323644400,"k":"XXX","v":"YY"} data Item = Item { id :: SB.ByteString, ua :: SB.ByteString, t :: Int, k :: SB.ByteString, v :: SB.ByteString } instance FromJSON Item where parseJSON (Object v) = Item <$> v .: "id" <*> v .: "ua" <*> v .: "t" <*> v .: "k" <*> v .: "v" parseJSON _ = empty run :: Action -> [FilePath] -> IO () run AFindCheckoutsAndContacts files = do -- find all ids quering the server more than 100 times. -- do so by building a map counting queries (lines :: [BS.ByteString]) <- fmap (concat . (map LBC8.lines) ) $ mapM BS.readFile files let processLine :: (M.Map BS.ByteString Int) -> BS.ByteString -> (M.Map BS.ByteString Int) processLine st line = case decode' (LB.fromChunks [line]) of Nothing -> st -- traceShow ("bad line " ++ (LBC8.unpack line)) st Just (Item id ua t k v) -> M.insertWith (+) k 1 st let count_by_id = foldl' processLine (M.empty) lines mapM_ print $ M.toList $ M.filter (> 100) count_by_id