[newbie] processing large logs

13 May 2006

      Hello gentlemen,

I am exposed to functional programming for less than a month, and just
trying to understand the concepts, so please bear with me.

I tried to use Haskell for a simple task on my dayjob, that involved
parsing mail system logs and counting distinct addresses (I work for a
relatively big ISP).

Reducing unnecessary details, let's say that we have a file of N lines,
some of them repeating, so that there are only M distinct lines, where M
<< N.  The task is to count the number of times each distinct line
appears in the file, and print the most frequent one with its count.

This is my program:
========
module Main where
import Data.Map
main = printMax . (foldr processLine empty) . lines =<< getContents
processLine line map = insertWith (\new old -> new + old) line 1 map
printMax map = putStrLn $ show $ foldWithKey
   (\key val accum -> if val > (snd accum) then (key,val) else accum)
       ("",0) map
========
The thing kinda works on small data sets, but if you feed it with
250,000 lines (1000 distinct), the process size grows to 200 Mb, and on
500,000 lines I get "*** Exception: stack overflow" (using runhaskell
from ghc 6.2.4).  (For comparison, a perl script does the same job
(using global hash) an order of magnitude faster, consumes 3 Mb RAM and
can process billion lines without a problem).

The question is: what I am doing wrong?

Thanks
Eugene

crosser@pccross:~/src$ perl genlist.pl 500000 1000|time runhaskell distinct.hs *** Exception: stack overflow
Command exited with non-zero status 1
27.46user 0.49system 0:29.75elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+34676minor)pagefaults 0swaps

crosser@pccross:~/src$ perl genlist.pl 250000 1000|time runhaskell distinct.hs
("a0000000531",300)
36.66user 0.72system 0:54.82elapsed 68%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (522major+50361minor)pagefaults 0swaps

crosser@pccross:~/src$ perl genlist.pl 250000 1000|time perl distinct.pl
a0000000531: 300
0.22user 0.00system 0:01.14elapsed 20%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (15major+472minor)pagefaults 0swaps

Eugene Crosser

Udo Stenzel

Eugene Crosser

Antti-Juhani Kaijanaho

Udo Stenzel

Eugene Crosser

Antti-Juhani Kaijanaho

Udo Stenzel

Evan Martin

dons＠cse.unsw.edu.au

dons＠cse.unsw.edu.au

tags

participants (5)