
Hello gentlemen, I am exposed to functional programming for less than a month, and just trying to understand the concepts, so please bear with me. I tried to use Haskell for a simple task on my dayjob, that involved parsing mail system logs and counting distinct addresses (I work for a relatively big ISP). Reducing unnecessary details, let's say that we have a file of N lines, some of them repeating, so that there are only M distinct lines, where M << N. The task is to count the number of times each distinct line appears in the file, and print the most frequent one with its count. This is my program: ======== module Main where import Data.Map main = printMax . (foldr processLine empty) . lines =<< getContents processLine line map = insertWith (\new old -> new + old) line 1 map printMax map = putStrLn $ show $ foldWithKey (\key val accum -> if val > (snd accum) then (key,val) else accum) ("",0) map ======== The thing kinda works on small data sets, but if you feed it with 250,000 lines (1000 distinct), the process size grows to 200 Mb, and on 500,000 lines I get "*** Exception: stack overflow" (using runhaskell from ghc 6.2.4). (For comparison, a perl script does the same job (using global hash) an order of magnitude faster, consumes 3 Mb RAM and can process billion lines without a problem). The question is: what I am doing wrong? Thanks Eugene crosser@pccross:~/src$ perl genlist.pl 500000 1000|time runhaskell distinct.hs *** Exception: stack overflow Command exited with non-zero status 1 27.46user 0.49system 0:29.75elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+34676minor)pagefaults 0swaps crosser@pccross:~/src$ perl genlist.pl 250000 1000|time runhaskell distinct.hs ("a0000000531",300) 36.66user 0.72system 0:54.82elapsed 68%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (522major+50361minor)pagefaults 0swaps crosser@pccross:~/src$ perl genlist.pl 250000 1000|time perl distinct.pl a0000000531: 300 0.22user 0.00system 0:01.14elapsed 20%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (15major+472minor)pagefaults 0swaps