
Hello gentlemen, I am exposed to functional programming for less than a month, and just trying to understand the concepts, so please bear with me. I tried to use Haskell for a simple task on my dayjob, that involved parsing mail system logs and counting distinct addresses (I work for a relatively big ISP). Reducing unnecessary details, let's say that we have a file of N lines, some of them repeating, so that there are only M distinct lines, where M << N. The task is to count the number of times each distinct line appears in the file, and print the most frequent one with its count. This is my program: ======== module Main where import Data.Map main = printMax . (foldr processLine empty) . lines =<< getContents processLine line map = insertWith (\new old -> new + old) line 1 map printMax map = putStrLn $ show $ foldWithKey (\key val accum -> if val > (snd accum) then (key,val) else accum) ("",0) map ======== The thing kinda works on small data sets, but if you feed it with 250,000 lines (1000 distinct), the process size grows to 200 Mb, and on 500,000 lines I get "*** Exception: stack overflow" (using runhaskell from ghc 6.2.4). (For comparison, a perl script does the same job (using global hash) an order of magnitude faster, consumes 3 Mb RAM and can process billion lines without a problem). The question is: what I am doing wrong? Thanks Eugene crosser@pccross:~/src$ perl genlist.pl 500000 1000|time runhaskell distinct.hs *** Exception: stack overflow Command exited with non-zero status 1 27.46user 0.49system 0:29.75elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+34676minor)pagefaults 0swaps crosser@pccross:~/src$ perl genlist.pl 250000 1000|time runhaskell distinct.hs ("a0000000531",300) 36.66user 0.72system 0:54.82elapsed 68%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (522major+50361minor)pagefaults 0swaps crosser@pccross:~/src$ perl genlist.pl 250000 1000|time perl distinct.pl a0000000531: 300 0.22user 0.00system 0:01.14elapsed 20%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (15major+472minor)pagefaults 0swaps

Eugene Crosser wrote:
This is my program: ======== module Main where import Data.Map main = printMax . (foldr processLine empty) . lines =<< getContents processLine line map = insertWith (\new old -> new + old) line 1 map printMax map = putStrLn $ show $ foldWithKey (\key val accum -> if val > (snd accum) then (key,val) else accum) ("",0) map ======== The thing kinda works on small data sets, but if you feed it with 250,000 lines (1000 distinct), the process size grows to 200 Mb, and on 500,000 lines I get "*** Exception: stack overflow"
Your program isn't strict enough. While you expect it to keep a "running total" in the map which is updated with each new line, it really only creates lots of thunks that are only evaluated when the result is demanded. These thunks are as large as the input plus overhead. You have to force the evaluation of intermediate results. To do so, you have to replace foldr by foldl (foldr is just recursion, foldl is accumulator recursion), then use the strict variant of that, and then evaluate all values before putting them into the map. In summary, this should work (untested code, note the use of foldl'): main = printMax . (foldl' processLine empty) . lines =<< getContents processLine map line = let total = findWithDefault 0 line map + 1 in total `seq` insert line total map Yes, this is all terribly non-obvious. It takes time until you see where lazyness is going to hurt you, and you'll easily overlook some such situations. I also think, it's an unfortunate oversight that insertWith is lazy and that there's no way to make it strict as a mere user of Data.Map. Udo. -- "Guy Steele leads a small team of researchers in Burlington, Massachusetts, who are taking on an _enormous_challenge_ -- create a programming language better than Java." -- Sun.Com (emphasis by Paul Graham)

Udo Stenzel wrote:
Eugene Crosser wrote:
This is my program: ======== module Main where import Data.Map main = printMax . (foldr processLine empty) . lines =<< getContents processLine line map = insertWith (\new old -> new + old) line 1 map printMax map = putStrLn $ show $ foldWithKey (\key val accum -> if val > (snd accum) then (key,val) else accum) ("",0) map ========
You have to force the evaluation of intermediate results. To do so, you have to replace foldr by foldl (foldr is just recursion, foldl is accumulator recursion),
Having read "Yet another Haskell tutorial" (note on p.20), doesn't foldl have to read the complete list before it can start processing it (beginning from the last element)? As opposed to foldr that can fetch elements one by one as they are needed? Otherwise, point on strictness taken... Well, apparently the whole deal is even more weird than it happened at the first glance... Eugene

Eugene Crosser wrote:
Having read "Yet another Haskell tutorial" (note on p.20), doesn't foldl have to read the complete list before it can start processing it (beginning from the last element)? As opposed to foldr that can fetch elements one by one as they are needed?
They're complementary. If the result is of a type where partial evaluation is possible (say, a list: between "not evaluated" and "fully evaluated", there are as many intermediate stages of evaluation as there are elements in the list), then foldr is the better choice, as it constructs the output list (or whatever) lazily. (You also need to make sure that the fold parameter function is lazy in the "rest of output" parameter.) If the result is of a type that doesn't allow partial evaluation (an integer, for example: there is no intermediate stage between "not evaluated" and "fully evaluated"), or used in a context where laziness is not a virtue, then it pays to avoid laziness in its evaluation: hence foldl' is the better choice. (You also need to make sure that the fold parameter function is strict in the accumulator parameter.) In elementary (nth-language) Haskell, one is generally trying to learn the stuff about Haskell that is *different* from conventional languages, so in elementary tutorials the rule of thumb "foldr is better" works. It's just one of the half-lies that people get told in elementary courses that one needs to augment in later stages of learning :)

Eugene Crosser wrote:
Having read "Yet another Haskell tutorial" (note on p.20), doesn't foldl have to read the complete list before it can start processing it (beginning from the last element)? As opposed to foldr that can fetch elements one by one as they are needed?
Both foldl and foldr start from the left of the list; dictated by the structure of the list datatype nothing else is possible. The actual difference is that foldl passes an accumulator along and returns the final value of the accumulator. This also means that foldl is tail recursive and foldr isn't. Depending on what you want to do, both combinators can start processing right away. foldr does so only if the folded function is lazy in its second argument (the list constructor is an example of such a function, but Map.insert isn't), foldl' always does so. You cannot get a result from foldl' before the complete input is consumed. If it's still unclear, you should take the definitions of foldr, foldl and foldl' and simulate their reductions by hand on paper. You should see how foldr cannot apply a strict function (like (+)) before the complete(!) list is transformed into a gargantuan thunk, how foldl just plain refuses to apply the obviously needed reduction step and cannot be persuaded to do so and how foldl' is what you want. You'll also see how everything is different for a lazy funktion (like (:)). Udo. -- It is easier to get forgiveness than permission.

Udo Stenzel wrote:
Eugene Crosser wrote:
Having read "Yet another Haskell tutorial" (note on p.20), doesn't foldl have to read the complete list before it can start processing it (beginning from the last element)? As opposed to foldr that can fetch elements one by one as they are needed?
Both foldl and foldr start from the left of the list; dictated by the structure of the list datatype nothing else is possible. The actual difference is that foldl passes an accumulator along and returns the final value of the accumulator. This also means that foldl is tail recursive and foldr isn't.
I think that I get it now. foldl will actually yield any result when it hits the end of the list, while foldr will give you partial result (if partial result makes any sense, that is) after each iteration. And to get any advantage of the latter, you need to be able to consume that "partial result" element-by-element. Right? Anyway, I understand that you used 'seq' in your example as a way to "strictify" the function that updates accumulator. Could you (or anyone) explain (in plain English, preferably:) the reason why 'seq' is the way it is. In the first place, why does it have the first argument at all, and what should you put there? Eugene P.S. just FYI: after the changes, my benchmark program stops growing with the growth of data set, and in compiled form it has the same RAM footprint as the equivalent (interpreted) perl script. Still, it consumes 20 times more CPU... P.P.S. Thanks people, you are really helpful!

Eugene Crosser wrote:
Anyway, I understand that you used 'seq' in your example as a way to "strictify" the function that updates accumulator. Could you (or anyone) explain (in plain English, preferably:) the reason why 'seq' is the way it is. In the first place, why does it have the first argument at all, and what should you put there?
seq returns its second argument without doing anything to it. As a side-effect, it also evaluates (shallowly) its first argument. So, first argument should be what you want to be evaluated, second is what you want seq to return. Note that e `seq` e is useless; it does *not* force the evaluation of e before it would be evaluated in any case.

Eugene Crosser wrote:
Anyway, I understand that you used 'seq' in your example as a way to "strictify" the function that updates accumulator. Could you (or anyone) explain (in plain English, preferably:) the reason why 'seq' is the way it is. In the first place, why does it have the first argument at all
If you write 'seq a b' it means: "Should you need to evaluate b, evaluate (the top constructor of) a first." The example at hand was something like update' key value map = let value' = lookupWithDefault 0 key map in value' `seq` insert key value' map We assume that your program will somehow demand the final value of the map involved, so the 'insert ...' expression will be evaluated at some point. Due to lazy semantics that doesn't mean that value' is evaluated, instead an unevaluated thunk is put into the map to be evaluated once you look it up. Since it is this thunk which takes up all the space, we have to make sure it is evaluated eagerly. That's what the 'seq' does: if evaluation of the map is demanded, value' has to be evaluated before. Notice that there is an application of seq inside of foldl', too. Foldl would build an expression like this: ( insert kn vn ( ... ( insert k2 v2 ( insert k1 v1 empty ) ) ... ) ) Nothing demands the evaluation of the deeply nested part. Foldl' places seq at the appropriate places, so evaluation progresses from the inside out, which is exactly what you need. If you mistakenly used foldl, the 'seq' in the update function would never be triggered. (A single forgotten 'seq' can sometimes ruin everything. This makes "sprinkling seqs until it works" quite frustrating.)
and what should you put there?
I wish I had a good rule of thumb here. Accumulators are a good candidate, the things deep in data structures are good, too, and heap profiling might point you at the right place.
Still, it consumes 20 times more CPU...
Well, that's probably the result of strings being represented as linked lists of unicode characters and Data.Map not being tailored to structured keys. You can make your code faster if you don't care that it gets uglier. Udo. -- If you're not making waves, you're not underway - Adm. Nimitz

On 5/14/06, Eugene Crosser
main = printMax . (foldr processLine empty) . lines =<< getContents [snip] The thing kinda works on small data sets, but if you feed it with 250,000 lines (1000 distinct), the process size grows to 200 Mb, and on 500,000 lines I get "*** Exception: stack overflow" (using runhaskell from ghc 6.2.4).
To elaborate on Udo's point: If you look at the definition of foldr you'll see where the stack overflow is coming from: foldr recurses all the way down to the end of the list, so your stack gets 250k (or attempts 500k) entries deep so it can process the last line in the file first, then unwinds.

martine:
On 5/14/06, Eugene Crosser
wrote: main = printMax . (foldr processLine empty) . lines =<< getContents [snip] The thing kinda works on small data sets, but if you feed it with 250,000 lines (1000 distinct), the process size grows to 200 Mb, and on 500,000 lines I get "*** Exception: stack overflow" (using runhaskell from ghc 6.2.4).
To elaborate on Udo's point: If you look at the definition of foldr you'll see where the stack overflow is coming from: foldr recurses all the way down to the end of the list, so your stack gets 250k (or attempts 500k) entries deep so it can process the last line in the file first, then unwinds.
Also, don't use runhaskell! Compile the code with -O :) -- Don

dons:
martine:
On 5/14/06, Eugene Crosser
wrote: main = printMax . (foldr processLine empty) . lines =<< getContents [snip] The thing kinda works on small data sets, but if you feed it with 250,000 lines (1000 distinct), the process size grows to 200 Mb, and on 500,000 lines I get "*** Exception: stack overflow" (using runhaskell from ghc 6.2.4).
To elaborate on Udo's point: If you look at the definition of foldr you'll see where the stack overflow is coming from: foldr recurses all the way down to the end of the list, so your stack gets 250k (or attempts 500k) entries deep so it can process the last line in the file first, then unwinds.
Also, don't use runhaskell! Compile the code with -O :)
Not sure what processLine does, but just trying out Data.ByteString on this as a test:
import qualified Data.ByteString.Char8 as B import Data.List
main = print . foldl' processLine 0 . B.lines =<< B.getContents where processLine acc l = if B.length l > 10 then acc+1 else acc
Just count the long lines. Probably you do something fancier. Anyway, 32M runs through this in: $ time ./a.out < /home/dons/fps/tests/32M 470400 ./a.out < /home/dons/fps/tests/32M 0.31s user 0.28s system 28% cpu 2.082 total with 32M heap (these are strict byte arrays). Using Data.ByteString.Lazy:
import qualified Data.ByteString.Lazy as B import Data.List
main = print . foldl' processLine 0 . B.split 10 =<< B.getContents where processLine acc l = if B.length l > 10 then acc+1 else acc
$ time ./a.out < /home/dons/fps/tests/32M 470400 ./a.out < /home/dons/fps/tests/32M 0.32s user 0.11s system 26% cpu 1.592 total With only 3M heap used. -- Don
participants (5)
-
Antti-Juhani Kaijanaho
-
dons@cse.unsw.edu.au
-
Eugene Crosser
-
Evan Martin
-
Udo Stenzel