
dons:
martine:
On 5/14/06, Eugene Crosser
wrote: main = printMax . (foldr processLine empty) . lines =<< getContents [snip] The thing kinda works on small data sets, but if you feed it with 250,000 lines (1000 distinct), the process size grows to 200 Mb, and on 500,000 lines I get "*** Exception: stack overflow" (using runhaskell from ghc 6.2.4).
To elaborate on Udo's point: If you look at the definition of foldr you'll see where the stack overflow is coming from: foldr recurses all the way down to the end of the list, so your stack gets 250k (or attempts 500k) entries deep so it can process the last line in the file first, then unwinds.
Also, don't use runhaskell! Compile the code with -O :)
Not sure what processLine does, but just trying out Data.ByteString on this as a test:
import qualified Data.ByteString.Char8 as B import Data.List
main = print . foldl' processLine 0 . B.lines =<< B.getContents where processLine acc l = if B.length l > 10 then acc+1 else acc
Just count the long lines. Probably you do something fancier. Anyway, 32M runs through this in: $ time ./a.out < /home/dons/fps/tests/32M 470400 ./a.out < /home/dons/fps/tests/32M 0.31s user 0.28s system 28% cpu 2.082 total with 32M heap (these are strict byte arrays). Using Data.ByteString.Lazy:
import qualified Data.ByteString.Lazy as B import Data.List
main = print . foldl' processLine 0 . B.split 10 =<< B.getContents where processLine acc l = if B.length l > 10 then acc+1 else acc
$ time ./a.out < /home/dons/fps/tests/32M 470400 ./a.out < /home/dons/fps/tests/32M 0.32s user 0.11s system 26% cpu 1.592 total With only 3M heap used. -- Don