Re: [Haskell-cafe] [newbie] processing large logs

14 May 2006

dons:
...
martine:
...
On 5/14/06, Eugene Crosser wrote:
...
main = printMax . (foldr processLine empty) . lines =<< getContents
[snip]
The thing kinda works on small data sets, but if you feed it with
250,000 lines (1000 distinct), the process size grows to 200 Mb, and on
500,000 lines I get "*** Exception: stack overflow" (using runhaskell
from ghc 6.2.4).
To elaborate on Udo's point:
If you look at the definition of foldr you'll see where the stack
overflow is coming from:  foldr recurses all the way down to the end
of the list, so your stack gets 250k (or attempts 500k) entries deep
so it can process the last line in the file first, then unwinds.
Also, don't use runhaskell! Compile the code with -O :)
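
To make the foldr point concrete, here is a minimal sketch of the same count written both ways. The original processLine was snipped, so isLong is only a hypothetical stand-in, and the names countLongR and countLongL are made up for this illustration.

```haskell
import Data.List (foldl')

-- Hypothetical stand-in for the snipped processLine: a line is "long"
-- if it has more than 10 characters.
isLong :: String -> Bool
isLong l = length l > 10

-- foldr version: (+) needs its right-hand result first, so nothing is
-- added until the recursion has reached the end of the list. One pending
-- frame per line piles up on the stack along the way.
countLongR :: [String] -> Int
countLongR = foldr (\l acc -> if isLong l then acc + 1 else acc) 0

-- foldl' version: the accumulator is forced at every step, so the list
-- is consumed front to back in constant stack space.
countLongL :: [String] -> Int
countLongL = foldl' (\acc l -> if isLong l then acc + 1 else acc) 0
```

Both give the same answer on small inputs; only the foldr version needs stack proportional to the number of lines. Whether it actually overflows depends on the stack limit of the GHC in use.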
Not sure what processLine does, but just trying out Data.ByteString on
this as a test:
...
import qualified Data.ByteString.Char8 as B
import Data.List
main = print . foldl' processLine 0 . B.lines =<< B.getContents
    where processLine acc l = if B.length l > 10 then acc+1 else acc
This just counts the long lines. You probably do something fancier.

Anyway, a 32M file runs through this in:

    $ time ./a.out < /home/dons/fps/tests/32M
    470400
    ./a.out < /home/dons/fps/tests/32M  0.31s user 0.28s system 28% cpu 2.082 total

with a 32M heap (these are strict byte arrays, so the whole input is held in memory).

Using Data.ByteString.Lazy:
...
import qualified Data.ByteString.Lazy as B
import Data.List
main = print . foldl' processLine 0 . B.split 10  =<< B.getContents
    where processLine acc l = if B.length l > 10 then acc+1 else acc

    $ time ./a.out < /home/dons/fps/tests/32M
    470400
    ./a.out < /home/dons/fps/tests/32M  0.32s user 0.11s system 26% cpu 2.082 total

With only 3M heap used.
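
If the "something fancier" is a per-key tally (the original mentions about 1000 distinct values and a printMax), the same strict left fold carries over. This is only a guess at the shape of the problem: the key choice (first whitespace-separated word) and the names tally and busiest are assumptions, not the original code. It uses Data.Map.Strict from the containers package.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.List (foldl', maximumBy)
import qualified Data.Map.Strict as M
import Data.Ord (comparing)

-- Count how often each key occurs. Assumption: the key is the first
-- whitespace-separated word of a line; lines with no words are skipped.
tally :: [B.ByteString] -> M.Map B.ByteString Int
tally = foldl' step M.empty
  where
    step m l = case B.words l of
      -- B.copy gives the key its own small buffer, so the map does not
      -- keep a reference to the buffer the line was sliced from.
      (k:_) -> M.insertWith (+) (B.copy k) 1 m
      []    -> m

-- The key with the highest count, or Nothing for an empty map.
busiest :: M.Map B.ByteString Int -> Maybe (B.ByteString, Int)
busiest m
  | M.null m  = Nothing
  | otherwise = Just (maximumBy (comparing snd) (M.toList m))
```

It would plug into the earlier program as something like `main = print . busiest . tally . B.lines =<< B.getContents`. The strict map keeps each count evaluated as it goes, so no chain of pending (+1) thunks builds up per key.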

-- Don


dons＠cse.unsw.edu.au