
creswick:
First off, my apologies for breaking etiquette, if/when I do -- I've only just joined Haskell-cafe, and I'm quite new to Haskell.
I have recently been trying to process a large data set (the 2.8 TB Wikipedia data dump), and also to replace my scripting needs with Haskell (needs previously filled by bash, perl, and bits of Java). Last week I needed to do some quick scanning of the (7zipped) Wikipedia dump to get a feel for the size of articles, and from that determine the best way to process the whole enchilada. Cutting to the chase, I ended up with a 250 MB file of byte offsets and the lines matched by a grep pattern -- specifically, 11 million lines of:
1405: <page>
14062: <page>
15979: <page>
18665: <page>
920680797: <page>
...
2807444041476: <page>
2807444043623: <page>
I needed to know how large the largest <page> elements were, so I'd know whether they would fit in memory, and to get some idea of how many would cause swapping, etc. So I wrote a simple app in Haskell (below) to find the size of each <page> and sort the results. Unfortunately, it consumes an absurd amount of memory (3+ GB) and dies with an out-of-memory error. Given the input size and what the program is doing, this seems ridiculously high -- can anyone help me understand what is going on, and how I can prevent this sort of rampant memory use?
I can provide a link to the input file if anyone wants it, but it doesn't seem particularly useful, given the simplicity and size. Since I needed results fairly quickly, I've re-implemented this in Java, so that reference implementation is also available should anyone want it. (The approach most similar to the Haskell version requires a 1.4 GB heap, but by streaming the string->long parsing, that requirement drops to ~600 MB, which seems pretty reasonable, since the *output* is 215 MB.)
Thanks! Rogan
\begin{code}
-- Compiled with:
-- $ ghc --make offsetSorter.hs
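-- (A minimal sketch of the naive approach described above, assuming
-- String-based parsing of "offset: <page>" lines and a hypothetical
-- input file name "offsets.txt"; the code as originally posted may
-- differ in its details.)
module Main where

import Data.List (sort)

-- Pull the byte offset off the front of a line like "1405: <page>".
parseOffset :: String -> Integer
parseOffset = read . takeWhile (/= ':')

-- A page's size is the distance to the next page's offset.
sizes :: [Integer] -> [Integer]
sizes offs = zipWith (-) (drop 1 offs) offs

main :: IO ()
main = do
  contents <- readFile "offsets.txt"   -- hypothetical file name
  let offs = map parseOffset (lines contents)
  mapM_ print (sort (sizes offs))      -- sort keeps every size resident at once
\end{code}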
Don:
YIKES!! Use the optimizer:

  $ ghc -O2 --make offsetSorter.hs
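Beyond turning on the optimizer, the streaming observation Rogan makes about the Java version applies in Haskell too: if an upper bound on page sizes is enough, a strict left fold over lazily-read lines avoids holding the whole offset list (or a pile of thunks) in memory. A rough sketch, assuming the same "offset: <page>" input format and the same hypothetical offsets.txt file name -- not code from the original thread:

\begin{code}
{-# LANGUAGE BangPatterns #-}
module Main where

import qualified Data.ByteString.Lazy.Char8 as L
import Data.List (foldl')

-- Read the leading decimal offset from a line like "1405: <page>".
offset :: L.ByteString -> Integer
offset ln = case L.readInteger ln of
  Just (n, _) -> n
  Nothing     -> error ("unparseable line: " ++ L.unpack ln)

-- Carry (previous offset, largest gap so far); the bang patterns keep
-- both fields evaluated, so no thunks accumulate during the fold.
step :: (Integer, Integer) -> Integer -> (Integer, Integer)
step (!prev, !best) cur = (cur, max best (cur - prev))

main :: IO ()
main = do
  contents <- L.readFile "offsets.txt"   -- hypothetical file name
  case map offset (L.lines contents) of
    []       -> putStrLn "no offsets found"
    (o : os) -> print (snd (foldl' step (o, 0) os))
\end{code}

Because the fold consumes the lazily-read lines one at a time, memory use stays roughly constant regardless of input size; sorting all 11 million sizes, by contrast, forces the entire list to be resident at once.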