
creswick:
First off, my apologies for breaking etiquette, if/when I do -- I've only just joined Haskell-cafe, and I'm quite new to Haskell.
I have recently been trying to process a large data set (the 2.8 TB Wikipedia data dump), and also to replace my scripting needs with Haskell (needs previously filled by bash, perl, and bits of Java). Last week I needed to do some quick scanning of the (7zipped) Wikipedia dump to get a feel for the size of articles, and from that determine the best way to process the whole enchilada. Cutting to the chase, I ended up with a 250 MB file of byte offsets and the lines matched by a grep pattern -- specifically, 11 million lines of:
1405: <page>
14062: <page>
15979: <page>
18665: <page>
920680797: <page>
...
2807444041476: <page>
2807444043623: <page>
I needed to know how large the largest <page> elements were, so I'd know whether they would fit in memory, and to get some idea of how many would cause swapping, etc. So I wrote a simple app in Haskell (below) to find the size of each <page> and sort the results. Unfortunately, it consumes an absurd amount of memory (3+ GB) and dies with an out-of-memory error. Given the input size and what the program is doing, this seems ridiculously high -- can anyone help me understand what is going on, and how I can prevent this sort of rampant memory use?
I can provide a link to the input file if anyone wants it, but it doesn't seem particularly useful, given the simplicity and size. Since I needed results fairly quickly, I've re-implemented this in Java, so that reference implementation is also available should anyone want it. (The approach most similar to the Haskell version requires a 1.4 GB heap, but by streaming the string->long parsing, that requirement drops to ~600 MB, which seems pretty reasonable, since the *output* is 215 MB.)
Thanks! Rogan
\begin{code}
-- Compiled with:
-- $ ghc --make offsetSorter.hs
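-- (A minimal sketch of the naive approach described above, assuming
-- String-based parsing of "offset: <page>" lines and a hypothetical
-- input file name "offsets.txt"; the code as originally posted may
-- differ in its details.)
module Main where

import Data.List (sort)

-- Pull the byte offset off the front of a line like "1405: <page>".
parseOffset :: String -> Integer
parseOffset = read . takeWhile (/= ':')

-- A page's size is the distance to the next page's offset.
sizes :: [Integer] -> [Integer]
sizes offs = zipWith (-) (drop 1 offs) offs

main :: IO ()
main = do
  contents <- readFile "offsets.txt"   -- hypothetical file name
  let offs = map parseOffset (lines contents)
  mapM_ print (sort (sizes offs))      -- sort keeps every size resident at once
\end{code}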
Don:
YIKES!! Use the optimizer:

  $ ghc -O2 --make offsetSorter.hs
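Beyond turning on the optimizer, the streaming observation Rogan makes about the Java version applies in Haskell too: if an upper bound on page sizes is enough, a strict left fold over lazily-read lines avoids holding the whole offset list (or a pile of thunks) in memory. A rough sketch, assuming the same "offset: <page>" input format and the same hypothetical offsets.txt file name -- not code from the original thread:

\begin{code}
{-# LANGUAGE BangPatterns #-}
module Main where

import qualified Data.ByteString.Lazy.Char8 as L
import Data.List (foldl')

-- Read the leading decimal offset from a line like "1405: <page>".
offset :: L.ByteString -> Integer
offset ln = case L.readInteger ln of
  Just (n, _) -> n
  Nothing     -> error ("unparseable line: " ++ L.unpack ln)

-- Carry (previous offset, largest gap so far); the bang patterns keep
-- both fields evaluated, so no thunks accumulate during the fold.
step :: (Integer, Integer) -> Integer -> (Integer, Integer)
step (!prev, !best) cur = (cur, max best (cur - prev))

main :: IO ()
main = do
  contents <- L.readFile "offsets.txt"   -- hypothetical file name
  case map offset (L.lines contents) of
    []       -> putStrLn "no offsets found"
    (o : os) -> print (snd (foldl' step (o, 0) os))
\end{code}

Because the fold consumes the lazily-read lines one at a time, memory use stays roughly constant regardless of input size; sorting all 11 million sizes, by contrast, forces the entire list to be resident at once.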