
First off, my apologies for breaking etiquette, if/when I do -- I've only just joined Haskell-cafe, and I'm quite new to Haskell.

I have recently been trying to process a large data set (the 2.8 TB Wikipedia data dump), and also to replace my scripting needs with Haskell (needs previously filled by bash, perl, and bits of Java).

Last week I needed to do some quick scanning of the (7zipped) Wikipedia dump to get a feel for the size of articles, and from that determine the best way to process the whole enchilada. Cutting to the chase, I ended up with a 250 MB file consisting of byte offsets and the lines matched by a grep pattern -- specifically, 11 million lines of:

1405: <page>
14062: <page>
15979: <page>
18665: <page>
920680797: <page>
......
2807444041476: <page>
2807444043623: <page>

I needed to know how large the largest <page> elements were, so I'd know whether they would fit in memory, roughly how many would cause swapping, etc. So I wrote a simple app in Haskell (below) to find the size of each <page> -- just the difference between consecutive offsets, e.g. 14062 - 1405 = 12657 bytes for the first page above -- and sort the results. Unfortunately, it consumes an absurd amount of memory (3+ GB) and dies with an out-of-memory error. Given the input size and what it is doing, this seems ridiculously high -- can anyone help me understand what is going on, and how I can prevent this sort of rampant memory use?

I can provide a link to the input file if anyone wants it, but it doesn't seem particularly useful given its simplicity and size. Since I needed results fairly quickly, I've re-implemented this in Java, so that reference implementation is also available should anyone want it. (The Java approach most similar to the Haskell needs a 1.4 GB heap, but by streaming the string->long parsing, that requirement drops to ~600 MB, which seems pretty reasonable given that the *output* alone is 215 MB.)

Thanks!
Rogan

\begin{code}
-- Compiled with:
-- $ ghc --make offsetSorter.hs
--
-- (ghc v. 6.8.2)
--
-- Run with:
-- $ time ./offsetSorter data/byteOffsets.txt > haskOffsets.txt
-- offsetSorter: out of memory (requested 1048576 bytes)
--
-- real    4m12.130s
-- user    3m4.812s
-- sys     0m5.660s
--
-- (OOM happened after consuming just over 3000 MB virt, 2.6 GB res, according to top.)

import System (getArgs)
import Data.Maybe (catMaybes)
import Text.Printf (printf)
import Data.Function (on)
import Data.List (sort)
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as C8
import qualified Data.ByteString as B

-- Plan:
--  * get the lines
--  * parse each line to get the offset
--  * scan the list of offsets to compute sizes

-- | The full file size:
maxSize :: Integer
maxSize = 2807444044080

-- | A Block is a contiguous chunk of data.
-- The first field is the offset, the second is the length.
data Block = Block { offset :: Integer
                   , size   :: Integer
                   } deriving (Eq)

-- | Ordering of Blocks is based entirely on the block size.
instance Ord Block where
  compare = compare `on` size

instance Show Block where
  show (Block o s) = show o ++ " " ++ show s

-- | Turn the file into a list of offsets.
getOffsets :: ByteString -> [Integer]
getOffsets = catMaybes . map parseOffset . C8.lines

-- | Pull the offset out of a line of the file.
parseOffset :: ByteString -> Maybe Integer
parseOffset s = do
  (i, _) <- C8.readInteger (C8.filter (/= ':') s)
  Just i

-- | Get the differences between consecutive entries in a list.
getSizes :: [Integer] -> [Integer]
getSizes (x:y:[]) = [y - x]
getSizes (x:y:ys) = (y - x) : getSizes (y:ys)
getSizes _        = []   -- fewer than two offsets: nothing to measure

-- | Create a list of Blocks from a file's content.
blocks :: ByteString -> [Block]
blocks s = zipWith Block offsets sizes
  where
    offsets = getOffsets s
    sizes   = getSizes (offsets ++ [maxSize])

main :: IO ()
main = do
  args    <- getArgs
  content <- B.readFile (args !! 0)
  printf "%s" $ unlines $ map show (sort $! blocks content)
\end{code}
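
In case the phrase "streaming the string->long parsing" is unclear, here is a rough sketch of that idea translated back into Haskell, as a separate throwaway program. To be clear, this is purely illustrative: the file and function names are mine, it still builds the whole list before sorting, and I haven't measured whether it actually behaves any better than the version above. It just shows what I mean by turning each offset into a fixed-size machine number (Int64) as soon as it is read, rather than carrying boxed Integers around.

\begin{code}
-- offsetSizes.hs -- sketch only; memory behaviour untested.
import Data.Int (Int64)
import Data.List (sortBy)
import Data.Ord (comparing)
import System.Environment (getArgs)
import qualified Data.ByteString.Char8 as C8

-- | Total size of the dump; closes off the final block.
endOfFile :: Int64
endOfFile = 2807444044080

main :: IO ()
main = do
  [path]  <- getArgs
  content <- C8.readFile path
  -- readInteger stops at the first non-digit, so the ":" and "<page>"
  -- parts of each line are ignored; every offset is converted to an
  -- Int64 immediately.
  let offsets = [ fromIntegral i
                | Just (i, _) <- map C8.readInteger (C8.lines content) ]
      sized   = zipWith (\o next -> (o, next - o))
                        offsets
                        (drop 1 offsets ++ [endOfFile])
  -- Same output format as the Show instance above ("offset size"),
  -- ordered by size.
  mapM_ (\(o, s) -> putStrLn (show o ++ " " ++ show s))
        (sortBy (comparing snd) sized)
\end{code}

The only real difference from offsetSorter.hs is the eager conversion to Int64 and dropping the Block type; the overall shape of the program is the same.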