
First off, my apologies for breaking etiquette, if/when I do -- I've only just joined Haskell-cafe, and I'm quite new to Haskell.

I have recently been trying to process a large data set (the 2.8tb wikipedia data dump), and also replace my scripting needs with Haskell (needs that have previously been filled with bash, perl, and bits of Java). Last week I needed to do some quick scanning of the (7zipped) wikipedia dump to get a feel for the size of articles, and from that determine the best way to process the whole enchilada... cutting to the chase, I ended up with a file consisting of byte offsets and lines matched by a grep pattern (a 250mb file). Specifically, 11m lines of:

1405: <page>
14062: <page>
15979: <page>
18665: <page>
920680797: <page>
......
2807444041476: <page>
2807444043623: <page>

I needed to know how large the largest <page> elements were, so I'd know if they would fit in memory, and have some idea of how many would cause swapping, etc. So, I wrote a simple app in Haskell (below) to find the sizes of each <page> and sort them. Unfortunately, it consumes an absurd amount of memory (3+gb) and dies with an out-of-memory error. Given the input size, and what it is doing, this seems ridiculously high -- can anyone help me understand what is going on, and how I can prevent this sort of rampant memory use?

I can provide a link to the input file if anyone wants it, but it doesn't seem particularly useful, given the simplicity and size. Since I needed to get results fairly quickly, I've re-implemented this in Java, so that reference implementation is also available should anyone want it. (The approach that is most similar to the Haskell requires a 1.4gb heap, but by streaming the string->long parsing, that requirement drops to ~600mb, which seems pretty reasonable, since the *output* is 215mb.)

Thanks!
Rogan

\begin{code}
-- Compiled with:
-- $ ghc --make offsetSorter.hs
--
-- (ghc v. 6.8.2)
--
-- Run with:
-- $ time ./offsetSorter data/byteOffsets.txt > haskOffsets.txt
-- offsetSorter: out of memory (requested 1048576 bytes)
--
-- real    4m12.130s
-- user    3m4.812s
-- sys     0m5.660s
--
-- (OOM happened after consuming just over 3000mb of Virt, 2.6gb Res,
-- according to top.)

import System (getArgs)
import Data.Maybe
import Monad
import Text.Printf (printf)
import Data.Function (on)
import Data.List (sort)
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as C8
import qualified Data.ByteString as B

-- get the lines
-- parse each line to get the offset.
-- scan the list of offsets

-- | The full file size:
maxSize :: Integer
maxSize = 2807444044080

-- | Block is a contiguous chunk of data.
-- The first entry is the offset, the second is the length.
data Block = Block { offset :: Integer
                   , size   :: Integer
                   } deriving (Eq)

-- | Ordering of Blocks is based entirely on the block size.
instance Ord Block where
  compare = compare `on` size

instance Show Block where
  show (Block o s) = (show o) ++ " " ++ (show s)

-- turn the file into a list of offsets:
getOffsets :: ByteString -> [Integer]
getOffsets = catMaybes . map parseOffset . C8.lines

-- | Pull out the offsets from a line of the file.
parseOffset :: ByteString -> Maybe Integer
parseOffset s = do
  (i, _) <- C8.readInteger (C8.filter (/=':') s)
  Just i

-- | Get the offsets between entries in a list
getSizes :: [Integer] -> [Integer]
getSizes (x:y:[]) = [y - x]
getSizes (x:y:ys) = (y - x):(getSizes (y:ys))

-- | Creates and returns a list of Blocks, given a file's content.
blocks :: ByteString -> [Block]
blocks s = zipWith (Block) offsets sizes
  where
    offsets = getOffsets s
    sizes   = getSizes (offsets ++ [maxSize])

main :: IO ()
main = do
  args    <- getArgs
  content <- B.readFile (args!!0)
  printf "%s" $ unlines $ map (show) (sort $! blocks content)
\end{code}

creswick:
\begin{code} -- Compiled with: -- $ ghc --make offsetSorter.hs
YIKES!! Use the optimizer!

    ghc -O2 --make

-- Don

On Fri, Feb 27, 2009 at 2:20 PM, Don Stewart wrote:
creswick:
\begin{code} -- Compiled with: -- $ ghc --make offsetSorter.hs
YIKES!! Use the optimizer!
ghc -O2 --make
Ah, that did drastically cut the amount of time it takes to run out of memory (down to 1:23), but unfortunately I can't see any other improvements -- the memory consumed seems to be about the same. (granted, I have no indication of progress -- it may be getting significantly more done, but it's not quite over the hump and producing output yet.) --Rogan

creswick:
On Fri, Feb 27, 2009 at 2:20 PM, Don Stewart wrote:
creswick:
\begin{code} -- Compiled with: -- $ ghc --make offsetSorter.hs
YIKES!! Use the optimizer!
ghc -O2 --make
Ah, that did drastically cut the amount of time it takes to run out of memory (down to 1:23), but unfortunately I can't see any other improvements -- the memory consumed seems to be about the same. (granted, I have no indication of progress -- it may be getting significantly more done, but it's not quite over the hump and producing output yet.)
Ok. Now, profile!  (ghc -O2 -prof -auto-all --make)

http://book.realworldhaskell.org/read/profiling-and-optimization.html
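Concretely, that workflow looks roughly like this. Only the -O2 -prof -auto-all flags come from Don's message; the RTS options and output file names below are GHC's standard profiling behaviour, listed here as a reminder rather than as part of the thread:

$ ghc -O2 -prof -auto-all --make offsetSorter.hs
$ ./offsetSorter data/byteOffsets.txt +RTS -p -RTS > haskOffsets.txt
    (writes a time/allocation report, by cost centre, to offsetSorter.prof)
$ ./offsetSorter data/byteOffsets.txt +RTS -hc -RTS > haskOffsets.txt
$ hp2ps -c offsetSorter.hp
    (renders the heap profile to offsetSorter.ps)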

Hello Don, Saturday, February 28, 2009, 2:18:37 AM, you wrote:
offset :: !Integer
oh yes
And possibly just using {-# UNPACK #-}!Int64 would be ok?
i think that it will be even better, but the main problem is the huge unevaluated thunks. as the last hope, this may be converted to

x <- getOffsets
y <- getSizes x
z <- sort y

also, "zipWith (Block) offsets sizes" and getSizes may be combined to make only one pass through the data (although it's not highly important, since sort will need them all anyway). a really good technique would be to use some sort of heap to hold only the 10-100 largest page sizes.

btw,

data Block = Block { size   :: Integer
                   , offset :: Integer
                   } deriving (Eq, Ord)

allows one to omit the instance Ord Block definition.

-- Best regards,
 Bulat                            mailto:Bulat.Ziganshin@gmail.com
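A rough sketch of how those suggestions could fit together: strict, unpacked Int64 fields, size listed first so the derived Ord gives the by-size ordering for free, one pass that builds the Blocks directly from consecutive offsets, and a bounded fold that keeps only the largest blocks. The keep-100 bound, the Int64 conversion, and the helper names are illustrative assumptions, not code from the thread.

\begin{code}
-- Sketch only: the strict/UNPACKed fields and the "keep only the largest
-- blocks" idea follow the suggestions above; the details are assumptions.
module Main (main) where

import System.Environment (getArgs)
import Data.Int (Int64)
import Data.List (foldl', insertBy)
import Data.Maybe (mapMaybe)
import qualified Data.ByteString.Char8 as C8

-- | The full file size (same constant as the original program).
maxSize :: Int64
maxSize = 2807444044080

-- Size first, so the derived Ord orders blocks by size (then offset).
data Block = Block { size   :: {-# UNPACK #-} !Int64
                   , offset :: {-# UNPACK #-} !Int64
                   } deriving (Eq, Ord, Show)

-- One pass: pair each offset with its successor (maxSize at the end)
-- and build the Blocks directly, instead of zipping two separate lists.
blocks :: [Int64] -> [Block]
blocks []         = []
blocks offs@(_:t) = zipWith (\o next -> Block (next - o) o) offs (t ++ [maxSize])

-- Keep only the n largest blocks seen so far (list kept in ascending order).
topN :: Int -> [Block] -> [Block]
topN n = foldl' step []
  where
    step acc b
      | length acc < n = insertBy compare b acc
      | b > head acc   = insertBy compare b (tail acc)
      | otherwise      = acc

parseOffset :: C8.ByteString -> Maybe Int64
parseOffset = fmap (fromIntegral . fst) . C8.readInteger

main :: IO ()
main = do
  (path:_) <- getArgs
  content  <- C8.readFile path
  let offs = mapMaybe parseOffset (C8.lines content)
  mapM_ print (topN 100 (blocks offs))
\end{code}

Nothing here has been benchmarked against the 11m-line input; it is only meant to make the shape of these suggestions concrete.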

On Friday, 27 February 2009, 23:18, Rogan Creswick wrote:
-- | Pull out the offsets from a line of the file.
parseOffset :: ByteString -> Maybe Integer
parseOffset s = do
  (i, _) <- C8.readInteger (C8.filter (/=':') s)
Why the C8.filter (/= ':')? That just costs and doesn't help anything (in fact, if your file contains lines like 1234:5678, it gives wrong results). If you know that your file contains only lines of the form offset: <page>, you can have

getOffsets = map (fst . fromJust . C8.readInteger) . C8.lines

which seems to do a little good.
Just i
main :: IO ()
main = do
  args    <- getArgs
  content <- B.readFile (args!!0)
  printf "%s" $ unlines $ map (show) (sort $! blocks content)
Bad! Use mapM_ print $ sort $ blocks content

Hello Daniel, Saturday, February 28, 2009, 2:21:31 AM, you wrote:
printf "%s" $ unlines $ map (show) (sort $! blocks content)
Bad! Use mapM_ print $ sort $ blocks content
are you sure? print may waste a lot of time, locking stdout for every line printed.

$! is really useless here - it will force only the *head* of the list to be computed before calling sort, which is absolutely useless :)

-- Best regards,
 Bulat                            mailto:Bulat.Ziganshin@gmail.com
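A tiny illustration of that point (this example is mine, not from the thread): forcing a list with ($!) or evaluate only reaches the outermost (:) cell, so the elements and the rest of the spine remain unevaluated thunks.

\begin{code}
import Control.Exception (evaluate)

main :: IO ()
main = do
  let xs = [1, undefined, 3] :: [Int]
  -- Forcing to WHNF (the same depth ($!) reaches) succeeds even though
  -- the list contains undefined: only the first (:) cell is evaluated.
  _ <- evaluate xs
  putStrLn "WHNF forced fine"
  -- Forcing every element (e.g. by summing) finally hits the undefined:
  print (sum xs)
\end{code}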

On Saturday, 28 February 2009, 00:37, Bulat Ziganshin wrote:
Hello Daniel,
Saturday, February 28, 2009, 2:21:31 AM, you wrote:
printf "%s" $ unlines $ map (show) (sort $! blocks content)
Bad! Use mapM_ print $ sort $ blocks content
are you sure?
Tested it. The printf "%s" is very bad. Replacing that reduced allocation and GC time by a factor of 2+. The difference between mapM_ print and putStr $ unlines $ map show $ ... is too small for me to be sure that mapM_ print is really better.
print may waste a lot of time, locking stdout for every line printed
Hm, maybe

main = do
  args    <- getArgs
  content <- B.readFile (args!!0)
  hout    <- openFile (args!!1) WriteMode
  mapM_ (hPrint hout) $ sort $ blocks content
  hClose hout

? I find hardly any difference, though.

Hello Daniel, Saturday, February 28, 2009, 3:10:44 AM, you wrote:
print may waste a lot of time, locking stdout for every line printed
hout <- openFile (args!!1) WriteMode
mapM_ (hPrint hout) $ sort $ blocks content
? I find hardly any difference, though.
no difference. if the handle is locked for every output operation - both versions will be equivalent. if not - they will also be equivalent :)

-- Best regards,
 Bulat                            mailto:Bulat.Ziganshin@gmail.com

Hello Bulat,

On Saturday, 28 February 2009, 09:38, Bulat Ziganshin wrote:
Hello Daniel,
Saturday, February 28, 2009, 3:10:44 AM, you wrote:
print may waste a lot of time, locking stdout for every line printed
hout <- openFile (args!!1) WriteMode
mapM_ (hPrint hout) $ sort $ blocks content
? I find hardly any difference, though.
no difference. if handle is locked for every output operation - both versions will be equivalent. if not - they also will be equivalent :)
But they would not be equivalent if stdout had to be locked for each output operation separately, while a file opened with openFile fp WriteMode was locked just once and remained so until closed. Since I don't know the internals, I thought it was worth taking a look. Seems they behave the same.

Hello Daniel, Saturday, February 28, 2009, 6:20:09 PM, you wrote:
But they would not be equivalent if stdout has to be locked for each output operation separately, but a file opened with openFile fp WriteMode was locked then once and remained so until closed.
ghc Handles are locked for every operation, not for the entire file lifetime. it's done in order to make concurrent operations from threads of the same program safe.

-- Best regards,
 Bulat                            mailto:Bulat.Ziganshin@gmail.com
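A related knob, not raised in the thread: whatever the per-operation locking costs, the buffering mode of the output handle usually matters more. hSetBuffering and BlockBuffering are the standard System.IO API; the helper below is only a sketch of how one might apply them here.

\begin{code}
import System.IO

-- Write the given lines through a block-buffered handle; each hPutStrLn
-- still takes the handle lock briefly, but the actual writes go out in
-- large chunks as the buffer fills.
writeLinesBuffered :: FilePath -> [String] -> IO ()
writeLinesBuffered path ls = do
  h <- openFile path WriteMode
  hSetBuffering h (BlockBuffering Nothing)  -- Nothing = default buffer size
  mapM_ (hPutStrLn h) ls
  hClose h
\end{code}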

Bulat is right about making Block's fields strict.
-- | Get the offsets between entries in a list
getSizes :: [Integer] -> [Integer]
getSizes (x:y:[]) = [y - x]
getSizes (x:y:ys) = (y - x):(getSizes (y:ys))
You should change the first part to add maxSize:
getSizes :: [Integer] -> [Integer]
getSizes (x:y:[]) = [y - x, maxSize - y]
getSizes (x:y:ys) = (y - x):(getSizes (y:ys))
This avoids the ugly use of (++) below. Note that appending to a singly linked list is a bad "code smell":
-- | creates and returns a list of Blocks, given a file's content.
blocks :: ByteString -> [Block]
blocks s = zipWith (Block) offsets sizes
  where
    offsets = getOffsets s
    sizes   = getSizes offsets
main :: IO ()
main = do
  args    <- getArgs
  content <- B.readFile (args!!0)
  printf "%s" $ unlines $ map (show) (sort $! blocks content)
\end{code}
I think the printf "%s" should be replaced by putStr or putStrLn.

The printing is forcing the unlines, which forces the map, which forces the result of sort. The ($!) is nearly pointless... it forces only the first cons (:) cell in the list.

The sort starts comparing the output of blocks by applying compare. The compare forces the snd part of the pair from the zipWith, which is the sizes. The size values force the values in the offsets in the fst part of the pair. The fst part of the pair was actually a lazy thunk returned by the C8.readInteger function. But these do not build up, since they are indirectly getting forced during the sorting routine.

Hmmm.... no quick fix.

-- Chris
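For what it's worth, one targeted way to attack the thunks Chris describes (this sketch is not from the thread): force each offset down to a plain Int64 as its line is parsed, so nothing retains readInteger's lazy result behind the Blocks. Here ($!) is applied once per element, where it actually does some work, unlike the single ($!) in main.

\begin{code}
import Data.Int (Int64)
import Data.Maybe (mapMaybe)
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as C8

-- Each offset is evaluated to a plain Int64 as soon as its line is
-- parsed, instead of being left as a thunk over readInteger's result.
getOffsets :: ByteString -> [Int64]
getOffsets = mapMaybe parse . C8.lines
  where
    parse l = case C8.readInteger l of
                Just (i, _) -> Just $! fromIntegral i
                Nothing     -> Nothing
\end{code}

Combined with strict fields in Block, the list handed to sort then holds only evaluated numbers, although the whole list still has to fit in memory for the sort itself.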
participants (5)
- Bulat Ziganshin
- ChrisK
- Daniel Fischer
- Don Stewart
- Rogan Creswick