Re: [Haskell-cafe] memory issues

27 Feb 2009


      Am Freitag, 27. Februar 2009 23:18 schrieb Rogan Creswick:
...
\begin{code}
-- Compiled with:
-- $ ghc --make offsetSorter.hs
--
--  (ghc v. 6.8.2)
--
-- Run with:
-- $ time ./offsetSorter data/byteOffsets.txt > haskOffsets.txt
-- offsetSorter: out of memory (requested 1048576 bytes)
--
-- real	4m12.130s
-- user	3m4.812s
-- sys	0m5.660s
--(OOM happened after consuming just over 3000mb of Virt, 2.6gb Res,
according to top.)
--
import System (getArgs)
import Data.Maybe
import Monad
import Text.Printf (printf)
import Data.Function (on)
import Data.List (sort)
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as C8
import qualified Data.ByteString as B
-- get the lines
-- parse each line to get the offset.
-- scan the list of offsets
-- | The full file size:
maxSize :: Integer
maxSize = 2807444044080
-- | Block is a contiguous chunk of data.
-- The first entry is the offset, the second is the length.
data Block = Block {
      offset::Integer
    , size::Integer
    } deriving (Eq)
-- | Ordering of Blocks is based entirely on the block size.
instance Ord Block where
    compare = compare `on` size
instance Show Block where
    show (Block o s) = (show o) ++ "  " ++ (show s)
-- turn the file into a list of offsets:
getOffsets :: ByteString -> [Integer]
getOffsets = catMaybes . map parseOffset . C8.lines
-- | Pull out the offsets frome a line of the file.
parseOffset :: ByteString -> Maybe Integer
parseOffset s = do
  (i, _) <- C8.readInteger (C8.filter (/=':') s)
Why the C8.filter (/= ':')?
That just costs and doesn't help anything (in fact, if your file contains 
lines like 1234:5678, it gives wrong results).

If you know that your file contains only lines of the form offset: <page>,
you can have

getOffsets = map (fst . fromJust . C8.readInteger) . C8.lines

which seems to do a little good.
...
Just i
-- | Get the offsets between entries in a list
getSizes :: [Integer]  -> [Integer]
getSizes (x:y:[]) = [y - x]
getSizes (x:y:ys) = (y - x):(getSizes (y:ys))
-- | creates and returns a list of Blocks, given a file's content.
blocks :: ByteString -> [Block]
blocks s = zipWith (Block) offsets sizes
           where offsets = getOffsets s
                 sizes   = getSizes (offsets ++ [maxSize])
main :: IO ()
main = do
  args <- getArgs
  content <- B.readFile (args!!0)
...
printf "%s" $ unlines $ map (show) (sort $! blocks content)
Bad!
Use
	mapM_ print $ sort $ blocks content
...
\end{code}

Re: [Haskell-cafe] memory issues

Daniel Fischer