
Am Freitag, 27. Februar 2009 23:18 schrieb Rogan Creswick:
\begin{code} -- Compiled with: -- $ ghc --make offsetSorter.hs -- -- (ghc v. 6.8.2) -- -- Run with: -- $ time ./offsetSorter data/byteOffsets.txt > haskOffsets.txt -- offsetSorter: out of memory (requested 1048576 bytes) -- -- real 4m12.130s -- user 3m4.812s -- sys 0m5.660s --(OOM happened after consuming just over 3000mb of Virt, 2.6gb Res, according to top.) --
import System (getArgs) import Data.Maybe import Monad import Text.Printf (printf) import Data.Function (on) import Data.List (sort) import Data.ByteString (ByteString) import qualified Data.ByteString.Char8 as C8 import qualified Data.ByteString as B
-- get the lines -- parse each line to get the offset. -- scan the list of offsets
-- | The full file size: maxSize :: Integer maxSize = 2807444044080
-- | Block is a contiguous chunk of data. -- The first entry is the offset, the second is the length. data Block = Block { offset::Integer , size::Integer } deriving (Eq)
-- | Ordering of Blocks is based entirely on the block size. instance Ord Block where compare = compare `on` size
instance Show Block where show (Block o s) = (show o) ++ " " ++ (show s)
-- turn the file into a list of offsets: getOffsets :: ByteString -> [Integer] getOffsets = catMaybes . map parseOffset . C8.lines
-- | Pull out the offsets frome a line of the file. parseOffset :: ByteString -> Maybe Integer parseOffset s = do (i, _) <- C8.readInteger (C8.filter (/=':') s)
Why the C8.filter (/= ':')? That just costs and doesn't help anything (in fact, if your file contains lines like 1234:5678, it gives wrong results). If you know that your file contains only lines of the form offset: <page>, you can have getOffsets = map (fst . fromJust . C8.readInteger) . C8.lines which seems to do a little good.
Just i
-- | Get the offsets between entries in a list getSizes :: [Integer] -> [Integer] getSizes (x:y:[]) = [y - x] getSizes (x:y:ys) = (y - x):(getSizes (y:ys))
-- | creates and returns a list of Blocks, given a file's content. blocks :: ByteString -> [Block] blocks s = zipWith (Block) offsets sizes where offsets = getOffsets s sizes = getSizes (offsets ++ [maxSize])
main :: IO () main = do args <- getArgs content <- B.readFile (args!!0)
printf "%s" $ unlines $ map (show) (sort $! blocks content)
Bad! Use mapM_ print $ sort $ blocks content
\end{code}