
sargrigory:
I have a simple program that first generates a large (~500 MB) file of random numbers and then reads the numbers back to find their sum. It uses Data.Binary and Data.ByteString.Lazy.
The problem is that when the program tries to read the data back, it very quickly consumes all memory.
The source: http://moonpatio.com/fastcgi/hpaste.fcgi/view?id=3607#a3607
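
The paste link may no longer resolve. Reconstructed from the fragments quoted in the reply below, the shape of the program is roughly the following sketch. Only getBlockSize and fillFile appear verbatim in the reply; encodeFileAp is mentioned there but its definition is a guess (an append-mode variant of encodeFile), and main is assumed for illustration.

    import Data.Binary
    import Data.Int
    import qualified Data.ByteString.Lazy as BL
    import System.Random.Mersenne

    path = "Results.data"

    n = 20*1024*1024 :: Int

    -- Guess at the helper: append one encoded value to the file.
    encodeFileAp :: Binary a => FilePath -> a -> IO ()
    encodeFileAp f = BL.appendFile f . encode

    -- One append (and one file open/close) per element: very slow.
    fillFile :: MTGen -> Int -> IO ()
    fillFile _ 0 = return ()
    fillFile g i = do
        x <- random g :: IO Double
        encodeFileAp path x
        fillFile g (i-1)

    -- The fatal step: BL.length walks the entire lazy ByteString.
    getBlockSize :: BL.ByteString -> Int64
    getBlockSize bs = round $ (fromIntegral $ BL.length bs) / (fromIntegral n)

    main :: IO ()
    main = do
        g <- newMTGen Nothing
        fillFile g n
        results <- BL.readFile path
        -- 'results' is traversed once by BL.length and again by the
        -- decoding loop below, so every chunk forced by the length
        -- computation stays live: the whole ~500 MB file ends up in memory.
        let blockSize = getBlockSize results
        print (go results blockSize n (0 :: Double))
      where
        go _  _  0 acc = acc
        go bs bl i acc = go y bl (i-1) (acc + decode x)
          where (x, y) = BL.splitAt bl bs
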
I have tweaked this program a few ways for you. The big mistake (and why
it runs out of space) is that you take ByteString.Lazy.length to compute
the block size. This forces the entire file into memory -- so no benefits
of lazy IO.

As a separate matter, calling 'appendFile . encode' incrementally for each
element will be very slow. Much faster to encode an entire list in one go.

Finally, System.Random.Mersenne is significantly faster at Double
generation than System.Random.

With these changes (below), your program runs in constant space (both
writing out and reading in the 0.5GB file), and is much faster:

    {-# LANGUAGE BangPatterns #-}

    import Data.Binary.Put
    import Data.Binary
    import System.IO
    import Data.Int
    import qualified Data.ByteString.Lazy as BL
    import System.Random.Mersenne

    path = "Results.data"

    n = 20*1024*1024 :: Int

    -- getBlockSize :: BL.ByteString -> Int64
    -- getBlockSize bs = round $ (fromIntegral $ BL.length bs) / (fromIntegral n)
    --                                            ^^^^^ why do you take the length!?
    --                                            there's no point doing lazy IO then.

    -- Custom serialization (no length prefix)
    fillFile n = do
        g  <- newMTGen (Just 42)
        rs <- randoms g :: IO [Double]
        BL.writeFile path $ runPut $ mapM_ put (take n rs)

    -- fillFile :: MTGen -> Int -> IO ()
    -- fillFile _ 0 = return ()
    -- fillFile g i = do
    --     x <- random g :: IO Double
    --     encodeFileAp path x
    --     fillFile g (i-1)

    processFile :: BL.ByteString -> Int64 -> Int -> Double -> Double
    processFile !bs !blockSize 0 !sum = sum
    processFile  bs  blockSize i  sum =
        processFile y blockSize (i-1) (sum + decode x)
      where
        (x, y) = BL.splitAt blockSize bs

    main = do
        fillFile n

        -- compute the size without loading the file into memory
        h  <- openFile path ReadMode
        sz <- hFileSize h
        hClose h

        results <- BL.readFile path
        let blockSize = round $ fromIntegral sz / fromIntegral n
        print $ processFile results blockSize n 0

Running this:

    $ ./A +RTS -sstderr
    1.0483476019172292e7
    226,256,100,448 bytes allocated in the heap
        220,413,096 bytes copied during GC
             65,416 bytes maximum residency (1186 sample(s))
            136,376 bytes maximum slop
                  2 MB total memory in use (0 MB lost due to fragmentation)
                  ^^^^^^^^^^^^^^^^^ It now runs in constant space.

    Generation 0: 428701 collections,     0 parallel,  3.17s,  3.49s elapsed
    Generation 1:   1186 collections,     0 parallel,  0.13s,  0.16s elapsed

    INIT  time    0.00s  (  0.00s elapsed)
    MUT   time  118.26s  (129.19s elapsed)
    GC    time    3.30s  (  3.64s elapsed)
    EXIT  time    0.00s  (  0.00s elapsed)
    Total time  121.57s  (132.83s elapsed)

    %GC time       2.7%  (2.7% elapsed)
    ^^^^^^^^^^^^^^^^ Does very little GC.

    Alloc rate    1,913,172,101 bytes per MUT second

    Productivity  97.3% of total user, 89.0% of total elapsed

-- Don
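
A footnote on the block-size trick above: dividing the file size by the element count only decodes correctly because every encoded Double happens to occupy the same number of bytes. If you would rather not rely on that, Data.Binary.Get can consume the stream until it is exhausted. A minimal sketch (not from Don's post; sumDoubles is just an illustrative name, while runGet, isEmpty, and get are the ordinary binary API):

    {-# LANGUAGE BangPatterns #-}

    import Data.Binary (get)
    import Data.Binary.Get (Get, runGet, isEmpty)
    import qualified Data.ByteString.Lazy as BL

    -- Sum every Double in the stream, stopping at end of input,
    -- with a strict accumulator so no thunk chain builds up.
    sumDoubles :: BL.ByteString -> Double
    sumDoubles = runGet (go 0)
      where
        go :: Double -> Get Double
        go !acc = do
            done <- isEmpty
            if done
                then return acc
                else do
                    x <- get          -- Binary instance for Double
                    go (acc + x)

    main :: IO ()
    main = print . sumDoubles =<< BL.readFile "Results.data"

Since the Get loop consumes the lazy ByteString incrementally, this should also run in constant space, though the exact behaviour (and the on-disk format of a Double) depends on the version of the binary package in use.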