Program with ByteStrings leads to memory exhaustion.

I have a simple program that first generates a large (~500 MB) file of random numbers and then reads the numbers back to find their sum. It uses Data.Binary and Data.ByteString.Lazy. The problem is that when the program tries to read the data back, it very quickly consumes all memory.

The source: http://moonpatio.com/fastcgi/hpaste.fcgi/view?id=3607#a3607

or:

module Main where

import Data.Binary
import Data.Int
import System.Random
import qualified Data.ByteString.Lazy as BL

encodeFileAp f = BL.appendFile f . encode

path = "Results.data"
n = 20*1024*1024 :: Int

getBlockSize :: BL.ByteString -> Int64
getBlockSize bs = round $ (fromIntegral $ BL.length bs) / (fromIntegral n)

fillFile :: StdGen -> Int -> IO ()
fillFile _ 0 = return ()
fillFile gen i = do
    let (x, gen') = random gen :: (Double, StdGen)
    encodeFileAp path x
    fillFile gen' (i-1)

processFile :: BL.ByteString -> Int64 -> Int -> Double -> Double
processFile bs blockSize 0 sum = sum
processFile bs blockSize i sum =
    let tmpTuple = BL.splitAt blockSize bs
        x        = decode $ fst $! tmpTuple
    in  processFile (snd tmpTuple) blockSize (i-1) $! sum + x

main = do
    fillFile (mkStdGen 42) n
    results <- BL.readFile path
    putStrLn $ show $ processFile results (getBlockSize results) n 0

Excerpts from Grigory Sarnitskiy's message of Mon Sep 14 16:05:41 +0200 2009:
I have a simple program that first generates a large (~500 MB) file of random numbers and then reads the numbers back to find their sum. It uses Data.Binary and Data.ByteString.Lazy.
I do think that this is due to splitAt, but I haven't made a deep investigation.

-- Nicolas Pouillard
http://nicolaspouillard.fr
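One way to check that hypothesis is to walk a lazily read file using nothing but BL.splitAt and force each piece. The sketch below is just an illustrative probe, not code from the thread; the 4096-byte step and the reuse of Results.data are assumptions. If it stays in constant space, splitAt alone is unlikely to be the leak.

{-# LANGUAGE BangPatterns #-}

import Data.Int (Int64)
import qualified Data.ByteString.Lazy as BL

-- Split the remaining input into fixed-size pieces, force each piece
-- with BL.length, and keep a strict running total of the bytes seen.
walk :: Int64 -> Int64 -> BL.ByteString -> Int64
walk step !acc bs
    | BL.null bs = acc
    | otherwise  = let (piece, rest) = BL.splitAt step bs
                   in  walk step (acc + BL.length piece) rest

main :: IO ()
main = do
    bs <- BL.readFile "Results.data"
    -- If this prints the file size while memory stays flat,
    -- the blow-up must come from somewhere other than splitAt.
    print (walk 4096 0 bs)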

sargrigory:
I have a simple program that first generates a large (~500 MB) file of random numbers and then reads the numbers back to find their sum. It uses Data.Binary and Data.ByteString.Lazy.
The problem is that when the program tries to read the data back, it very quickly consumes all memory.
The source: http://moonpatio.com/fastcgi/hpaste.fcgi/view?id=3607#a3607
I have tweaked this program a few ways for you.

The big mistake (and why it runs out of space) is that you take ByteString.Lazy.length to compute the block size. This forces the entire file into memory -- so there are no benefits of lazy IO.

As a separate matter, calling 'appendFile . encode' incrementally for each element will be very slow. It is much faster to encode an entire list in one go.

Finally, System.Random.Mersenne is significantly faster at Double generation than System.Random.

With these changes (below), your program runs in constant space (both writing out and reading in the 0.5 GB file), and is much faster:

{-# LANGUAGE BangPatterns #-}

import Data.Binary.Put
import Data.Binary
import System.IO
import Data.Int
import qualified Data.ByteString.Lazy as BL
import System.Random.Mersenne

path = "Results.data"
n = 20*1024*1024 :: Int

-- getBlockSize :: BL.ByteString -> Int64
-- getBlockSize bs = round $ (fromIntegral $ BL.length bs) / (fromIntegral n)
--
--     ^^^^^ why do you take the length!?
--     there's no point doing lazy IO then.

-- Custom serialization (no length prefix)
fillFile n = do
    g  <- newMTGen (Just 42)
    rs <- randoms g :: IO [Double]
    BL.writeFile path $ runPut $ mapM_ put (take n rs)

-- fillFile :: MTGen -> Int -> IO ()
-- fillFile _ 0 = return ()
-- fillFile g i = do
--     x <- random g :: IO Double
--     encodeFileAp path x
--     fillFile g (i-1)

processFile :: BL.ByteString -> Int64 -> Int -> Double -> Double
processFile !bs !blockSize 0 !sum = sum
processFile  bs  blockSize i  sum =
    processFile y blockSize (i-1) (sum + decode x)
  where
    (x, y) = BL.splitAt blockSize bs

main = do
    fillFile n

    -- compute the size without loading the file into memory
    h  <- openFile path ReadMode
    sz <- hFileSize h
    hClose h

    results <- BL.readFile path
    let blockSize = round $ fromIntegral sz / fromIntegral n
    print $ processFile results blockSize n 0

------------------------------------------------------------------------

Running this:

$ ./A +RTS -sstderr
1.0483476019172292e7
 226,256,100,448 bytes allocated in the heap
     220,413,096 bytes copied during GC
          65,416 bytes maximum residency (1186 sample(s))
         136,376 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)
               ^^^^^^^^^^^^^^^^^
               It now runs in constant space.

  Generation 0: 428701 collections,     0 parallel,  3.17s,  3.49s elapsed
  Generation 1:   1186 collections,     0 parallel,  0.13s,  0.16s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time  118.26s  (129.19s elapsed)
  GC    time    3.30s  (  3.64s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time  121.57s  (132.83s elapsed)

  %GC time       2.7%  (2.7% elapsed)
                 ^^^^^^^^^^^^^^^^
                 Does very little GC.

  Alloc rate    1,913,172,101 bytes per MUT second

  Productivity  97.3% of total user, 89.0% of total elapsed

-- Don
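The two fixes can also be seen side by side in isolation. The sketch below is purely illustrative: the helper names (sizeByLength, sizeByHandle, writeSlow, writeFast), the Double element type and the file path are placeholders, not anything from the posted program.

import Data.Int (Int64)
import Data.Binary (encode, put)
import Data.Binary.Put (runPut)
import System.IO (IOMode(ReadMode), hFileSize, withFile)
import qualified Data.ByteString.Lazy as BL

-- Reads every byte of the file just to learn its length. On its own this
-- can stream, but in the original main the same ByteString is reused by
-- processFile, so every chunk it reads stays live and the whole file
-- ends up resident in memory.
sizeByLength :: FilePath -> IO Int64
sizeByLength f = fmap BL.length (BL.readFile f)

-- Asks the filesystem for the size instead; nothing is read into memory.
sizeByHandle :: FilePath -> IO Integer
sizeByHandle f = withFile f ReadMode hFileSize

-- One handle open, one encode and one append per element: very slow.
writeSlow :: FilePath -> [Double] -> IO ()
writeSlow f = mapM_ (BL.appendFile f . encode)

-- One runPut over the whole list, written out lazily in a single pass.
writeFast :: FilePath -> [Double] -> IO ()
writeFast f xs = BL.writeFile f (runPut (mapM_ put xs))

Don's rewrite above follows the second shape in both cases: hFileSize for the block-size computation, and a single runPut/writeFile for generating the file.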

I have tweaked this program a few ways for you. The big mistake (and why it runs out of space) is that you take ByteString.Lazy.length to compute the block size. This forces the entire file into memory -- so there are no benefits of lazy IO. As a separate matter, calling 'appendFile . encode' incrementally for each element will be very slow. It is much faster to encode an entire list in one go. Finally, System.Random.Mersenne is significantly faster at Double generation than System.Random.
Thank you! Just excellent! // I'm so happy :-)
participants (3)
- Don Stewart
- Grigory Sarnitskiy
- Nicolas Pouillard