space leak processing multiple compressed files

Hi everyone,

I have a collection of bzipped files. Each file has a different number of items per line, with a separator between them. What I want to do is count the items in each file. I'm trying to read the files lazily but I seem to be running out of memory. I'm assuming I'm holding onto resources longer than I need to. Does anyone have any advice on how to improve this?

Here's the basic program, slightly sanitized:

    main = do
        -- get a list of file names
        filelist <- getFileList "testsetdir"
        -- process each compressed file
        files <- mapM (\x -> do
                    thisfile <- B.readFile x
                    return (Z.decompress thisfile)
                ) filelist
        display $ processEntries files
        putStrLn "finished"

    -- processEntries
    -- processEntries is defined elsewhere, but basically does some string processing per line,
    -- counts the number of resulting elements and sums them per file
    processEntries :: [B.ByteString] -> Int
    processEntries xs = foldl' (\x y -> x + processEntries (B.lines y)) 0 xs

    -- display a field that returns a number
    display :: Int -> IO ()
    display = putStrLn . show

On Tue, Sep 4, 2012 at 11:00 AM, Ian Knopke wrote:
    -- processEntries
    -- processEntries is defined elsewhere, but basically does some string processing per line,
    -- counts the number of resulting elements and sums them per file
    processEntries :: [B.ByteString] -> Int
    processEntries xs = foldl' (\x y -> x + processEntries (B.lines y)) 0 xs

The problem seems to be your `processEntries` function: it is recursively defined, and as far as I understand, it's never going to end because "y" (inside the lambda function) is always going to be the full list of files (xs).

Probably, `processEntries` should be something like:

    processEntries = foldl' (\acc fileContent -> acc + processFileContent fileContent) 0

    processFileContent :: B.ByteString -> Int
    processFileContent = -- count what you have to, in a file

In fact, processEntries could be rewritten without using foldl':

    processEntries = sum . map processFileContent

hth,
L.
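
For illustration, here is one way the non-recursive version could be filled in. This is only a sketch under assumptions the thread does not confirm: `B` is taken to be `Data.ByteString.Lazy.Char8`, the separator is taken to be a comma (the hypothetical `separator` below), and `countItems` is a guess at what the real helper does.

```haskell
import qualified Data.ByteString.Lazy.Char8 as B
import Data.List (foldl')

-- Hypothetical separator; the original post does not say which one is used.
separator :: Char
separator = ','

-- Count the separated items on every line of one file, with a strict accumulator.
countItems :: [B.ByteString] -> Int
countItems = foldl' (\acc line -> acc + length (B.split separator line)) 0

-- Per-file count: split the decompressed contents into lines, then count items.
processFileContent :: B.ByteString -> Int
processFileContent = countItems . B.lines

-- Sum the per-file counts; note there is no recursive call to processEntries.
processEntries :: [B.ByteString] -> Int
processEntries = foldl' (\acc fileContent -> acc + processFileContent fileContent) 0
```

The only structural change from the original is that the lambda calls a per-file function instead of `processEntries` itself.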

Hi Lorenzo,
You're correct. Well spotted! I must have created that doing some copy
and paste. The program is basically as you suggested it. Here's a
corrected version:
    main = do
        -- get a list of file names
        filelist <- getFileList "testsetdir"
        -- process each compressed file
        files <- mapM (\x -> do
                    thisfile <- B.readFile x
                    return (Z.decompress thisfile)
                ) filelist
        display $ processEntries files
        putStrLn "finished"

    -- processEntries
    -- processEntries is defined elsewhere, but basically does some string
    -- processing per line, counts the number of resulting elements and
    -- sums them per file
    processEntries :: [B.ByteString] -> Int
    processEntries xs = foldl' (\x y -> x + countItems (B.lines y)) 0 xs
I'm still running into memory issues though. I think it's the mapM
loop above and that each file is not being released after reading
through it. Does that seem reasonable, and is there any way to write
this better?
Ian
... and countItems uses foldl'
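
One way to test the hypothesis about the mapM loop is to avoid building the list of file contents at all and to finish counting each file before the next one is opened. The sketch below assumes `B` is `Data.ByteString.Lazy.Char8`. The names `countFiles`, `decompress` and `countOne` are hypothetical: `decompress` stands in for `Z.decompress` and `countOne` for something like `countItems . B.lines`. Whether this removes the memory problem depends on what `countItems` actually retains, so treat it as an experiment rather than a fix.

```haskell
import qualified Data.ByteString.Lazy.Char8 as B
import Control.Monad (foldM)

-- Fold over the file names, keeping only a running Int between files.
countFiles
  :: (B.ByteString -> B.ByteString)  -- decompression function (e.g. Z.decompress)
  -> (B.ByteString -> Int)           -- per-file counting function
  -> [FilePath]
  -> IO Int
countFiles decompress countOne = foldM step 0
  where
    step acc path = do
      contents <- B.readFile path
      -- ($!) forces the new total here, so this file is consumed
      -- before the next one is opened.
      return $! acc + countOne (decompress contents)
```

With the names from the program above, the call would look like `countFiles Z.decompress (countItems . B.lines) filelist`. Note that a lazy `readFile` closes its handle only once the input has been read to the end, so a counting function that stops early could still leave handles open.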
On Tue, Sep 4, 2012 at 1:55 PM, Lorenzo Bolla wrote:
In fact, processEntries could be rewritten without using foldl':

    processEntries = sum . map processFileContent

hth, L.

You might want to look at conduits if you need deterministic and prompt
finalisation. I would sketch out a solution but I have only my phone.
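
As a rough idea of what a conduit-based version might look like, here is a sketch that uses only the `conduit` package and takes the decompression stage as a parameter. The names `countLinesWith` and `countAllWith` are hypothetical. For bzip2 input the parameter would be a bunzip2 conduit from a separate package (I believe bzlib-conduit provides one, but check its documentation). To keep it short, this counts lines rather than separated items.

```haskell
import Conduit
import Control.Monad (foldM)
import Data.ByteString (ByteString)

-- Stream one file through the given decompression stage and count its lines.
-- runConduitRes releases the file handle as soon as the pipeline finishes.
countLinesWith
  :: ConduitT ByteString ByteString (ResourceT IO) ()  -- decompression stage
  -> FilePath
  -> IO Int
countLinesWith decompressC path =
  runConduitRes $
    sourceFile path .| decompressC .| linesUnboundedAsciiC .| lengthC

-- Process the files one after another, keeping only a strict running total.
countAllWith
  :: ConduitT ByteString ByteString (ResourceT IO) ()
  -> [FilePath]
  -> IO Int
countAllWith decompressC = foldM step 0
  where
    step acc path = do
      n <- countLinesWith decompressC path
      return $! acc + n
```

Each file is read in chunks, so memory use should stay roughly constant per file as long as individual lines are not huge.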
On Sep 4, 2012 2:36 PM, "Ian Knopke" wrote:
I'm still running into memory issues though. I think it's the mapM loop above and that each file is not being released after reading through it. Does that seem reasonable, and is there any way to write this better?
_______________________________________________
Beginners mailing list
Beginners@haskell.org
http://www.haskell.org/mailman/listinfo/beginners
participants (4): Benjamin Edwards, Ian Knopke, Lorenzo Bolla, Michael Orlitzky