Re: [Haskell-cafe] Space leak

Hello Justin,
I tried and what I saw was a constant increase in memory usage.
Any particular profiling option that you would use?
I do remember that there was a particular option in which the leak would
dissapear (for the same amount of work) and that is why I stopped using the
profiler.
Thanks,
Arnoldo
On Wed, Mar 10, 2010 at 10:20 PM, Justin Bailey
Have you use the profiling tools available with GHC?
http://haskell.org/ghc/docs/latest/html/users_guide/profiling.html
Hello,
I am learning haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue.
I want to lazily read a set of 22 files of about 200MB each, filter them and then I want to output the result into a unique file. If I modify the main function to work only with one input file, the
On Wed, Mar 10, 2010 at 12:45 PM, Arnoldo Muller
wrote: program runs without issues. I will call this version A. Version B uses a mapM_ to iterate over a list of filenames and uses appendFile to output the result of filtering each file. In this case the memory usage grows sharply and quickly (profiles show constant memory growth). In less than a minute, memory occupation will make my system hang with swapping.
This is version B:
------------------------------- Program B
--------------------------------------------------------------------------------------------------------------------
import Data.List import System.Environment import System.Directory import Control.Monad
-- different types of chromosomes data Chromosome = C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 | C16 | C17 | C18 | C19 | CX | CY | CMT deriving (Show) -- define a window type Sequence = [Char] -- Window data data Window = Window { sequen :: Sequence, chrom :: Chromosome, pos :: Int } -- print a window instance Show Window where show w = (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos w)
-- Reading fasta files with haskell
-- Initialize the main = do -- get the arguments (intput is [input, output, windowSize] <- getArgs -- get directory contents (only names) names <- getDirectoryContents input -- prepend directory let fullNames = filter isFastaFile $ map (\x -> input ++ "/" ++ x) names let wSize = (read windowSize)::Int -- process the directories mapM (genomeExecute output wSize filterWindow) fullNames
-- read the files one by one and write them to the output file genomeExecute :: String -> Int -> (Window -> Bool) -> String -> IO () genomeExecute outputFile windowSize f inputFile = do fileData <- readFile inputFile appendFile outputFile $ fastaExtractor fileData windowSize f
-- isFastaFile :: String -> Bool isFastaFile fileName = isSuffixOf ".fa" fileName
-- fasta extractor (receives a Fasta String and returns a windowed string ready to be sorted) -- an example on how to compose several functions to parse a fasta file fastaExtractor :: String -> Int -> (Window -> Bool) -> String fastaExtractor input wSize f = printWindowList $ filter f $ readFasta wSize input
-- MAIN FILTER that removes N elements from the strings! filterWindow :: Window -> Bool filterWindow w = not (elem 'N' (sequen w))
-- print a window list (the printing makes it ready for output as raw data) printWindowList :: [Window] -> String printWindowList l = unlines $ map show l
-- read fasta, remove stuff that is not useful from it -- removes the readFasta :: Int -> [Char] -> [Window] readFasta windowSize sequence = -- get the header let (header:rest) = lines sequence chr = parseChromosome header in
-- We now do the following: -- take window create counter remove newlines map (\(i, w) -> Window w chr i) $ zip [0..] $ slideWindow windowSize $ filter ( '\n' /= ) $ unlines rest
slideWindow :: Int -> [Char] -> [[Char]] slideWindow _ [] = [] slideWindow windowSize l@(_:xs) = take windowSize l : slideWindow windowSize xs
-- Parse the chromosome from a fasta comment -- produce a more compact chromosome representation parseChromosome :: [Char] -> Chromosome parseChromosome line | isInfixOf "chromosome 1," line = C1 | isInfixOf "chromosome 2," line = C2 | isInfixOf "chromosome 3," line = C3 | isInfixOf "chromosome 4," line = C4 | isInfixOf "chromosome 5," line = C5 | isInfixOf "chromosome 6," line = C6 | isInfixOf "chromosome 7," line = C7 | isInfixOf "chromosome 8," line = C9 | isInfixOf "chromosome 10," line = C10 | isInfixOf "chromosome 11," line = C11 | isInfixOf "chromosome 12," line = C12 | isInfixOf "chromosome 13," line = C13 | isInfixOf "chromosome 14," line = C14 | isInfixOf "chromosome 15," line = C15 | isInfixOf "chromosome 16," line = C16 | isInfixOf "chromosome 17" line = C17 | isInfixOf "chromosome 18" line = C18 | isInfixOf "chromosome 19" line = C19 | isInfixOf "chromosome X" line = CX | isInfixOf "chromosome Y" line = CY | isInfixOf "mitochondrion" line = CMT | otherwise = error "BAD header"
-------------------------------- End of program B
------------------------------------------------------------------------------------------------
-------------------------------- Program A
---------------------------------------------------------------------------------------------------------
If instead of the main function given above I use the following main function to process only one input file, things work OK for even the largest files. Memory usage remains constant in this case.
main = do -- get the arguments [input, output, windowSize] <- getArgs -- keep the input stream inpStr <- readFile input let wSize = (read windowSize)::Int writeFile output $ fastaExtractor inpStr wSize filterWindow
It is not easy for me to see why is Haskell keeping data in memory. Do you have any idea why program B is not working?
Thank you for your help!
Arnoldo Muller
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Wed, Mar 10, 2010 at 2:03 PM, Arnoldo Muller
Hello Justin,
I tried and what I saw was a constant increase in memory usage. Any particular profiling option that you would use?
A great place to get started with profiling is the chapter in Real-World Haskell: http://book.realworldhaskell.org/read/profiling-and-optimization.html For a problem like this I would look at general heap profiling (-hc), retainer profiling (-hr), and also type profiling (-hy) to see if any of them provide new insight. For example, -hc might tell you which functions are problematic, but -hr is more likely to help you there. Sometimes the specific type that is leaking is not what you think and that's why -hy is nice. I hope that helps, Jason
participants (2)
-
Arnoldo Muller
-
Jason Dagit