Hello Justin,
I tried and what I saw was a constant increase in memory usage.
Any particular profiling option that you would use?
I do remember that there was a particular option in which the leak would dissapear (for the same amount of work) and that is why I stopped using the profiler.
Thanks,
Arnoldo
Have you use the profiling tools available with GHC?
http://haskell.org/ghc/docs/latest/html/users_guide/profiling.html
> _______________________________________________
On Wed, Mar 10, 2010 at 12:45 PM, Arnoldo Muller
<arnoldomuller@gmail.com> wrote:
> Hello,
>
> I am learning haskell and I found a space leak that I find difficult to
> solve. I've been asking at #haskell but we could not solve
> the issue.
>
> I want to lazily read a set of 22 files of about 200MB each, filter them and
> then I want to output the result into a unique file.
> If I modify the main function to work only with one input file, the program
> runs without issues. I will call this version A.
> Version B uses a mapM_ to iterate over a list of filenames and uses
> appendFile to output the result of filtering each file.
> In this case the memory usage grows sharply and quickly (profiles show
> constant memory growth). In less than a minute, memory
> occupation will make my system hang with swapping.
>
> This is version B:
>
> ------------------------------- Program B
> --------------------------------------------------------------------------------------------------------------------
> import Data.List
> import System.Environment
> import System.Directory
> import Control.Monad
>
>
> -- different types of chromosomes
> data Chromosome = C1
> | C2
> | C3
> | C4
> | C5
> | C6
> | C7
> | C8
> | C9
> | C10
> | C11
> | C12
> | C13
> | C14
> | C15
> | C16
> | C17
> | C18
> | C19
> | CX
> | CY
> | CMT
> deriving (Show)
> -- define a window
> type Sequence = [Char]
> -- Window data
> data Window = Window { sequen :: Sequence,
> chrom :: Chromosome,
> pos :: Int
> }
> -- print a window
> instance Show Window where
> show w = (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos w)
>
> -- Reading fasta files with haskell
>
> -- Initialize the
> main = do
> -- get the arguments (intput is
> [input, output, windowSize] <- getArgs
> -- get directory contents (only names)
> names <- getDirectoryContents input
> -- prepend directory
> let fullNames = filter isFastaFile $ map (\x -> input ++ "/" ++ x)
> names
> let wSize = (read windowSize)::Int
> -- process the directories
> mapM (genomeExecute output wSize filterWindow) fullNames
>
>
> -- read the files one by one and write them to the output file
> genomeExecute :: String -> Int -> (Window -> Bool) -> String -> IO ()
> genomeExecute outputFile windowSize f inputFile = do
> fileData <- readFile inputFile
> appendFile outputFile $ fastaExtractor fileData windowSize f
>
> --
> isFastaFile :: String -> Bool
> isFastaFile fileName = isSuffixOf ".fa" fileName
>
>
> -- fasta extractor (receives a Fasta String and returns a windowed string
> ready to be sorted)
> -- an example on how to compose several functions to parse a fasta file
> fastaExtractor :: String -> Int -> (Window -> Bool) -> String
> fastaExtractor input wSize f = printWindowList $ filter f $ readFasta wSize
> input
>
> -- MAIN FILTER that removes N elements from the strings!
> filterWindow :: Window -> Bool
> filterWindow w = not (elem 'N' (sequen w))
>
> -- print a window list (the printing makes it ready for output as raw data)
> printWindowList :: [Window] -> String
> printWindowList l = unlines $ map show l
>
> -- read fasta, remove stuff that is not useful from it
> -- removes the
> readFasta :: Int -> [Char] -> [Window]
> readFasta windowSize sequence =
> -- get the header
> let (header:rest) = lines sequence
> chr = parseChromosome header
> in
>
> -- We now do the following:
> -- take window create counter
> remove newlines
> map (\(i, w) -> Window w chr i) $ zip [0..] $ slideWindow windowSize $
> filter ( '\n' /= ) $ unlines rest
>
>
> slideWindow :: Int -> [Char] -> [[Char]]
> slideWindow _ [] = []
> slideWindow windowSize l@(_:xs) = take windowSize l : slideWindow
> windowSize xs
>
>
>
> -- Parse the chromosome from a fasta comment
> -- produce a more compact chromosome representation
> parseChromosome :: [Char] -> Chromosome
> parseChromosome line
> | isInfixOf "chromosome 1," line = C1
> | isInfixOf "chromosome 2," line = C2
> | isInfixOf "chromosome 3," line = C3
> | isInfixOf "chromosome 4," line = C4
> | isInfixOf "chromosome 5," line = C5
> | isInfixOf "chromosome 6," line = C6
> | isInfixOf "chromosome 7," line = C7
> | isInfixOf "chromosome 8," line = C9
> | isInfixOf "chromosome 10," line = C10
> | isInfixOf "chromosome 11," line = C11
> | isInfixOf "chromosome 12," line = C12
> | isInfixOf "chromosome 13," line = C13
> | isInfixOf "chromosome 14," line = C14
> | isInfixOf "chromosome 15," line = C15
> | isInfixOf "chromosome 16," line = C16
> | isInfixOf "chromosome 17" line = C17
> | isInfixOf "chromosome 18" line = C18
> | isInfixOf "chromosome 19" line = C19
> | isInfixOf "chromosome X" line = CX
> | isInfixOf "chromosome Y" line = CY
> | isInfixOf "mitochondrion" line = CMT
> | otherwise = error "BAD header"
>
>
> -------------------------------- End of program B
> ------------------------------------------------------------------------------------------------
>
> -------------------------------- Program A
> ---------------------------------------------------------------------------------------------------------
> If instead of the main function given above I use the following main
> function to process only one input file, things work OK for even
> the largest files. Memory usage remains constant in this case.
>
> main = do
> -- get the arguments
> [input, output, windowSize] <- getArgs
> -- keep the input stream
> inpStr <- readFile input
> let wSize = (read windowSize)::Int
> writeFile output $ fastaExtractor inpStr wSize filterWindow
>
>
> It is not easy for me to see why is Haskell keeping data in memory. Do you
> have any idea why program B is
> not working?
>
> Thank you for your help!
>
> Arnoldo Muller
>
> Haskell-Cafe mailing list
> Haskell-Cafe@haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
>