
Am Mittwoch 10 März 2010 21:45:56 schrieb Arnoldo Muller:
Hello,
I am learning haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue.
I want to lazily read a set of 22 files of about 200MB each, filter them and then I want to output the result into a unique file. If I modify the main function to work only with one input file, the program runs without issues. I will call this version A. Version B uses a mapM_ to iterate over a list of filenames and uses appendFile to output the result of filtering each file. In this case the memory usage grows sharply and quickly (profiles show constant memory growth). In less than a minute, memory occupation will make my system hang with swapping.
No work is been done until the end, when all is tried to be done simultaneously. Make sure genomeExecute ... input1 has actually finished its work before genomeExecute ... input2 starts etc. One way is to use a stricter version of sequence_, sequence'_ :: Monad m => [m a] -> m () sequence'_ (x:xs) = do a <- x a `seq` sequence'_ xs sequence'_ [] = return () (nicer with BangPatterns, but not portable), and mapM'_ f = sequence'_ . map f Another option is making genomeExecute itself stricter.
This is version B:
------------------------------- Program B ------------------------------------------------------------------------ -------------------------------------------- import Data.List import System.Environment import System.Directory import Control.Monad
-- different types of chromosomes data Chromosome = C1
| C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 | C16 | C17 | C18 | C19 | CX | CY | CMT
deriving (Show) -- define a window type Sequence = [Char] -- Window data data Window = Window { sequen :: Sequence, chrom :: Chromosome, pos :: Int } -- print a window instance Show Window where show w = (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos w)
-- Reading fasta files with haskell
-- Initialize the main = do -- get the arguments (intput is [input, output, windowSize] <- getArgs -- get directory contents (only names) names <- getDirectoryContents input -- prepend directory let fullNames = filter isFastaFile $ map (\x -> input ++ "/" ++ x) names let wSize = (read windowSize)::Int -- process the directories mapM (genomeExecute output wSize filterWindow) fullNames
-- read the files one by one and write them to the output file genomeExecute :: String -> Int -> (Window -> Bool) -> String -> IO () genomeExecute outputFile windowSize f inputFile = do fileData <- readFile inputFile appendFile outputFile $ fastaExtractor fileData windowSize f