[Haskell-cafe] Space leak

11 Mar 2010

      Hello,

I am learning haskell and I found a space leak that I find difficult to
solve. I've been asking at #haskell but we could not solve
the issue.

I want to lazily read a set of 22 files of about 200MB each, filter them and
then I want to output the result into a unique file.
If I modify the main function to work only with one input file,  the program
runs without issues. I will call this version A.
Version B  uses a mapM_ to iterate over a list of filenames and uses
appendFile to output the result of filtering each file.
In this case the memory usage grows sharply and quickly (profiles show
constant memory growth). In less than a minute, memory
occupation will make my system hang with swapping.

This is version B:

------------------------------- Program B
--------------------------------------------------------------------------------------------------------------------
import Data.List
import System.Environment
import System.Directory
import Control.Monad

-- different types of chromosomes
data Chromosome =    C1
                | C2
                | C3
                | C4
                | C5
                | C6
                | C7
                | C8
                | C9
                | C10
                | C11
                | C12
                | C13
                | C14
                | C15
                | C16
                | C17
                | C18
                | C19
                | CX
                | CY
                | CMT
                  deriving (Show)
-- define a window
type Sequence = [Char]
-- Window data
data Window = Window { sequen :: Sequence,
                       chrom :: Chromosome,
                       pos   :: Int
                     }
-- print a window
instance Show Window where
    show w =  (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos w)

-- Reading fasta files with haskell

-- Initialize the
main = do
       -- get the arguments (intput is
       [input, output, windowSize] <- getArgs
       -- get directory contents (only names)
       names <- getDirectoryContents input
       -- prepend directory
       let fullNames = filter isFastaFile $ map (\x -> input ++ "/" ++ x)
names
       let wSize = (read windowSize)::Int
       -- process the directories
       mapM (genomeExecute output wSize filterWindow)  fullNames

-- read the files one by one and write them to the output file
genomeExecute :: String -> Int -> (Window -> Bool) -> String -> IO ()
genomeExecute  outputFile windowSize f inputFile = do
  fileData <- readFile inputFile
  appendFile outputFile $ fastaExtractor fileData windowSize f

-- 
isFastaFile :: String -> Bool
isFastaFile fileName = isSuffixOf ".fa" fileName

-- fasta extractor (receives a Fasta String and returns a windowed string
ready to be sorted)
-- an example on how to compose several functions to parse a fasta file
fastaExtractor :: String -> Int -> (Window -> Bool) -> String
fastaExtractor input wSize f = printWindowList $ filter f $ readFasta  wSize
input

-- MAIN FILTER that removes N elements from the strings!
filterWindow :: Window -> Bool
filterWindow w = not (elem 'N' (sequen w))

-- print a window list (the printing makes it ready for output as raw data)
printWindowList :: [Window] -> String
printWindowList l = unlines $ map show l

-- read fasta, remove stuff that is not useful from it
-- removes the
readFasta :: Int -> [Char] -> [Window]
readFasta windowSize sequence =
    -- get the header
    let (header:rest) = lines sequence
        chr = parseChromosome header
        in

-- We now do the following:
--      take window                  create counter
remove newlines
   map (\(i, w) -> Window w chr i) $ zip [0..]  $ slideWindow windowSize  $
filter ( '\n' /= )  $ unlines rest

slideWindow :: Int -> [Char] -> [[Char]]
slideWindow _ [] = []
slideWindow windowSize l@(_:xs)  = take windowSize l : slideWindow
windowSize xs

-- Parse the chromosome from a fasta comment
-- produce a more compact chromosome representation
parseChromosome :: [Char] -> Chromosome
parseChromosome line
    | isInfixOf "chromosome 1," line = C1
    | isInfixOf "chromosome 2," line = C2
    | isInfixOf "chromosome 3," line = C3
    | isInfixOf "chromosome 4," line = C4
    | isInfixOf "chromosome 5," line = C5
    | isInfixOf "chromosome 6," line = C6
    | isInfixOf "chromosome 7," line = C7
    | isInfixOf "chromosome 8," line = C9
    | isInfixOf "chromosome 10," line = C10
    | isInfixOf "chromosome 11," line = C11
    | isInfixOf "chromosome 12," line = C12
    | isInfixOf "chromosome 13," line = C13
    | isInfixOf "chromosome 14," line = C14
    | isInfixOf "chromosome 15," line = C15
    | isInfixOf "chromosome 16," line = C16
    | isInfixOf "chromosome 17"  line = C17
    | isInfixOf "chromosome 18"  line = C18
    | isInfixOf "chromosome 19"  line = C19
    | isInfixOf "chromosome X"   line = CX
    | isInfixOf "chromosome Y"   line = CY
    | isInfixOf "mitochondrion"  line = CMT
    | otherwise = error "BAD header"

-------------------------------- End of program B
------------------------------------------------------------------------------------------------

-------------------------------- Program A
---------------------------------------------------------------------------------------------------------
If instead of the main function given above I use the following main
function to process only one input file, things work OK for even
the largest files. Memory usage remains constant in this case.

main = do
       -- get the arguments
       [input, output, windowSize] <- getArgs
       -- keep the input stream
       inpStr <- readFile input
       let wSize = (read windowSize)::Int
       writeFile output $ fastaExtractor inpStr wSize filterWindow

It is not easy for me to see why is Haskell keeping data in memory. Do you
have any idea why  program B is
not working?

Thank you for your help!

Arnoldo Muller