On Wed, Mar 10, 2010 at 10:20 PM, Justin Bailey <jgbailey@gmail.com> wrote:

Have you use the profiling tools available with GHC?

http://haskell.org/ghc/docs/latest/html/users_guide/profiling.html

On Wed, Mar 10, 2010 at 12:45 PM, Arnoldo Muller
<arnoldomuller@gmail.com> wrote:
> Hello,
>
> I am learning haskell and I found a space leak that I find difficult to
> solve. I've been asking at #haskell but we could not solve
> the issue.
>
> I want to lazily read a set of 22 files of about 200MB each, filter them and
> then I want to output the result into a unique file.
> If I modify the main function to work only with one input file, the program
> runs without issues. I will call this version A.
> Version B uses a mapM_ to iterate over a list of filenames and uses
> appendFile to output the result of filtering each file.
> In this case the memory usage grows sharply and quickly (profiles show
> constant memory growth). In less than a minute, memory
> occupation will make my system hang with swapping.
>
> This is version B:
>
> ------------------------------- Program B
> --------------------------------------------------------------------------------------------------------------------
> import Data.List
> import System.Environment
> import System.Directory
> import Control.Monad
>
>
> -- different types of chromosomes
> data Chromosome =    C1
>                 | C2
>                 | C3
>                 | C4
>                 | C5
>                 | C6
>                 | C7
>                 | C8
>                 | C9
>                 | C10
>                 | C11
>                 | C12
>                 | C13
>                 | C14
>                 | C15
>                 | C16
>                 | C17
>                 | C18
>                 | C19
>                 | CX
>                 | CY
>                 | CMT
>                   deriving (Show)
> -- define a window
> type Sequence = [Char]
> -- Window data
> data Window = Window { sequen :: Sequence,
>                        chrom :: Chromosome,
>                        pos   :: Int
>                      }
> -- print a window
> instance Show Window where
>     show w = (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos w)
>
> -- Reading fasta files with haskell
>
> -- Initialize the
> main = do
>        -- get the arguments (intput is
>        [input, output, windowSize] <- getArgs
>        -- get directory contents (only names)
>        names <- getDirectoryContents input
>        -- prepend directory
>        let fullNames = filter isFastaFile $ map (\x -> input ++ "/" ++ x)
> names
>        let wSize = (read windowSize)::Int
>        -- process the directories
>        mapM (genomeExecute output wSize filterWindow) fullNames
>
>
> -- read the files one by one and write them to the output file
> genomeExecute :: String -> Int -> (Window -> Bool) -> String -> IO ()
> genomeExecute outputFile windowSize f inputFile = do
> fileData <- readFile inputFile
> appendFile outputFile $ fastaExtractor fileData windowSize f
>
> --
> isFastaFile :: String -> Bool
> isFastaFile fileName = isSuffixOf ".fa" fileName
>
>
> -- fasta extractor (receives a Fasta String and returns a windowed string
> ready to be sorted)
> -- an example on how to compose several functions to parse a fasta file
> fastaExtractor :: String -> Int -> (Window -> Bool) -> String
> fastaExtractor input wSize f = printWindowList $ filter f $ readFasta wSize
> input
>
> -- MAIN FILTER that removes N elements from the strings!
> filterWindow :: Window -> Bool
> filterWindow w = not (elem 'N' (sequen w))
>
> -- print a window list (the printing makes it ready for output as raw data)
> printWindowList :: [Window] -> String
> printWindowList l = unlines $ map show l
>
> -- read fasta, remove stuff that is not useful from it
> -- removes the
> readFasta :: Int -> [Char] -> [Window]
> readFasta windowSize sequence =
>     -- get the header
>     let (header:rest) = lines sequence
>         chr = parseChromosome header
>         in
>
> -- We now do the following:
> --      take window                  create counter
> remove newlines
>    map (\(i, w) -> Window w chr i) $ zip [0..] $ slideWindow windowSize $
> filter ( '\n' /= ) $ unlines rest
>
>
> slideWindow :: Int -> [Char] -> [[Char]]
> slideWindow _ [] = []
> slideWindow windowSize l@(_:xs) = take windowSize l : slideWindow
> windowSize xs
>
>
>
> -- Parse the chromosome from a fasta comment
> -- produce a more compact chromosome representation
> parseChromosome :: [Char] -> Chromosome
> parseChromosome line
>     | isInfixOf "chromosome 1," line = C1
>     | isInfixOf "chromosome 2," line = C2
>     | isInfixOf "chromosome 3," line = C3
>     | isInfixOf "chromosome 4," line = C4
>     | isInfixOf "chromosome 5," line = C5
>     | isInfixOf "chromosome 6," line = C6
>     | isInfixOf "chromosome 7," line = C7
>     | isInfixOf "chromosome 8," line = C9
>     | isInfixOf "chromosome 10," line = C10
>     | isInfixOf "chromosome 11," line = C11
>     | isInfixOf "chromosome 12," line = C12
>     | isInfixOf "chromosome 13," line = C13
>     | isInfixOf "chromosome 14," line = C14
>     | isInfixOf "chromosome 15," line = C15
>     | isInfixOf "chromosome 16," line = C16
>     | isInfixOf "chromosome 17" line = C17
>     | isInfixOf "chromosome 18" line = C18
>     | isInfixOf "chromosome 19" line = C19
>     | isInfixOf "chromosome X"   line = CX
>     | isInfixOf "chromosome Y"   line = CY
>     | isInfixOf "mitochondrion" line = CMT
>     | otherwise = error "BAD header"
>
>
> -------------------------------- End of program B
> ------------------------------------------------------------------------------------------------
>
> -------------------------------- Program A
> ---------------------------------------------------------------------------------------------------------
> If instead of the main function given above I use the following main
> function to process only one input file, things work OK for even
> the largest files. Memory usage remains constant in this case.
>
> main = do
>        -- get the arguments
>        [input, output, windowSize] <- getArgs
>        -- keep the input stream
>        inpStr <- readFile input
>        let wSize = (read windowSize)::Int
>        writeFile output $ fastaExtractor inpStr wSize filterWindow
>
>
> It is not easy for me to see why is Haskell keeping data in memory. Do you
> have any idea why program B is
> not working?
>
> Thank you for your help!
>
> Arnoldo Muller
>

> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe@haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
>