
Hello,

I am learning Haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue.

I want to lazily read a set of 22 files of about 200MB each, filter them, and then I want to output the result into a single file. If I modify the main function to work with only one input file, the program runs without issues; I will call this version A. Version B uses mapM_ to iterate over a list of filenames and uses appendFile to output the result of filtering each file. In this case the memory usage grows sharply and quickly (profiles show constant memory growth). In less than a minute, memory occupation will make my system hang with swapping.

This is version B:

------------------------------- Program B --------------------------------

import Data.List
import System.Environment
import System.Directory
import Control.Monad

-- different types of chromosomes
data Chromosome = C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10
                | C11 | C12 | C13 | C14 | C15 | C16 | C17 | C18 | C19
                | CX | CY | CMT
     deriving (Show)

-- define a window
type Sequence = [Char]

-- Window data
data Window = Window { sequen :: Sequence,
                       chrom  :: Chromosome,
                       pos    :: Int }

-- print a window
instance Show Window where
    show w = (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos w)

-- Reading fasta files with haskell

main = do
    -- get the arguments (input is a directory)
    [input, output, windowSize] <- getArgs
    -- get directory contents (only names)
    names <- getDirectoryContents input
    -- prepend directory
    let fullNames = filter isFastaFile $ map (\x -> input ++ "/" ++ x) names
    let wSize = (read windowSize) :: Int
    -- process the directories
    mapM (genomeExecute output wSize filterWindow) fullNames

-- read the files one by one and write them to the output file
genomeExecute :: String -> Int -> (Window -> Bool) -> String -> IO ()
genomeExecute outputFile windowSize f inputFile = do
    fileData <- readFile inputFile
    appendFile outputFile $ fastaExtractor fileData windowSize f

isFastaFile :: String -> Bool
isFastaFile fileName = isSuffixOf ".fa" fileName

-- fasta extractor (receives a Fasta String and returns a windowed string ready to be sorted)
-- an example on how to compose several functions to parse a fasta file
fastaExtractor :: String -> Int -> (Window -> Bool) -> String
fastaExtractor input wSize f = printWindowList $ filter f $ readFasta wSize input

-- MAIN FILTER that removes N elements from the strings!
filterWindow :: Window -> Bool
filterWindow w = not (elem 'N' (sequen w))

-- print a window list (the printing makes it ready for output as raw data)
printWindowList :: [Window] -> String
printWindowList l = unlines $ map show l

-- read fasta, remove stuff that is not useful from it
readFasta :: Int -> [Char] -> [Window]
readFasta windowSize sequence =
    -- get the header
    let (header:rest) = lines sequence
        chr           = parseChromosome header
    in
    -- take windows, create the counter, remove newlines
    map (\(i, w) -> Window w chr i) $
        zip [0..] $ slideWindow windowSize $ filter ('\n' /=) $ unlines rest

slideWindow :: Int -> [Char] -> [[Char]]
slideWindow _ [] = []
slideWindow windowSize l@(_:xs) = take windowSize l : slideWindow windowSize xs

-- Parse the chromosome from a fasta comment
-- produce a more compact chromosome representation
parseChromosome :: [Char] -> Chromosome
parseChromosome line
    | isInfixOf "chromosome 1," line = C1
    | isInfixOf "chromosome 2," line = C2
    | isInfixOf "chromosome 3," line = C3
    | isInfixOf "chromosome 4," line = C4
    | isInfixOf "chromosome 5," line = C5
    | isInfixOf "chromosome 6," line = C6
    | isInfixOf "chromosome 7," line = C7
    | isInfixOf "chromosome 8," line = C8
    | isInfixOf "chromosome 9," line = C9
    | isInfixOf "chromosome 10," line = C10
    | isInfixOf "chromosome 11," line = C11
    | isInfixOf "chromosome 12," line = C12
    | isInfixOf "chromosome 13," line = C13
    | isInfixOf "chromosome 14," line = C14
    | isInfixOf "chromosome 15," line = C15
    | isInfixOf "chromosome 16," line = C16
    | isInfixOf "chromosome 17" line = C17
    | isInfixOf "chromosome 18" line = C18
    | isInfixOf "chromosome 19" line = C19
    | isInfixOf "chromosome X" line = CX
    | isInfixOf "chromosome Y" line = CY
    | isInfixOf "mitochondrion" line = CMT
    | otherwise = error "BAD header"

-------------------------------- End of program B --------------------------------

-------------------------------- Program A --------------------------------

If instead of the main function given above I use the following main function to process only one input file, things work OK even for the largest files. Memory usage remains constant in this case.

main = do
    -- get the arguments
    [input, output, windowSize] <- getArgs
    -- keep the input stream
    inpStr <- readFile input
    let wSize = (read windowSize) :: Int
    writeFile output $ fastaExtractor inpStr wSize filterWindow

It is not easy for me to see why Haskell is keeping data in memory. Do you have any idea why program B is not working?

Thank you for your help!

Arnoldo Muller

Hello Arnoldo, Wednesday, March 10, 2010, 11:45:56 PM, you wrote:
I am learning haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue.
Make some experiments: leave only one file and use version A, then replace appendFile with writeFile.

--
Best regards,
 Bulat                            mailto:Bulat.Ziganshin@gmail.com

Hello Bulat,
I ran program A with writeFile instead of appendFile and it still works
without problems.
Regarding program B, if I use writeFile the leaking still occurs.
Any other hints? :)
Arnoldo
On Wed, Mar 10, 2010 at 10:32 PM, Bulat Ziganshin wrote:
Hello Arnoldo,
Wednesday, March 10, 2010, 11:45:56 PM, you wrote:
I am learning haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue.
Make some experiments: leave only one file and use version A, then replace appendFile with writeFile.
--
Best regards,
 Bulat                            mailto:Bulat.Ziganshin@gmail.com

On Wednesday, 10 March 2010, 21:45:56, Arnoldo Muller wrote:
Hello,
I am learning haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue.
I want to lazily read a set of 22 files of about 200MB each, filter them and then I want to output the result into a single file. If I modify the main function to work only with one input file, the program runs without issues. I will call this version A. Version B uses a mapM_ to iterate over a list of filenames and uses appendFile to output the result of filtering each file. In this case the memory usage grows sharply and quickly (profiles show constant memory growth). In less than a minute, memory occupation will make my system hang with swapping.
No work is being done until the end, when everything is attempted at once. Make sure genomeExecute ... input1 has actually finished its work before genomeExecute ... input2 starts, etc. One way is to use a stricter version of sequence_,

sequence'_ :: Monad m => [m a] -> m ()
sequence'_ (x:xs) = do
    a <- x
    a `seq` sequence'_ xs
sequence'_ [] = return ()

(nicer with BangPatterns, but not portable), and

mapM'_ f = sequence'_ . map f

Another option is making genomeExecute itself stricter.
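For concreteness, a rough sketch of the BangPatterns spelling and of the one changed line in main (untested, everything else as in your programme):

{-# LANGUAGE BangPatterns #-}

sequence'_ :: Monad m => [m a] -> m ()
sequence'_ (x:xs) = do !_ <- x
                       sequence'_ xs
sequence'_ []     = return ()

mapM'_ :: Monad m => (a -> m b) -> [a] -> m ()
mapM'_ f = sequence'_ . map f

-- in main, instead of mapM:
--     mapM'_ (genomeExecute output wSize filterWindow) fullNames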
This is version B:
------------------------------- Program B --------------------------------

import Data.List
import System.Environment
import System.Directory
import Control.Monad

-- different types of chromosomes
data Chromosome = C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10
                | C11 | C12 | C13 | C14 | C15 | C16 | C17 | C18 | C19
                | CX | CY | CMT
     deriving (Show)

-- define a window
type Sequence = [Char]

-- Window data
data Window = Window { sequen :: Sequence,
                       chrom  :: Chromosome,
                       pos    :: Int }

-- print a window
instance Show Window where
    show w = (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos w)

-- Reading fasta files with haskell

main = do
    -- get the arguments (input is a directory)
    [input, output, windowSize] <- getArgs
    -- get directory contents (only names)
    names <- getDirectoryContents input
    -- prepend directory
    let fullNames = filter isFastaFile $ map (\x -> input ++ "/" ++ x) names
    let wSize = (read windowSize) :: Int
    -- process the directories
    mapM (genomeExecute output wSize filterWindow) fullNames

-- read the files one by one and write them to the output file
genomeExecute :: String -> Int -> (Window -> Bool) -> String -> IO ()
genomeExecute outputFile windowSize f inputFile = do
    fileData <- readFile inputFile
    appendFile outputFile $ fastaExtractor fileData windowSize f

Hello Daniel:
Thanks!
I employed mapM'_ but I am still getting the space leak.
Any other hint?
Arnoldo
On Wed, Mar 10, 2010 at 10:40 PM, Daniel Fischer
On Wednesday, 10 March 2010, 21:45:56, Arnoldo Muller wrote:
Hello,
I am learning haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue.
I want to lazily read a set of 22 files of about 200MB each, filter them and then I want to output the result into a single file. If I modify the main function to work only with one input file, the program runs without issues. I will call this version A. Version B uses a mapM_ to iterate over a list of filenames and uses appendFile to output the result of filtering each file. In this case the memory usage grows sharply and quickly (profiles show constant memory growth). In less than a minute, memory occupation will make my system hang with swapping.
No work is being done until the end, when everything is attempted at once. Make sure genomeExecute ... input1 has actually finished its work before genomeExecute ... input2 starts, etc.
One way is to use a stricter version of sequence_,
sequence'_ :: Monad m => [m a] -> m ()
sequence'_ (x:xs) = do
    a <- x
    a `seq` sequence'_ xs
sequence'_ [] = return ()
(nicer with BangPatterns, but not portable), and
mapM'_ f = sequence'_ . map f
Another option is making genomeExecute itself stricter.
This is version B:
------------------------------- Program B --------------------------------

import Data.List
import System.Environment
import System.Directory
import Control.Monad

-- different types of chromosomes
data Chromosome = C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10
                | C11 | C12 | C13 | C14 | C15 | C16 | C17 | C18 | C19
                | CX | CY | CMT
     deriving (Show)

-- define a window
type Sequence = [Char]

-- Window data
data Window = Window { sequen :: Sequence,
                       chrom  :: Chromosome,
                       pos    :: Int }

-- print a window
instance Show Window where
    show w = (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos w)

-- Reading fasta files with haskell

main = do
    -- get the arguments (input is a directory)
    [input, output, windowSize] <- getArgs
    -- get directory contents (only names)
    names <- getDirectoryContents input
    -- prepend directory
    let fullNames = filter isFastaFile $ map (\x -> input ++ "/" ++ x) names
    let wSize = (read windowSize) :: Int
    -- process the directories
    mapM (genomeExecute output wSize filterWindow) fullNames

-- read the files one by one and write them to the output file
genomeExecute :: String -> Int -> (Window -> Bool) -> String -> IO ()
genomeExecute outputFile windowSize f inputFile = do
    fileData <- readFile inputFile
    appendFile outputFile $ fastaExtractor fileData windowSize f

On Wednesday, 10 March 2010, 23:01:28, Arnoldo Muller wrote:
Hello Daniel:
Thanks! I employed mapM'_ but I am still getting the space leak. Any other hint?
Hmm, offhand, I don't see why that isn't strict enough. With some datafiles, I could try to investigate.

One question: how does programme C, with

main = do
    [input, output, windowSize] <- getArgs
    let wSize = (read windowSize) :: Int
    genomeExecute output wSize filterWindow input

behave? Space leak or not?

But yes, a few other hints I have (though they're not likely to squash the space leak). Generally, ByteString IO is often orders of magnitude faster than String IO and uses much less memory, so using (lazy) ByteStrings is worthy of consideration.
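A toy illustration of what the IO side would look like with lazy ByteStrings (a sketch only; the function and its name are mine, and the pure windowing code would of course have to be ported to ByteString too):

import qualified Data.ByteString.Lazy.Char8 as L

-- stream a file through a pure filter, dropping lines that contain 'N'
copyWithoutNs :: FilePath -> FilePath -> IO ()
copyWithoutNs inputFile outputFile = do
    fileData <- L.readFile inputFile
    L.appendFile outputFile . L.unlines . filter (L.notElem 'N') $ L.lines fileData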
-- define a window
type Sequence = [Char]

-- Window data
data Window = Window { sequen :: Sequence,
                       chrom  :: Chromosome,
                       pos    :: Int }

-- print a window
instance Show Window where
    show w = (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos w)

-- Reading fasta files with haskell

main = do
    -- get the arguments (input is a directory)
    [input, output, windowSize] <- getArgs
    -- get directory contents (only names)
    names <- getDirectoryContents input
    -- prepend directory
    let fullNames = filter isFastaFile $ map (\x -> input ++ "/" ++ x) names

*********
let fullNames = map ((input ++) . ("/" ++)) $ filter isFastaFile names

saves a little work
*********

    let wSize = (read windowSize) :: Int
    -- process the directories
    mapM (genomeExecute output wSize filterWindow) fullNames

-- read the files one by one and write them to the output file
genomeExecute :: String -> Int -> (Window -> Bool) -> String -> IO ()
genomeExecute outputFile windowSize f inputFile = do
    fileData <- readFile inputFile
    appendFile outputFile $ fastaExtractor fileData windowSize f

*********
The arguments of fastaExtractor should be reversed, then

genomeExecute outputFile windowSize f inputFile =
    appendFile outputFile . fastaExtractor' f windowSize =<< readFile inputFile
*********

isFastaFile :: String -> Bool
isFastaFile fileName = isSuffixOf ".fa" fileName

-- fasta extractor (receives a Fasta String and returns a windowed string ready to be sorted)
-- an example on how to compose several functions to parse a fasta file
fastaExtractor :: String -> Int -> (Window -> Bool) -> String
fastaExtractor input wSize f = printWindowList $ filter f $ readFasta wSize input

*********
fastaExtractor' f wSize = printWindowList . filter f . readFasta wSize
*********

-- MAIN FILTER that removes N elements from the strings!
filterWindow :: Window -> Bool
filterWindow w = not (elem 'N' (sequen w))

*********
filterWindow w = 'N' `notElem` sequen w
*********

-- print a window list (the printing makes it ready for output as raw data)
printWindowList :: [Window] -> String
printWindowList l = unlines $ map show l

-- read fasta, remove stuff that is not useful from it
readFasta :: Int -> [Char] -> [Window]
readFasta windowSize sequence =
    -- get the header
    let (header:rest) = lines sequence
        chr           = parseChromosome header
    in
    -- take windows, create the counter, remove newlines
    map (\(i, w) -> Window w chr i) $
        zip [0..] $ slideWindow windowSize $ filter ('\n' /=) $ unlines rest

*********
filter ('\n' /=) . unlines is odd. What about concat? Or

readFasta wSize chrseq =
    case span (/= '\n') chrseq of
      (header, _:rest) ->
          let chr = parseChromosome header
          in map (\(i, w) -> Window w chr i) . zip [0 ..]
                 . slideWindow wSize $ filter (/= '\n') rest
      _ -> []

if your input file format had no other newline than the one between header and body, that'd be nice.
*********

slideWindow :: Int -> [Char] -> [[Char]]
slideWindow _ [] = []
slideWindow windowSize l@(_:xs) = take windowSize l : slideWindow windowSize xs

*********
slideWindow wSize = map (take wSize) . tails
*********

On Thursday, 11 March 2010, 00:24:28, Daniel Fischer wrote:
Hmm, offhand, I don't see why that isn't strict enough.
Turns out, mapM_ was a red herring. The villain was (zip and map). I must confess, I don't know why it sort-of worked without the mapM_, though. "Sort-of", because that also hung on to unnecessarily much memory, the space leak was just smaller than with the mapM_.

A very small change that eliminates the space leak is

readFasta :: Int -> [Char] -> [Window]
readFasta windowSize sequence =
    -- get the header
    let (header, rest) = span (/= '\n') sequence
        chr            = parseChromosome header
        go i (w:ws)    = Window w chr i : go (i+1) ws
        go _ []        = []
    in go 0 $ slideWindow windowSize $ filter (/= '\n') rest

You can improve performance by eliminating slideWindow and the intermediate Window list (merging fastaExtractor and readFasta):

{-# LANGUAGE BangPatterns #-}

readFasta2 :: (String -> Bool) -> Int -> String -> String
readFasta2 test windowSize sequence =
    let (header, rest) = span (/= '\n') sequence
        chr            = parseChromosome header
        schr           = show chr
        go !i st@(_:tl)
            | test w    = w ++ '\t' : schr ++ '\t' : show i ++ '\n' : go (i+1) tl
            | otherwise = go (i+1) tl
            where w = take windowSize st
        go _ []         = []
    in go 0 (filter (/= '\n') rest)
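For what it's worth, a sketch of how the merged function would plug in (the Window-based filter then has to be rephrased as a test on the raw window string):

fastaExtractor2 :: String -> Int -> String
fastaExtractor2 input wSize = readFasta2 ('N' `notElem`) wSize input

and genomeExecute would call fastaExtractor2 instead of fastaExtractor.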

Daniel,
Thank you so much for helping me out with this issue!
Thanks to all the other answers from haskell-cafe members too!
As a newbie, I am not able to understand why zip and map would make a problem...
Is there any link I could read that could help me to understand why in this case zip and map created a leak? What are some function compositions that should be avoided when doing lazy I/O?
Regards,
Arnoldo
On Thu, Mar 11, 2010 at 11:46 PM, Daniel Fischer
On Thursday, 11 March 2010, 00:24:28, Daniel Fischer wrote:
Hmm, offhand, I don't see why that isn't strict enough.
Turns out, mapM_ was a red herring. The villain was (zip and map). I must confess, I don't know why it sort-of worked without the mapM_, though. "sort-of", because that also hung on to unnecessarily much memory, the space leak was just smaller than with the mapM_.
A very small change that eliminates the space leak is
readFasta :: Int -> [Char] -> [Window]
readFasta windowSize sequence =
    -- get the header
    let (header, rest) = span (/= '\n') sequence
        chr            = parseChromosome header
        go i (w:ws)    = Window w chr i : go (i+1) ws
        go _ []        = []
    in go 0 $ slideWindow windowSize $ filter (/= '\n') rest
You can improve performance by eliminating slideWindow and the intermediate Window list (merging fastaExtractor and readFasta),
{-# LANGUAGE BangPatterns #-}
readFasta2 :: (String -> Bool) -> Int -> String -> String
readFasta2 test windowSize sequence =
    let (header, rest) = span (/= '\n') sequence
        chr            = parseChromosome header
        schr           = show chr
        go !i st@(_:tl)
            | test w    = w ++ '\t' : schr ++ '\t' : show i ++ '\n' : go (i+1) tl
            | otherwise = go (i+1) tl
            where w = take windowSize st
        go _ []         = []
    in go 0 (filter (/= '\n') rest)

On Thu, Mar 11, 2010 at 3:44 PM, Arnoldo Muller
Daniel,
Thank you so much for helping me out with this issue!
Thanks to all the other answers from haskell-cafe members too!
As a newbie, I am not able to understand why zip and map would make a problem...
Is there any link I could read that could help me to understand why in this case zip and map created a leak? What are some function compositions that should be avoided when doing lazy I/O?
Actually, it's lazy I/O itself that should be avoided.

Jason

Jason,
I am trying to use haskell in the analysis of bio data. One of the main
reasons I wanted to use haskell is because lazy I/O allows you to see a
large bio-sequence as if it was a string in memory.
In order to achieve the same result in an imperative language I would have
to write lots of error-prone iterators. I saw lazy I/O as a very strong
point in favor of Haskell.
Besides the space leaks that can occur and that are a bit difficult to find
for a newbie like me, are there any other reasons to avoid Lazy I/O?
Arnoldo.
On Sat, Mar 13, 2010 at 6:46 PM, Jason Dagit
On Thu, Mar 11, 2010 at 3:44 PM, Arnoldo Muller
wrote: Daniel,
Thank you so much for helping me out with this issue!
Thanks to all the other answers from haskell-cafe members too!
As a newbie, I am not able to understand why zip and map would make a problem...
Is there any link I could read that could help me to understand why in this case zip and map created a leak? What are some function compositions that should be avoided when doing lazy I/O?
Actually, it's lazy I/O itself that should be avoided.
Jason

On Sat, Mar 13, 2010 at 3:58 PM, Arnoldo Muller
Jason,
I am trying to use haskell in the analysis of bio data. One of the main reasons I wanted to use haskell is because lazy I/O allows you to see a large bio-sequence as if it was a string in memory. In order to achieve the same result in an imperative language I would have to write lots of error-prone iterators. I saw lazy I/O as a very strong point in favor of Haskell.
There's a safer lazy IO lib on Hackage: http://hackage.haskell.org/package/safe-lazy-io

It seems the safer approach, though somewhat more confusing to some people, is the Iteratee pattern. The reasons why have probably been explained best in a paper on Oleg's site.
Besides the space leaks that can occur and that are a bit difficult to find for a newbie like me, are there any other reasons to avoid Lazy I/O?
Perhaps these two links will enlighten you. They did for me, and I'm now working out how exactly to convert a really inefficient but explicit IO program (char by char right now... yuck) to an Iteratee-based parsing setup on a work-related project. Hopefully I'll be doing all this in the coming week, and I'll be able to publish some results on my blog (things come up a lot at work, though, so I'm keeping my fingers crossed on this one).

http://okmij.org/ftp/Haskell/Iteratee/Lazy-vs-correct.txt
http://okmij.org/ftp/Streams.html

Dave
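To give a feel for the shape of the idea without pulling in a library, here's a stripped-down toy version (my own simplification, not the API of any iteratee package): the reading loop owns the Handle and feeds fixed-size strict chunks to a strict left fold, so no lazy file contents ever escape into pure code.

{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString as B
import System.IO

foldChunks :: (a -> B.ByteString -> a) -> a -> FilePath -> IO a
foldChunks step z path = withFile path ReadMode (go z)
  where
    go !acc h = do
        chunk <- B.hGet h 65536            -- read at most 64K at a time
        if B.null chunk
            then return acc                -- EOF; withFile closes the handle
            else go (step acc chunk) h

For instance, foldChunks (\n c -> n + B.count 10 c) 0 path counts newlines (byte 10) in constant space.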
Arnoldo.
On Sat, Mar 13, 2010 at 6:46 PM, Jason Dagit
wrote: On Thu, Mar 11, 2010 at 3:44 PM, Arnoldo Muller
wrote: Daniel,
Thank you so much for helping me out with this issue!
Thanks to all the other answers from haskell-cafe members too!
As a newbie, I am not able to understand why zip and map would make a problem...
Is there any link I could read that could help me to understand why in this case zip and map created a leak? What are some function compositions that should be avoided when doing lazy I/O?
Actually, it's lazy I/O itself that should be avoided.
Jason

On Sat, Mar 13, 2010 at 8:58 PM, Arnoldo Muller
Jason,
I am trying to use haskell in the analysis of bio data. One of the main reasons I wanted to use haskell is because lazy I/O allows you to see a large bio-sequence as if it was a string in memory. In order to achieve the same result in an imperative language I would have to write lots of error-prone iterators. I saw lazy I/O as a very strong point in favor of Haskell.
What about the mmap function? It's available on Linux; you can use it from C and probably a lot of other imperative languages. Was the non-portability the issue with using it?

On Mar 13, 2010, at 18:58 , Arnoldo Muller wrote:
In order to achieve the same result in an imperative language I would have to write lots of error-prone iterators. I saw lazy I/O as a very strong point in favor of Haskell.
Besides the space leaks that can occur and that are a bit difficult to find for a newbie like me, are there any other reasons to avoid Lazy I/O?
The biggest problem is that it is completely impossible to detect, much less recover from, lazy I/O errors. (You could theoretically force the result under control of "evaluate", thus putting it back in IO, but then you lose all the laziness you wanted. Exceptions, in particular I/O exceptions, are by definition impure; so pure code can neither recognize nor deal with them.)

--
brandon s. allbery    [solaris,freebsd,perl,pugs,haskell]    allbery@kf8nh.com
system administrator  [openafs,heimdal,too many hats]        allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university                KF8NH
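A sketch of that "force it back into IO" workaround, just to make it concrete (the helper name is made up):

import Control.Exception (SomeException, evaluate, try)

readFileChecked :: FilePath -> IO (Either SomeException String)
readFileChecked path = try $ do
    s <- readFile path
    _ <- evaluate (length s)   -- force the whole file so read errors surface here, in IO
    return s

Of course, this reads the entire file into memory, which is exactly the laziness you gave up.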

On Sunday, 14 March 2010, 00:58:09, Arnoldo Muller wrote:
Jason,
I am trying to use haskell in the analysis of bio data. One of the main reasons I wanted to use haskell is because lazy I/O allows you to see a large bio-sequence as if it was a string in memory. In order to achieve the same result in an imperative language I would have to write lots of error-prone iterators. I saw lazy I/O as a very strong point in favor of Haskell.
Besides the space leaks that can occur and that are a bit difficult to find for a newbie like me, are there any other reasons to avoid Lazy I/O?
You may be happy to hear that the space leak you encountered had __nothing whatsoever__ to do with lazy IO. It's true that lazy IO offers some pitfalls for the unwary (and some, but much fewer, for the wary), but I think the dangers of lazy IO tend to be exaggerated. For your application, readFile and appendFile are absolutely fine; the space leak occurred in the pure code.

Below is a variant of your programme that doesn't use file IO. The active readFasta function has the space leak, the currently commented-out one does not. Compile with -O2, run with e.g.

./leak +RTS -s -M400M -RTS 3 10000000 10

One runs in constant space, the other does not.
----------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Data.List
import System.Environment (getArgs)

data Chromosome = C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10
                | C11 | C12 | C13 | C14 | C15 | C16 | C17 | C18 | C19
                | CX | CY | CMT
     deriving (Show)

type Sequence = [Char]

data Window = Window { sequen :: Sequence,
                       chrom  :: Chromosome,
                       pos    :: Int }

instance Show Window where
    show w = (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos w)

main = do
    [ct, len, windowSize] <- getArgs
    let wSize  = (read windowSize) :: Int
        ln     = read len
        inData = [(cn, ln) | cn <- [1 .. read ct]]
    mapM_ (uncurry $ genomeExecute filterWindow wSize) inData

countLines :: String -> Int
countLines = go 0
  where
    go !acc []        = acc
    go !acc ('\n':cs) = go (acc+1) cs
    go !acc (_:cs)    = go acc cs

genomeExecute :: (Window -> Bool) -> Int -> Int -> Int -> IO ()
genomeExecute flt wSize cn ln =
    print . countLines $
        fastaExtractor ("chromosome " ++ show cn ++ ",\n" ++ replicate (cn*ln) 'A') wSize flt

fastaExtractor :: String -> Int -> (Window -> Bool) -> String
fastaExtractor input wSize f = printWindowList $ filter f $ readFasta wSize input

filterWindow :: Window -> Bool
filterWindow w = not (elem 'N' (sequen w))

printWindowList :: [Window] -> String
printWindowList l = unlines $ map show l

{-
readFasta :: Int -> [Char] -> [Window]
readFasta windowSize sequence =
    let (header, rest) = span (/= '\n') sequence
        chr            = parseChromosome header
        go i (w:ws)    = Window w chr i : go (i+1) ws
        go _ []        = []
    in go 0 $ slideWindow windowSize $ filter (/= '\n') rest
-}

readFasta :: Int -> [Char] -> [Window]
readFasta windowSize sequence =
    let (header, rest) = span (/= '\n') sequence
        chr            = parseChromosome header
    in map (\(i, w) -> Window w chr i) $
           zip [0..] $ slideWindow windowSize $ filter ('\n' /=) rest

slideWindow :: Int -> [Char] -> [[Char]]
slideWindow _ [] = []
slideWindow windowSize l@(_:xs) = take windowSize l : slideWindow windowSize xs

parseChromosome :: [Char] -> Chromosome
parseChromosome line
    | isInfixOf "chromosome 1," line = C1
    | isInfixOf "chromosome 2," line = C2
    | isInfixOf "chromosome 3," line = C3
    | isInfixOf "chromosome 4," line = C4
    | isInfixOf "chromosome 5," line = C5
    | isInfixOf "chromosome 6," line = C6
    | isInfixOf "chromosome 7," line = C7
    | isInfixOf "chromosome 8," line = C8
    | isInfixOf "chromosome 9," line = C9
    | isInfixOf "chromosome 10," line = C10
    | isInfixOf "chromosome 11," line = C11
    | isInfixOf "chromosome 12," line = C12
    | isInfixOf "chromosome 13," line = C13
    | isInfixOf "chromosome 14," line = C14
    | isInfixOf "chromosome 15," line = C15
    | isInfixOf "chromosome 16," line = C16
    | isInfixOf "chromosome 17" line = C17
    | isInfixOf "chromosome 18" line = C18
    | isInfixOf "chromosome 19" line = C19
    | isInfixOf "chromosome X" line = CX
    | isInfixOf "chromosome Y" line = CY
    | isInfixOf "mitochondrion" line = CMT
    | otherwise = error "BAD header"
----------------------------------------------------------------------

Arnoldo Muller
I am trying to use haskell in the analysis of bio data. One of the main reasons I wanted to use haskell is because lazy I/O allows you to see a large bio-sequence as if it was a string in memory.
Funny you should mention it. I've written a bioinformatics library¹ that (naturally) supports reading and writing various file formats for sequences and alignments and stuff. Some of these files can be substantial in size (i.e., larger than my laptop's memory), so most IO of potentially large files (Fasta, BLAST XML output, 454 SFF files...) is done lazily, and large Fasta sequences are read as lazy bytestrings.

This works nicely for a lot of use cases (well, my use cases, at any rate, which quite often boil down to streaming through the data). One thing to look out for is O(n) indexed access to lazy bytestrings, so there's a defragment operation that converts a sequence to a single chunk (which gives O(1) access, but of course must fit into memory).

I guess the most annoying thing about laziness is that small test cases always work; you need Real Data to stress test your programs for excessive memory use.

Lazy IO always worked well for me, so although I feel I should look more deeply into "real" solutions, like Iteratee, my half-hearted attempts to do so have only resulted in the conclusion that it was more complicated, and thus postponed for some rainy day... lazy IO for lazy programmers, I guess.

-k

¹ Stuff's on Hackage in the bioinformatics section and also on http://blog.malde.org and http://malde.org/~ketil/bioinformatics.

--
If I haven't seen further, it is by standing in the footprints of giants
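(Something along these lines, for illustration - not the library's actual code:)

import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

-- collapse a lazy ByteString into a single strict chunk, trading memory
-- for O(1) indexed access
defragment :: BL.ByteString -> B.ByteString
defragment = B.concat . BL.toChunks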

Ketil Malde
Lazy IO always worked well for me, so althouhg I feel I should look more deeply into "real" solutions, like Iteratee, my half-hearted attemts to do so have only resulted in the conclusion that it was more complicated, and thus postponed for some rainy day... lazy IO for lazy programmers, I guess.
If you have a huge data set and a large address space (say, a 64-bit processor), you may want to use bytestring-mmap.

--
(c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.
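(For reference, and from memory, so treat the module and function names as an assumption to check on Hackage; the core of bytestring-mmap is roughly:)

import qualified Data.ByteString as B
import System.IO.Posix.MMap (unsafeMMapFile)  -- assumed module/function name

-- map the file into the address space; the OS pages the data in on demand,
-- and the "unsafe" refers to the file possibly changing while it is mapped
readMapped :: FilePath -> IO B.ByteString
readMapped = unsafeMMapFile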

Hello Arnoldo, Wednesday, March 10, 2010, 11:45:56 PM, you wrote:
I am learning haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue.
What if you use program B on a single file?

--
Best regards,
 Bulat                            mailto:Bulat.Ziganshin@gmail.com

Bulat,
The same happens, the memory starts to quickly fill up...
Arnoldo
On Wed, Mar 10, 2010 at 11:16 PM, Bulat Ziganshin wrote:
Hello Arnoldo,
Wednesday, March 10, 2010, 11:45:56 PM, you wrote:
I am learning haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue.
What if you use program B on a single file?
--
Best regards,
 Bulat                            mailto:Bulat.Ziganshin@gmail.com

Hi Arnoldo

This doesn't address the space leak, but your parseChromosome function looks very inefficient - isInfixOf repeatedly checks the prefix "chromosome" for C1 to CY. If you have a lot of CYs in a file then it will do a lot of work parsing them.

The cleanest way of handling this would be to use a parser combinator library with keywords for "chromosome" and "mitochondrion" - however, that might add a performance penalty itself. Here is a version that should be fairly efficient, although a little ugly due to how it has to match literal chars at the prefix of the string.

Add an import for Data.Char to the import list:
import Data.Char
Add Enum to the deriving clause of the Chromosome data type:
| C19 | CX | CY | CMT deriving (Show,Enum)
Replace parseChromosome with the one below. Note that the derived Enum functions for Chromosome are indexed from 0.. whereas when you read one from the file it is indexed from 1.., so you have to sub1 before using toEnum:

sub1 :: Int -> Int
sub1 x = x - 1

parseChromosome :: [Char] -> Chromosome
parseChromosome ('c':'h':'r':'o':'m':'o':'s':'o':'m':'e':' ':xs) = chro xs
  where
    chro ('X' :_)                           = CX
    chro ('Y' :_)                           = CY
    chro ( x  : ',' :_)        | isDigit x  = toEnum (sub1 $ digitToInt x)
    chro ('1' :  x  : ',' :_ ) | isDigit x  = toEnum (sub1 $ (10+) $ digitToInt x)
    chro ('1' :  x  :_ )       | isDigit x  = toEnum (sub1 $ (10+) $ digitToInt x)
    chro _                                  = error "BAD header"
parseChromosome ('m':'i':'t':'o':'c':'h':'o':'n':'d':'r':'i':'o':'n':_) = CMT
parseChromosome _ = error "BAD header"

Best wishes

Stephen
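A quick sanity check in ghci (the header strings are invented for the example):

ghci> parseChromosome "chromosome 12, some description"
C12
ghci> parseChromosome "chromosome X unlocalized scaffold"
CX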
participants (10)

- Achim Schneider
- Arnoldo Muller
- Brandon S. Allbery KF8NH
- Bulat Ziganshin
- Daniel Fischer
- David Leimbach
- Jason Dagit
- Ketil Malde
- Rafael Almeida
- Stephen Tetley