
Am Sonntag 14 März 2010 00:58:09 schrieb Arnoldo Muller:
Jason,
I am trying to use haskell in the analysis of bio data. One of the main reasons I wanted to use haskell is because lazy I/O allows you to see a large bio-sequence as if it was a string in memory. In order to achieve the same result in an imperative language I would have to write lots of error-prone iterators. I saw lazy I/O as a very strong point in favor of Haskell.
Besides the space leaks that can occur and that are a bit difficult to find for a newbie like me, are there any other reasons to avoid Lazy I/O?
You may be happy to hear that the space leak you encountered had __nothing whatsoever__ to do with lazy IO. It's true that lazy IO offers some pitfalls for the unwary (and some, but much fewer, for the wary), but I think the dangers of lazy IO tend to be exaggerated. For your application, readFile and appendFile are absolutely fine, the space leak occurred in the pure code. Below is a variant of your programme that doesn't use file-IO, the one readFasta function has the space leak, the currently commented-out one not. Compile with -O2, run with e.g. ./leak +RTS -s -M400M -RTS 3 10000000 10 one runs in constant space, the other not.
Arnoldo.
---------------------------------------------------------------------- {-# LANGUAGE BangPatterns #-} module Main (main) where import Data.List import System.Environment (getArgs) data Chromosome = C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 | C16 | C17 | C18 | C19 | CX | CY | CMT deriving (Show) type Sequence = [Char] data Window = Window { sequen :: Sequence, chrom :: Chromosome, pos :: Int } instance Show Window where show w = (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos w) main = do [ct, len, windowSize] <- getArgs let wSize = (read windowSize)::Int ln = read len inData = [(cn,ln) | cn <- [1 .. read ct]] mapM_ (uncurry $ genomeExecute filterWindow wSize) inData countLines :: String -> Int countLines = go 0 where go !acc [] = acc go !acc ('\n':cs) = go (acc+1) cs go !acc (_:cs ) = go acc cs genomeExecute :: (Window -> Bool) -> Int -> Int -> Int -> IO () genomeExecute flt wSize cn ln = print . countLines $ fastaExtractor ("chromosome " ++ show cn ++ ",\n" ++ replicate (cn*ln) 'A') wSize flt fastaExtractor :: String -> Int -> (Window -> Bool) -> String fastaExtractor input wSize f = printWindowList $ filter f $ readFasta wSize input filterWindow :: Window -> Bool filterWindow w = not (elem 'N' (sequen w)) printWindowList :: [Window] -> String printWindowList l = unlines $ map show l {- readFasta :: Int -> [Char] -> [Window] readFasta windowSize sequence = let (header,rest) = span (/= '\n') sequence chr = parseChromosome header go i (w:ws) = Window w chr i : go (i+1) ws go _ [] = [] in go 0 $ slideWindow windowSize $ filter (/= '\n') rest -} readFasta :: Int -> [Char] -> [Window] readFasta windowSize sequence = let (header,rest) = span (/= '\n') sequence chr = parseChromosome header in map (\(i, w) -> Window w chr i) $ zip [0..] $ slideWindow windowSize $ filter ( '\n' /= ) rest slideWindow :: Int -> [Char] -> [[Char]] slideWindow _ [] = [] slideWindow windowSize l@(_:xs) = take windowSize l : slideWindow windowSize xs parseChromosome :: [Char] -> Chromosome parseChromosome line | isInfixOf "chromosome 1," line = C1 | isInfixOf "chromosome 2," line = C2 | isInfixOf "chromosome 3," line = C3 | isInfixOf "chromosome 4," line = C4 | isInfixOf "chromosome 5," line = C5 | isInfixOf "chromosome 6," line = C6 | isInfixOf "chromosome 7," line = C7 | isInfixOf "chromosome 8," line = C9 | isInfixOf "chromosome 10," line = C10 | isInfixOf "chromosome 11," line = C11 | isInfixOf "chromosome 12," line = C12 | isInfixOf "chromosome 13," line = C13 | isInfixOf "chromosome 14," line = C14 | isInfixOf "chromosome 15," line = C15 | isInfixOf "chromosome 16," line = C16 | isInfixOf "chromosome 17" line = C17 | isInfixOf "chromosome 18" line = C18 | isInfixOf "chromosome 19" line = C19 | isInfixOf "chromosome X" line = CX | isInfixOf "chromosome Y" line = CY | isInfixOf "mitochondrion" line = CMT | otherwise = error "BAD header" ----------------------------------------------------------------------