
Am Mittwoch 10 März 2010 23:01:28 schrieb Arnoldo Muller:
Hello Daniel:
Thanks! I employed mapM'_ but I am still getting the space leak. Any other hint?
Hmm, offhand, I don't see why that isn't strict enough. With some datafiles, I could try to investigate. One question, how does programme C with main = do [input, output, windowSize] <- getArgs let wSize = (read windowSize)::Int genomeExecute output wSize filterWindow input behave? Space leak or not? But yes, a few other hints I have (though they're not likely to squash the space leak). Generally, ByteString IO is often orders of magnitude faster than String IO and uses much less memory, so using (lazy) ByteStrings is worthy of consideration.
Arnoldo
-- define a window type Sequence = [Char] -- Window data data Window = Window { sequen :: Sequence, chrom :: Chromosome, pos :: Int } -- print a window instance Show Window where show w = (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos w) -- Reading fasta files with haskell -- Initialize the main = do -- get the arguments (intput is [input, output, windowSize] <- getArgs -- get directory contents (only names) names <- getDirectoryContents input -- prepend directory let fullNames = filter isFastaFile $ map (\x -> input ++ "/" ++ x) names ********* let fullNames = map ((input ++) . ("/" ++)) $ filter isFastaFile names saves a little work ********* let wSize = (read windowSize)::Int -- process the directories mapM (genomeExecute output wSize filterWindow) fullNames -- read the files one by one and write them to the output file genomeExecute :: String -> Int -> (Window -> Bool) -> String -> IO () genomeExecute outputFile windowSize f inputFile = do fileData <- readFile inputFile appendFile outputFile $ fastaExtractor fileData windowSize f ********* The arguments of fastaExtractor should be reversed, then genomeExecute outputFile windowSize f inputFile = appendFile outputFile . fastaExtractor' f windowSize =<< readFile inputFile ********* -- isFastaFile :: String -> Bool isFastaFile fileName = isSuffixOf ".fa" fileName -- fasta extractor (receives a Fasta String and returns a windowed string ready to be sorted) -- an example on how to compose several functions to parse a fasta file fastaExtractor :: String -> Int -> (Window -> Bool) -> String fastaExtractor input wSize f = printWindowList $ filter f $ readFasta wSize input ********* fastaExtractor' f wSize = printWindowList . filter f . readFasta wSize ********* -- MAIN FILTER that removes N elements from the strings! filterWindow :: Window -> Bool filterWindow w = not (elem 'N' (sequen w)) ********* filterWindow w = 'N' `notElem` sequen w ********* -- print a window list (the printing makes it ready for output as raw data) printWindowList :: [Window] -> String printWindowList l = unlines $ map show l -- read fasta, remove stuff that is not useful from it -- removes the readFasta :: Int -> [Char] -> [Window] readFasta windowSize sequence = -- get the header let (header:rest) = lines sequence chr = parseChromosome header in -- We now do the following: -- take window create counter remove newlines map (\(i, w) -> Window w chr i) $ zip [0..] $ slideWindow windowSize $ filter ( '\n' /= ) $ unlines rest ********* filter ('\n' /=) . unlines is odd. What about concat? Or readFasta wSize chrseq = case span (/= '\n') chrseq of (header, _:rest) -> let chr = parseChromosome header in map (\(i,w) -> Window w chr i) . zip [0 .. ] . slideWindow wSize $ filter (/= '\n') rest _ -> [] if your input file format had no other newline than the one between header and body, that'd be nice. ********* slideWindow :: Int -> [Char] -> [[Char]] slideWindow _ [] = [] slideWindow windowSize l@(_:xs) = take windowSize l : slideWindow windowSize xs ********* slideWindow wSize = map (take wSize) . tails *********