Re: [Haskell-cafe] Space leak

11 Mar 2010

      Hi Arnoldo

This doesn't address the space leak, but your parseChromosome function
looks very inefficient - isInfixOf is repeatedly checking the prefix
"chromosome" for C1 to CY. If you have a lot of CY's in a file then it
will do a lot of work parsing them.

The cleanest way of handling this would be to use a parser combinator
library with keywords for "chromosome" and "mitochondrion" - however
that might add a performance penalty itself.

Here is a version that should be fairly efficient although a little
ugly due to how it has to match literal chars in prefix of the string:

Add a import for Data.Char to the import list:
...
import Data.Char
Add Enum to the deriving clause of the Chromosome data type:
...
| C19
               | CX
               | CY
               | CMT
                 deriving (Show,Enum)
Replace parseChromosome with the one below.

Note that the derived Enum functions for Chromosome are indexed from
0.. whereas when you read one from the file it is indexed from 1.. so
you have to sub1 before using toEnum:

sub1 :: Int -> Int
sub1 x = x-1

parseChromosome :: [Char] -> Chromosome
parseChromosome ('c':'h':'r':'o':'m':'o':'s':'o':'m':'e':' ':xs) = chro xs
  where
   chro ('X' :_)              = CX
   chro ('Y' :_)              = CY
   chro ( x  : ',' :_)        | isDigit x  = toEnum (sub1 $ digitToInt x)
   chro ('1' :  x  : ',' :_ ) | isDigit x  = toEnum (sub1 $ (10+) $
digitToInt x)
   chro ('1' :  x  :_ )       | isDigit x  = toEnum (sub1 $ (10+) $
digitToInt x)
   chro _                     = error "BAD header"

parseChromosome ('m':'i':'t':'o':'c':'h':'o':'n':'d':'r':'i':'o':'n':_) = CMT
parseChromosome _                 = error "BAD header"

Best wishes

Stephen

Re: [Haskell-cafe] Space leak

Stephen Tetley