
Felipe Lessa
And what do you think about creating a real SeqData data type with two bases per byte? In terms of processing speed I guess there will be a small penalty, but if you need to have large quantities of base pairs in memory this would double your capacity =).
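For concreteness, the two-bases-per-byte idea could be sketched roughly as below. This is only an illustration under stated assumptions: `PackedSeq`, `pack`, `unpack`, and the nibble codes are hypothetical names and values invented here, not part of the Bio library, and a plain list of `Word8` stands in for whatever packed storage a real implementation would use.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word8)

-- Hypothetical packed sequence: the number of bases, then the packed bytes.
-- The count is needed because an odd-length sequence leaves a padding nibble.
data PackedSeq = PackedSeq Int [Word8]
  deriving (Eq, Show)

-- Assumed 4-bit codes; only upper-case A, C, G, T and N are handled here.
encodeBase :: Char -> Maybe Word8
encodeBase c = case c of
  'A' -> Just 0
  'C' -> Just 1
  'G' -> Just 2
  'T' -> Just 3
  'N' -> Just 4
  _   -> Nothing

-- Inverse of encodeBase; any unassigned code is shown as 'N'.
decodeBase :: Word8 -> Char
decodeBase w = case w of
  0 -> 'A'
  1 -> 'C'
  2 -> 'G'
  3 -> 'T'
  _ -> 'N'

-- Pack two bases into each byte: first base in the high nibble,
-- second base in the low nibble. Fails on any unsupported character.
pack :: String -> Maybe PackedSeq
pack s = do
  codes <- mapM encodeBase s
  return (PackedSeq (length s) (pairUp codes))
  where
    pairUp (hi : lo : rest) = (hi `shiftL` 4 .|. lo) : pairUp rest
    pairUp [hi]             = [hi `shiftL` 4]  -- odd length: pad low nibble
    pairUp []               = []

-- Unpack both nibbles of every byte, then drop the padding nibble if any.
unpack :: PackedSeq -> String
unpack (PackedSeq n bytes) = take n (concatMap split bytes)
  where
    split b = [decodeBase (b `shiftR` 4), decodeBase (b .&. 0x0F)]
```

With these definitions, `pack "ACGT"` yields two bytes for four bases, and `fmap unpack (pack s)` returns `Just s` for any supported `s`. Note that this sketch rejects lower-case input outright.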
Yes, this is interesting in some cases. Obvious downsides would be the need for a separate data type for protein sequences (20 characters, plus some wildcards), and more complicated string comparison (when a match is off by one). Oh, and lower case is sometimes used to signify less "important" regions, like repeats.

Another choice is the 2bit format (used by BLAT, and supported in Bio for input/output, but not internally), which stores the alphabet proper directly in 2-bit quantities, and uses separate lists for gaps, lower-case masking, and Ns (and is obviously extensible to wildcards). Too much extending, and you're likely to lose any benefit, though.

Basically, it boils down to a set of different tradeoffs, and I think ByteString is a fairly good choice in *most* cases. It deals, if not particularly elegantly, then at least fairly effectively, with various conventions like lower-casing or wildcards.

-k
--
If I haven't seen further, it is by standing in the footprints of giants