Re: [Haskell-cafe] Re: String vs ByteString

17 Aug 2010

Felipe Lessa writes:

[-snip- I've already spent too much time on the other stuff :-]
...
> And what do you think about creating a real SeqData data type
> with two bases per byte?  In terms of processing speed I guess
> there will be a small penalty, but if you need to have large
> quantities of base pairs in memory this would double your
> capacity =).

Yes, this is interesting in some cases.  Obvious downsides would be a
separate data type for protein sequences (20 characters, plus some
wildcards), and more complicated string comparison (when a match is off
by one).  Oh, and lower case is sometimes used to signify less
"important" regions, like repeats.
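To make the trade-off concrete, here is a minimal sketch of the "two bases per byte" idea, with one 4-bit code per base.  All names (SeqData, packBases, unpackBases, baseAt) and the code assignment are hypothetical, chosen only for illustration; this is not an existing Bio type.  Anything outside ACGT collapses to N, and case is dropped on the way in, which is exactly the masking information mentioned above.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (toUpper)
import Data.Word (Word8)

-- | Hypothetical packed sequence: length in bases plus a nibble-packed payload.
data SeqData = SeqData
  { seqLength :: !Int
  , seqBytes  :: !B.ByteString
  } deriving (Eq, Show)

-- | 4-bit code per base (assumed assignment). Case is discarded here.
encodeBase :: Char -> Word8
encodeBase c = case toUpper c of
  'A' -> 0
  'C' -> 1
  'G' -> 2
  'T' -> 3
  _   -> 4 -- everything else is treated as N

decodeBase :: Word8 -> Char
decodeBase w = case w of
  0 -> 'A'
  1 -> 'C'
  2 -> 'G'
  3 -> 'T'
  _ -> 'N'

-- | Pack two bases into each byte, first base in the high nibble.
packBases :: String -> SeqData
packBases s = SeqData (length s) (B.pack (go (map encodeBase s)))
  where
    go (hi : lo : rest) = (hi `shiftL` 4 .|. lo) : go rest
    go [hi]             = [hi `shiftL` 4] -- odd length: pad the low nibble
    go []               = []

-- | Unpack again; the stored length trims any padding nibble.
unpackBases :: SeqData -> String
unpackBases (SeqData n bs) = take n (concatMap split (B.unpack bs))
  where
    split w = [decodeBase (w `shiftR` 4), decodeBase (w .&. 0x0F)]

-- | Random access: even positions live in the high nibble, odd in the low.
baseAt :: SeqData -> Int -> Char
baseAt (SeqData _ bs) i =
  let w = B.index bs (i `div` 2)
  in decodeBase (if even i then w `shiftR` 4 else w .&. 0x0F)
```

Note that comparing two packed sequences byte by byte only works when both start on the same nibble; a match at an odd offset has to be shifted first, which is presumably where the more complicated comparison comes from.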

Another choice is the 2bit format (used by BLAT, and supported in Bio
for input/output, but not internally), which stores the alphabet proper
directly in 2-bit quantities, and uses separate lists for gaps, lower
case masking, and Ns (and is obviously extensible to wildcards).  Too
much extending, and you're likely to lose any benefit, though.
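
As a rough in-memory sketch of that layout (not the on-disk 2bit file format, and not how Bio handles it): four bases per byte, plus separate (start, length) block lists for Ns and for lower-case masking.  The type and function names are made up for illustration, and the T=0, C=1, A=2, G=3 assignment is what I believe the 2bit format uses, so treat it as an assumption.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (isLower, toLower, toUpper)
import Data.List (group)
import Data.Word (Word8)

-- | A run of positions: (start, length).
type Block = (Int, Int)

-- | Hypothetical in-memory layout in the spirit of the 2bit format.
data TwoBit = TwoBit
  { tbLength     :: Int
  , tbNBlocks    :: [Block]  -- runs of N
  , tbMaskBlocks :: [Block]  -- runs of lower-case (masked) bases
  , tbPacked     :: [Word8]  -- four bases per byte
  } deriving (Eq, Show)

-- | 2-bit code per base; Ns get a dummy code and are restored from the N list.
code :: Char -> Word8
code c = case toUpper c of
  'T' -> 0
  'C' -> 1
  'A' -> 2
  'G' -> 3
  _   -> 0

-- | Maximal runs of positions satisfying the predicate.
blocks :: (Char -> Bool) -> String -> [Block]
blocks p s = [ (start, len) | (f : _, start, len) <- zip3 grouped starts lens, f ]
  where
    grouped = group (map p s)
    lens    = map length grouped
    starts  = scanl (+) 0 lens

-- | Pack four 2-bit codes per byte, first base in the top bits, zero padded.
packCodes :: [Word8] -> [Word8]
packCodes [] = []
packCodes ws = foldl (\acc w -> acc `shiftL` 2 .|. w) 0 padded : packCodes rest
  where
    (chunk, rest) = splitAt 4 ws
    padded        = take 4 (chunk ++ repeat 0)

encode :: String -> TwoBit
encode s = TwoBit
  { tbLength     = length s
  , tbNBlocks    = blocks ((== 'N') . toUpper) s
  , tbMaskBlocks = blocks isLower s
  , tbPacked     = packCodes (map code s)
  }

-- | Unpack the bases, then re-apply the N and mask block lists.
decode :: TwoBit -> String
decode tb = [ render i b | (i, b) <- zip [0 ..] bases ]
  where
    bases        = take (tbLength tb) (concatMap unpackByte (tbPacked tb))
    unpackByte w = [ letter ((w `shiftR` s) .&. 3) | s <- [6, 4, 2, 0] ]
    letter k     = "TCAG" !! fromIntegral k
    inAny bs i   = any (\(st, len) -> i >= st && i < st + len) bs
    render i b   =
      let b' = if inAny (tbNBlocks tb) i then 'N' else b
      in if inAny (tbMaskBlocks tb) i then toLower b' else b'
```

Every extra annotation is another block list to consult on lookup, so adding wildcards the same way costs more on each access.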

Basically, it boils down to a set of different tradeoffs, and I think
ByteString is a fairly good choice in *most* cases, and it deals - if
not particularly elegantly, then at least fairly effectively - with
various conventions, like lower-casing or wild cards.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants

Ketil Malde