
Ivan Lazar Miljenovic
Seeing as how the genome just uses 4 base "letters",
Yes, the bulk of the data is not really "text" at all, but each sequence (it's fragmented due to the molecular division into chromosomes, and due to incompleteness) also has a textual header. Generally, the Fasta format looks like this:
>sequence-id some arbitrary metadata blah blah
ACGATATACGCGCATGCGAT...
..lines and lines of letters...
(As an aside, although there are only four nucleotides (ACGT), there are occasional wildcard characters, the most common being N for aNy nucleotide, but there are defined wildcards for all subsets of the alphabet.)
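To make the layout above concrete, here is a minimal sketch of splitting Fasta input into records with Data.ByteString.Char8. It assumes the usual convention that a header line starts with '>' and that the sequence id is the header up to the first space. The names FastaRecord, parseFasta, recId and countAmbiguous are made up for this illustration, and the code ignores complications such as Windows line endings.

```haskell
import qualified Data.ByteString.Char8 as B

-- | One Fasta record: the header line (without the leading '>') and the
-- sequence data with the line breaks removed. (Hypothetical type.)
data FastaRecord = FastaRecord
  { recHeader   :: B.ByteString
  , recSequence :: B.ByteString
  } deriving (Eq, Show)

-- | Split the input into records. Any lines before the first header
-- line are ignored.
parseFasta :: B.ByteString -> [FastaRecord]
parseFasta = go . dropWhile (not . isHeader) . B.lines
  where
    isHeader l = not (B.null l) && B.head l == '>'
    go []         = []
    go (h : rest) =
      let (seqLines, others) = break isHeader rest
      in  FastaRecord (B.drop 1 h) (B.concat seqLines) : go others

-- | The sequence id: the header up to the first space.
recId :: FastaRecord -> B.ByteString
recId = B.takeWhile (/= ' ') . recHeader

-- | Count the symbols that are not plain A, C, G or T, i.e. wildcards
-- such as N (and anything else unexpected).
countAmbiguous :: FastaRecord -> Int
countAmbiguous = B.length . B.filter (`notElem` "ACGTacgt") . recSequence
```

For genome-sized files one would probably use the lazy variant (Data.ByteString.Lazy.Char8) instead; the strict version is used here only to keep the sketch short.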
wouldn't it be better to not treat it as text but use something else?
I generally use ByteStrings, with the .Char8 interface if/when appropriate. This is actually a pretty good choice; even if people use Unicode in the headers, I don't particularly want to care - as long as it is transparent.

In some cases, I'd like to, say, search headers for some specific string - in these cases, a nice, tidy, rich, and optimized Data.ByteString(.Lazy).UTF8 would be nice. But it's obviously not terribly essential at the moment, since I haven't bothered to test the available options. I guess for my stuff, the (human consumable) text bits are neither very performance intensive, nor large, so I could probably, and fairly cheaply, wrap relevant operations or fields with Data.Text's {de,en}codeUtf8. And in practice - partly due to lacking software support, I'm sure - it's all ASCII anyway. :-)

It'd be nice to have efficient substring searches, regular expressions, etc. for the sequence data, but often this will be better addressed by more specific algorithms, and in any case, a .Char8 implementation is likely to be more efficient than any gratuitous Unicode encoding.
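A rough sketch of the two kinds of search mentioned above, assuming the bytestring and text packages: headers are decoded to Text only at the point where they are searched, while sequence data stays as raw bytes. The helper names headerText, headerMentions and findMotif are invented for this example; decodeUtf8With, lenientDecode, toCaseFold, isInfixOf and breakSubstring are the library functions being relied on.

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- | Decode a header as UTF-8. Invalid bytes become U+FFFD instead of
-- raising an exception, so ASCII and Latin-1-ish headers still work.
headerText :: B.ByteString -> T.Text
headerText = decodeUtf8With lenientDecode

-- | Case-insensitive test for a string occurring in a header.
headerMentions :: T.Text -> B.ByteString -> Bool
headerMentions needle hdr =
  T.toCaseFold needle `T.isInfixOf` T.toCaseFold (headerText hdr)

-- | Byte-level search in sequence data: the offset of the first
-- occurrence of a motif, if there is one.
findMotif :: B.ByteString -> B.ByteString -> Maybe Int
findMotif motif sq
  | B.null motif = Just 0
  | B.null rest  = Nothing
  | otherwise    = Just (B.length before)
  where
    (before, rest) = B.breakSubstring motif sq
```

For example, findMotif (B.pack "GCG") (B.pack "ACGATATACGCGCATGCGAT") gives Just 9. This is plain exact matching; it does not understand the wildcard characters, which is one reason more specific algorithms tend to be the better tool for sequence data.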
(in case someone is trying to do their mad genetic manipulation by hand)?
You'd be surprised what a determined biologist can achieve, armed only with Word, Excel, and a reckless disregard for surmountability.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants