
17 Aug 2010, 11:28 p.m.
On Aug 17, 2010, at 11:51 PM, Ketil Malde wrote:
Yitzchak Gale writes: I don't think the genome is typical text.
I think the typical *large* collection of text is text-encoded data, and not, for lack of a better word, literature. Genomics data is just an example.
I have a collection of 100,000 patents I'm working with. 5.5GB of XML, most of it (US-)English text. After stripping out the XML markup, it's 4GB of text. It's a random sample from some 14 million patents I could have access to, but 100,000 was more than enough.