Thanks for the response.
> Let us say, we are using a bit-array of size 2^43 (that is, a byte array of
> size 2^40) to store a bloom filter. And let us further assume that we are
> interested in a false-positive probability of 0.01

Since we are just making up numbers, let us instead say we are using a
bit array of size 2^32 - still too large for Int indexing (which was the
issue here) but "only" 500MB in size.
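For the record, one way around the Int-indexing limit is to do the
bit-address arithmetic in Word64 and only convert the small bit offset
back to Int. A minimal sketch (the function name is mine, not from any
library):

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Word (Word64)

-- Split a (possibly > 2^31) bit index into a 64-bit word index and a
-- bit offset within that word, doing all arithmetic in Word64 so a
-- 32-bit Int never sees the full index.
bitAddress :: Word64 -> (Word64, Int)
bitAddress i = (i `shiftR` 6, fromIntegral (i .&. 63))
```

Only the word index ever needs the full range; the offset is at most 63,
so it is always safe as an Int.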
> That means, I will be able to use this array to represent a set of
> cardinality 9.18e11 ~ 10^12

...bringing it down to less than 10^9, easily reached when building an
index of k-words (tuples) in e.g. the human genome (3GB).
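Enumerating the k-words is the easy part; a sequence of length n has
n - k + 1 of them, which is where the ~3*10^9 for the human genome comes
from. A throwaway sketch:

```haskell
import Data.List (tails)

-- All overlapping k-words (k-mers) of a sequence: a sequence of
-- length n yields n - k + 1 of them.
kwords :: Int -> [a] -> [[a]]
kwords k = takeWhile ((== k) . length) . map (take k) . tails
```

In practice you would hash each k-word into the bit array rather than
keep the lists around, but the enumeration is the same.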
But: I'm about to start analyzing Illumina sequencing data, where we have
sequences from two individuals of the same species. I'm interested in
the differences between these individuals, so I might want to index both
data sets and screen the sequences of each against the other to identify
bits that don't appear in the other. Since I'll easily have tens, maybe
a hundred gigabytes of sequence from each, it's not impossible to get
into the territory you describe.
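The screening step itself is simple once the index exists. Here is an
exact-set stand-in (using Data.Set instead of a Bloom filter, which
would give the same answer modulo false positives); names are mine:

```haskell
import Data.List (tails)
import qualified Data.Set as Set

-- All overlapping k-words of a sequence.
kwords :: Int -> [a] -> [[a]]
kwords k = takeWhile ((== k) . length) . map (take k) . tails

-- k-words of sequence b that do not occur anywhere in sequence a:
-- build an index of a's k-words, then filter b's against it.
novelKwords :: Ord a => Int -> [a] -> [a] -> [[a]]
novelKwords k a b = filter (`Set.notMember` indexA) (kwords k b)
  where indexA = Set.fromList (kwords k a)
```

For real data the Set would be replaced by the Bloom filter, trading
exactness for the memory frugality discussed above.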
(In practice, things will be limited by available RAM, sadly still some
orders of magnitude less than 2^40 - although apparently SGI can deliver
it. I guess I just need a bigger budget.)
> I was curious to know what sort of programs would be dealing with sets of
> 10^12 elements.

Most likely a program using mutation, and probably not copying,
multi-generation GC.
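In Haskell terms that means something like an unboxed mutable array
rather than a persistent structure - a toy example of the style (the
function is illustrative, not from any Bloom-filter package):

```haskell
import Control.Monad (forM_)
import Data.Array.IO (IOUArray, newArray, readArray, writeArray)

-- Destructively set some (hash-derived) positions in an unboxed bit
-- array, then query one position. The array is updated in place, so
-- the GC never copies it per update.
member :: Int -> [Int] -> Int -> IO Bool
member size ixs q = do
  arr <- newArray (0, size - 1) False :: IO (IOUArray Int Bool)
  forM_ ixs $ \i -> writeArray arr i True
  readArray arr q
```

An IOUArray of Bool is stored one byte per element; a serious
implementation would pack 64 bits per Word64 as sketched earlier.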
Other data sets of considerable size are acoustics data (I
understand a modern sonar delivers about a Gbit/sec continuously) and
seismic data.
Also, for more moderate bodies of data and more complex analysis, you
might want to use less frugal data structures, like suffix arrays.
-k
--
If I haven't seen further, it is by standing in the footprints of giants
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe