[ANN] bloomfilter 1.0 - Fast immutable and mutable Bloom filters

I'm pleased to announce the availability of a fast Bloom filter library for Haskell. A Bloom filter is a probabilistic data structure that provides a fast set membership querying capability. It does not give false negatives, but has a tunable false positive rate. (A false positive arises when the filter claims that an element is present, but in fact it is not.) The library is easy to use. As an example, here's a reimplementation of the Unix "spell" command. import Data.BloomFilter.Easy (easyList, elemB) main = do filt <- (easyList 0.01 . words) `fmap` readFile "dictionary.txt" let check word | word `elemB` filt = "" | otherwise = word ++ "\n" interact (concat . map check . lines) It is also carefully tuned for performance. On my laptop, I can sustain a construction or query rate well in excess of a million ByteStrings per second. Source code: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bloomfilter Latest bits: darcs get http://darcs.serpentine.com/bloomfilter

Bryan O'Sullivan
A Bloom filter is a probabilistic data structure that provides a fast set membership querying capability. It does not give false negatives, but has a tunable false positive rate. (A false positive arises when the filter claims that an element is present, but in fact it is not.)
/me squints. Please tell me that this isn't reversible. -- (c) this sig last receiving data processing entity. Inspect headers for past copyright information. All rights reserved. Unauthorised copying, hiring, renting, public performance and/or broadcasting of this signature prohibited.

On 2008-05-30, Achim Schneider
Bryan O'Sullivan
wrote: A Bloom filter is a probabilistic data structure that provides a fast set membership querying capability. It does not give false negatives, but has a tunable false positive rate. (A false positive arises when the filter claims that an element is present, but in fact it is not.)
/me squints.
Please tell me that this isn't reversible.
Tell me what you mean by "reversible". You can't, for instance, extract the items in the set. -- Aaron Denney -><-

Aaron Denney
On 2008-05-30, Achim Schneider
wrote: Bryan O'Sullivan
wrote: A Bloom filter is a probabilistic data structure that provides a fast set membership querying capability. It does not give false negatives, but has a tunable false positive rate. (A false positive arises when the filter claims that an element is present, but in fact it is not.)
/me squints.
Please tell me that this isn't reversible.
Tell me what you mean by "reversible". You can't, for instance, extract the items in the set.
I guess invertible would have been the right word, though it's still ambiguous. Turning it into something that does not give false positives, but has a tunable false negative rate. Without looking at the algorithm, I imagine it working somewhat like a hashtable, and this inversion would utterly destroy my intuition. -- (c) this sig last receiving data processing entity. Inspect headers for past copyright information. All rights reserved. Unauthorised copying, hiring, renting, public performance and/or broadcasting of this signature prohibited.

Achim Schneider wrote:
Aaron Denney
wrote: On 2008-05-30, Achim Schneider
wrote: Bryan O'Sullivan
wrote: A Bloom filter is a probabilistic data structure that provides a fast set membership querying capability. It does not give false negatives, but has a tunable false positive rate. (A false positive arises when the filter claims that an element is present, but in fact it is not.)
/me squints.
Please tell me that this isn't reversible.
Tell me what you mean by "reversible". You can't, for instance, extract the items in the set.
I guess invertible would have been the right word, though it's still ambiguous.
Turning it into something that does not give false positives, but has a tunable false negative rate.
Without looking at the algorithm, I imagine it working somewhat like a hashtable, and this inversion would utterly destroy my intuition.
Without looking at the code to verify that this is how it has really been implemented, a bloom filter is like a series of hash tables, where the hash table entries are one bit. The bit is set if there is an item that hashes to that value in the bloom filter. So, assuming a single hash table where half the bits are set, there's a 50% false positive rate and no false negatives when you do a membership test. To reduce the false positives, we can add another hash table with a different hash function. Assuming it also is half full, we can check if an item is in both tables, and our false positive rate drops to 25%. In practice, one might use something like 32 hash tables. This yields a false positive rate of 1/(2^32). Their most obvious application is to store the dictionary for a spell checker in a space-efficient way, though I have a friend who wrote a paper on using them for router caches. There was a google tech talk on bloom filters awhile ago: http://www.youtube.com/watch?v=947gWqwkhu0 -jim

On Sat, May 31, 2008 at 6:22 PM, Jim Snow
In practice, one might use something like 32 hash tables. This yields a false positive rate of 1/(2^32). Their most obvious application is to store the dictionary for a spell checker in a space-efficient way, though I have a friend who wrote a paper on using them for router caches.
One minor technicality is that you don't actually use k separate hash tables. You use k separate hash functions, and hash using different functions into the same physical table with a goal of having approximately half of the bits in the table set when all of your data is hashed.

* Jim Snow
Without looking at the code to verify that this is how it has really been implemented, a bloom filter is like a series of hash tables, where the hash table entries are one bit. The bit is set if there is an item that hashes to that value in the bloom filter. So, assuming a single hash table where half the bits are set, there's a 50% false positive rate and no false negatives when you do a membership test.
To reduce the false positives, we can add another hash table with a different hash function. Assuming it also is half full, we can check if an item is in both tables, and our false positive rate drops to 25%.
In practice, one might use something like 32 hash tables. This yields a false positive rate of 1/(2^32). Their most obvious application is to store the dictionary for a spell checker in a space-efficient way, though I have a friend who wrote a paper on using them for router caches.
There was a google tech talk on bloom filters awhile ago: http://www.youtube.com/watch?v=947gWqwkhu0
One other use: LOAF is a simple extension to email that lets you append your entire address book to outgoing mail message without compromising your privacy. Correspondents can use this information to prioritize their mail, and learn more about their social networks. The LOAF home page is at http://loaf.cantbedone.org.

On Fri, May 30, 2008 at 11:30 PM, Bryan O'Sullivan
I'm pleased to announce the availability of a fast Bloom filter library for Haskell. A Bloom filter is a probabilistic data structure that provides a fast set membership querying capability. It does not give false negatives, but has a tunable false positive rate. (A false positive arises when the filter claims that an element is present, but in fact it is not.)
The library is easy to use. As an example, here's a reimplementation of the Unix "spell" command.
import Data.BloomFilter.Easy (easyList, elemB)
main = do filt <- (easyList 0.01 . words) `fmap` readFile "dictionary.txt" let check word | word `elemB` filt = "" | otherwise = word ++ "\n" interact (concat . map check . lines)
It is also carefully tuned for performance. On my laptop, I can sustain a construction or query rate well in excess of a million ByteStrings per second.
Source code:
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bloomfilter
Latest bits:
darcs get http://darcs.serpentine.com/bloomfilter
The Hashable stuff in there looks like it might be independently useful. Any interest in splitting it out into an independent package or is it really intended to be something fairly specific to the Bloom filter implementation?

David MacIver wrote:
The Hashable stuff in there looks like it might be independently useful.
Probably, yes.
Any interest in splitting it out into an independent package or is it really intended to be something fairly specific to the Bloom filter implementation?
I'll split them if there's a request, but time constraints must alas prevail in the face of mere polite interest :-)

On Sat, May 31, 2008 at 4:09 PM, Bryan O'Sullivan
David MacIver wrote:
The Hashable stuff in there looks like it might be independently useful.
Probably, yes.
Any interest in splitting it out into an independent package or is it really intended to be something fairly specific to the Bloom filter implementation?
I'll split them if there's a request, but time constraints must alas prevail in the face of mere polite interest :-)
Seems fair enough to me. I was indeed just curious. :-) I'll shout if I have an actual use case (or possibly just include the whole thing. It's not exactly a major size burden)

What kind of speed do you get on your laptop for Data.Set? How much
faster is the bloom filter?
thomas.
2008/5/30 Bryan O'Sullivan
I'm pleased to announce the availability of a fast Bloom filter library for Haskell. A Bloom filter is a probabilistic data structure that provides a fast set membership querying capability. It does not give false negatives, but has a tunable false positive rate. (A false positive arises when the filter claims that an element is present, but in fact it is not.)
The library is easy to use. As an example, here's a reimplementation of the Unix "spell" command.
import Data.BloomFilter.Easy (easyList, elemB)
main = do filt <- (easyList 0.01 . words) `fmap` readFile "dictionary.txt" let check word | word `elemB` filt = "" | otherwise = word ++ "\n" interact (concat . map check . lines)
It is also carefully tuned for performance. On my laptop, I can sustain a construction or query rate well in excess of a million ByteStrings per second.
Source code:
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bloomfilter
Latest bits:
darcs get http://darcs.serpentine.com/bloomfilter
http://www.haskell.org/mailman/listinfo/haskell-cafe

"Thomas Hartman"
What kind of speed do you get on your laptop for Data.Set? How much faster is the bloom filter?
I tried to modify examples/Words.hs to use Data.Set insted. The results look like this (first Bloom, second Data.Set, both compiled with -O2): nmd9999:..filter/examples % ./Words 57025 words 0.013326ss to count words Bloom { 4194304 bits } 0.050608ss to construct filter 0.034806ss to query every element nmd9999:..filter/examples % ./WordsS 57025 words 0.013291ss to count words False 0.755115ss to construct filter 0.423289ss to query every element In order to avoid printing the entire set, while still evaluating it, I replaced the printing of the set with printing the result of a search for a non-existing element - I should really use a strict insert, I guess. Anyway, this hopefully gives an indication - looks like a factor of 10 in this case, but it will depend on the size of the data - more data, greater improvement. BTW, Nice work, Bryan! I have plans for this. -k -- If I haven't seen further, it is by standing in the footprints of giants
participants (10)
-
Aaron Denney
-
Achim Schneider
-
ajb@spamcop.net
-
Bryan O'Sullivan
-
David MacIver
-
Edward Kmett
-
Jim Snow
-
Ketil Malde
-
Scott Cruzen
-
Thomas Hartman