[ANN] bloomfilter 1.0 - Fast immutable and mutable Bloom filters - Haskell-Cafe

[ANN] bloomfilter 1.0 - Fast immutable and mutable Bloom filters

Bryan O'Sullivan

30 May 2008 30 May '08

10:30 p.m.

I'm pleased to announce the availability of a fast Bloom filter library for Haskell. A Bloom filter is a probabilistic data structure that provides a fast set membership querying capability. It does not give false negatives, but has a tunable false positive rate. (A false positive arises when the filter claims that an element is present, but in fact it is not.) The library is easy to use. As an example, here's a reimplementation of the Unix "spell" command. import Data.BloomFilter.Easy (easyList, elemB) main = do filt <- (easyList 0.01 . words) `fmap` readFile "dictionary.txt" let check word | word `elemB` filt = "" | otherwise = word ++ "\n" interact (concat . map check . lines) It is also carefully tuned for performance. On my laptop, I can sustain a construction or query rate well in excess of a million ByteStrings per second. Source code: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bloomfilter Latest bits: darcs get http://darcs.serpentine.com/bloomfilter

Show replies by date

Achim Schneider

30 May 30 May

10:51 p.m.

New subject: [ANN] bloomfilter 1.0 - Fast immutable and mutable Bloom filters

Bryan O'Sullivan wrote:

...

A Bloom filter is a probabilistic data structure that provides a fast set membership querying capability. It does not give false negatives, but has a tunable false positive rate. (A false positive arises when the filter claims that an element is present, but in fact it is not.)

/me squints. Please tell me that this isn't reversible. -- (c) this sig last receiving data processing entity. Inspect headers for past copyright information. All rights reserved. Unauthorised copying, hiring, renting, public performance and/or broadcasting of this signature prohibited.

ajb＠spamcop.net

31 May 31 May

7:25 a.m.

New subject: [ANN] bloomfilter 1.0 - Fast immutable and mutable Bloom filters

G'day all. Quoting Achim Schneider :

...

Please tell me that this isn't reversible.

It isn't reversible. Cheers, Andrew Bromage

Aaron Denney

5:45 p.m.

New subject: [ANN] bloomfilter 1.0 - Fast immutable and mutable Bloom filters

On 2008-05-30, Achim Schneider wrote:

...

Bryan O'Sullivan wrote:

...
A Bloom filter is a probabilistic data structure that provides a fast set membership querying capability. It does not give false negatives, but has a tunable false positive rate. (A false positive arises when the filter claims that an element is present, but in fact it is not.)

/me squints.

Please tell me that this isn't reversible.

Tell me what you mean by "reversible". You can't, for instance, extract the items in the set. -- Aaron Denney -><-

Achim Schneider

6:29 p.m.

New subject: [ANN] bloomfilter 1.0 - Fast immutable and mutable Bloom filters

Aaron Denney wrote:

...

On 2008-05-30, Achim Schneider wrote:

...
Bryan O'Sullivan wrote:

...
A Bloom filter is a probabilistic data structure that provides a fast set membership querying capability. It does not give false negatives, but has a tunable false positive rate. (A false positive arises when the filter claims that an element is present, but in fact it is not.)

/me squints.

Please tell me that this isn't reversible.

Tell me what you mean by "reversible". You can't, for instance, extract the items in the set.

I guess invertible would have been the right word, though it's still ambiguous. Turning it into something that does not give false positives, but has a tunable false negative rate. Without looking at the algorithm, I imagine it working somewhat like a hashtable, and this inversion would utterly destroy my intuition. -- (c) this sig last receiving data processing entity. Inspect headers for past copyright information. All rights reserved. Unauthorised copying, hiring, renting, public performance and/or broadcasting of this signature prohibited.

Jim Snow

10:22 p.m.

New subject: [ANN] bloomfilter 1.0 - Fast immutable and mutable Bloom filters

Achim Schneider wrote:

...

Aaron Denney wrote:

...
On 2008-05-30, Achim Schneider wrote:

...
Bryan O'Sullivan wrote:

...
A Bloom filter is a probabilistic data structure that provides a fast set membership querying capability. It does not give false negatives, but has a tunable false positive rate. (A false positive arises when the filter claims that an element is present, but in fact it is not.)

/me squints.

Please tell me that this isn't reversible.

Tell me what you mean by "reversible". You can't, for instance, extract the items in the set.

I guess invertible would have been the right word, though it's still ambiguous.

Turning it into something that does not give false positives, but has a tunable false negative rate.

Without looking at the algorithm, I imagine it working somewhat like a hashtable, and this inversion would utterly destroy my intuition.

Without looking at the code to verify that this is how it has really been implemented, a bloom filter is like a series of hash tables, where the hash table entries are one bit. The bit is set if there is an item that hashes to that value in the bloom filter. So, assuming a single hash table where half the bits are set, there's a 50% false positive rate and no false negatives when you do a membership test. To reduce the false positives, we can add another hash table with a different hash function. Assuming it also is half full, we can check if an item is in both tables, and our false positive rate drops to 25%. In practice, one might use something like 32 hash tables. This yields a false positive rate of 1/(2^32). Their most obvious application is to store the dictionary for a spell checker in a space-efficient way, though I have a friend who wrote a paper on using them for router caches. There was a google tech talk on bloom filters awhile ago: http://www.youtube.com/watch?v=947gWqwkhu0 -jim

Edward Kmett

2 Jun 2 Jun

7:25 p.m.

New subject: [ANN] bloomfilter 1.0 - Fast immutable and mutable Bloom filters

On Sat, May 31, 2008 at 6:22 PM, Jim Snow wrote:

...

In practice, one might use something like 32 hash tables. This yields a false positive rate of 1/(2^32). Their most obvious application is to store the dictionary for a spell checker in a space-efficient way, though I have a friend who wrote a paper on using them for router caches.

One minor technicality is that you don't actually use k separate hash tables. You use k separate hash functions, and hash using different functions into the same physical table with a goal of having approximately half of the bits in the table set when all of your data is hashed.

Scott Cruzen

3 Jun 3 Jun

11:03 p.m.

New subject: [ANN] bloomfilter 1.0 - Fast immutable and mutable Bloom filters

* Jim Snow [080531 15:23]:

...

Without looking at the code to verify that this is how it has really been implemented, a bloom filter is like a series of hash tables, where the hash table entries are one bit. The bit is set if there is an item that hashes to that value in the bloom filter. So, assuming a single hash table where half the bits are set, there's a 50% false positive rate and no false negatives when you do a membership test.

To reduce the false positives, we can add another hash table with a different hash function. Assuming it also is half full, we can check if an item is in both tables, and our false positive rate drops to 25%.

In practice, one might use something like 32 hash tables. This yields a false positive rate of 1/(2^32). Their most obvious application is to store the dictionary for a spell checker in a space-efficient way, though I have a friend who wrote a paper on using them for router caches.

There was a google tech talk on bloom filters awhile ago: http://www.youtube.com/watch?v=947gWqwkhu0

One other use: LOAF is a simple extension to email that lets you append your entire address book to outgoing mail message without compromising your privacy. Correspondents can use this information to prioritize their mail, and learn more about their social networks. The LOAF home page is at http://loaf.cantbedone.org.

David MacIver

31 May 31 May

9:03 a.m.

On Fri, May 30, 2008 at 11:30 PM, Bryan O'Sullivan wrote:

...

I'm pleased to announce the availability of a fast Bloom filter library for Haskell. A Bloom filter is a probabilistic data structure that provides a fast set membership querying capability. It does not give false negatives, but has a tunable false positive rate. (A false positive arises when the filter claims that an element is present, but in fact it is not.)

The library is easy to use. As an example, here's a reimplementation of the Unix "spell" command.

import Data.BloomFilter.Easy (easyList, elemB)

main = do filt <- (easyList 0.01 . words) `fmap` readFile "dictionary.txt" let check word | word `elemB` filt = "" | otherwise = word ++ "\n" interact (concat . map check . lines)

It is also carefully tuned for performance. On my laptop, I can sustain a construction or query rate well in excess of a million ByteStrings per second.

Source code:

http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bloomfilter

Latest bits:

darcs get http://darcs.serpentine.com/bloomfilter

The Hashable stuff in there looks like it might be independently useful. Any interest in splitting it out into an independent package or is it really intended to be something fairly specific to the Bloom filter implementation?

Bryan O'Sullivan

3:09 p.m.

David MacIver wrote:

...

The Hashable stuff in there looks like it might be independently useful.

Probably, yes.

...

Any interest in splitting it out into an independent package or is it really intended to be something fairly specific to the Bloom filter implementation?

I'll split them if there's a request, but time constraints must alas prevail in the face of mere polite interest :-)

David MacIver

3:13 p.m.

On Sat, May 31, 2008 at 4:09 PM, Bryan O'Sullivan wrote:

...

David MacIver wrote:

...
The Hashable stuff in there looks like it might be independently useful.

Probably, yes.

...
Any interest in splitting it out into an independent package or is it really intended to be something fairly specific to the Bloom filter implementation?

I'll split them if there's a request, but time constraints must alas prevail in the face of mere polite interest :-)

Seems fair enough to me. I was indeed just curious. :-) I'll shout if I have an actual use case (or possibly just include the whole thing. It's not exactly a major size burden)

Thomas Hartman

4:33 p.m.

What kind of speed do you get on your laptop for Data.Set? How much faster is the bloom filter? thomas. 2008/5/30 Bryan O'Sullivan :

...

I'm pleased to announce the availability of a fast Bloom filter library for Haskell. A Bloom filter is a probabilistic data structure that provides a fast set membership querying capability. It does not give false negatives, but has a tunable false positive rate. (A false positive arises when the filter claims that an element is present, but in fact it is not.)

The library is easy to use. As an example, here's a reimplementation of the Unix "spell" command.

import Data.BloomFilter.Easy (easyList, elemB)

main = do filt <- (easyList 0.01 . words) `fmap` readFile "dictionary.txt" let check word | word `elemB` filt = "" | otherwise = word ++ "\n" interact (concat . map check . lines)

It is also carefully tuned for performance. On my laptop, I can sustain a construction or query rate well in excess of a million ByteStrings per second.

Source code:

http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bloomfilter

Latest bits:

darcs get http://darcs.serpentine.com/bloomfilter

http://www.haskell.org/mailman/listinfo/haskell-cafe

0 0

Reply
Sign in to reply online Use email software

Ketil Malde

3 Jun 3 Jun
8:26 a.m.

"Thomas Hartman" writes:

...
What kind of speed do you get on your laptop for Data.Set? How much faster is the bloom filter?

I tried to modify examples/Words.hs to use Data.Set insted. The results look like this (first Bloom, second Data.Set, both compiled with -O2): nmd9999:..filter/examples % ./Words 57025 words 0.013326ss to count words Bloom { 4194304 bits } 0.050608ss to construct filter 0.034806ss to query every element nmd9999:..filter/examples % ./WordsS 57025 words 0.013291ss to count words False 0.755115ss to construct filter 0.423289ss to query every element In order to avoid printing the entire set, while still evaluating it, I replaced the printing of the set with printing the result of a search for a non-existing element - I should really use a strict insert, I guess. Anyway, this hopefully gives an indication - looks like a factor of 10 in this case, but it will depend on the size of the data - more data, greater improvement. BTW, Nice work, Bryan! I have plans for this. -k -- If I haven't seen further, it is by standing in the footprints of giants

0 0

Reply
Sign in to reply online Use email software

6241
Age (days ago)

6245
Last active (days ago)

List overview
Download

12 comments

10 participants

Add to favorites Remove from favorites

tags

participants (10)

Aaron Denney

Achim Schneider

ajb＠spamcop.net

Bryan O'Sullivan

David MacIver

Edward Kmett

Jim Snow

Ketil Malde

Scott Cruzen

Thomas Hartman