Modern, high performance bloom filter in Python? [closed]

1 Answers

I recently went down this path as well; though it sounds like my application was slightly different. I was interested in approximating set operations on a large number of strings.

You do make the key observation that a fast bit vector is required. Depending on what you want to put in your bloom filter, you may also need to give some thought to the speed of the hashing algorithm(s) used. You might find this library useful. You may also want to tinker with the random number technique used below that only hashes your key a single time.

In terms of non-Java bit array implementations:

Boost has dynamic_bitset
Java has the built in BitSet

I built my bloom filter using BitVector. I spent some time profiling and optimizing the library and contributing back my patches to Avi. Go to that BitVector link and scroll down to acknowledgments in v1.5 to see details. In the end, I realized that performance was not a goal of this project and decided against using it.

Here's some code I had lying around. I may put this up on google code at python-bloom. Suggestions welcome.

from BitVector import BitVector from random import Random # get hashes from http://www.partow.net/programming/hashfunctions/index.html from hashes import RSHash, JSHash, PJWHash, ELFHash, DJBHash   # # [email protected] / www.asciiarmor.com # # copyright (c) 2008, ryan cox # all rights reserved  # BSD license: http://www.opensource.org/licenses/bsd-license.php #  class BloomFilter(object):     def __init__(self, n=None, m=None, k=None, p=None, bits=None ):         self.m = m         if k > 4 or k < 1:             raise Exception('Must specify value of k between 1 and 4')         self.k = k         if bits:             self.bits = bits         else:             self.bits = BitVector( size=m )         self.rand = Random()         self.hashes = []         self.hashes.append(RSHash)         self.hashes.append(JSHash)         self.hashes.append(PJWHash)         self.hashes.append(DJBHash)          # switch between hashing techniques         self._indexes = self._rand_indexes         #self._indexes = self._hash_indexes      def __contains__(self, key):         for i in self._indexes(key):              if not self.bits[i]:                 return False             return True       def add(self, key):         dupe = True          bits = []         for i in self._indexes(key):              if dupe and not self.bits[i]:                 dupe = False             self.bits[i] = 1             bits.append(i)         return dupe      def __and__(self, filter):         if (self.k != filter.k) or (self.m != filter.m):              raise Exception('Must use bloom filters created with equal k / m paramters for bitwise AND')         return BloomFilter(m=self.m,k=self.k,bits=(self.bits & filter.bits))      def __or__(self, filter):         if (self.k != filter.k) or (self.m != filter.m):              raise Exception('Must use bloom filters created with equal k / m paramters for bitwise OR')         return BloomFilter(m=self.m,k=self.k,bits=(self.bits | filter.bits))      def _hash_indexes(self,key):         ret = []         for i in range(self.k):             ret.append(self.hashes[i](key) % self.m)         return ret      def _rand_indexes(self,key):         self.rand.seed(hash(key))         ret = []         for i in range(self.k):             ret.append(self.rand.randint(0,self.m-1))         return ret  if __name__ == '__main__':     e = BloomFilter(m=100, k=4)     e.add('one')     e.add('two')     e.add('three')     e.add('four')     e.add('five')              f = BloomFilter(m=100, k=4)     f.add('three')     f.add('four')     f.add('five')     f.add('six')     f.add('seven')     f.add('eight')     f.add('nine')     f.add("ten")              # test check for dupe on add     assert not f.add('eleven')      assert f.add('eleven')       # test membership operations     assert 'ten' in f      assert 'one' in e      assert 'ten' not in e      assert 'one' not in f               # test set based operations     union = f | e     intersection = f & e      assert 'ten' in union     assert 'one' in union      assert 'three' in intersection     assert 'ten' not in intersection     assert 'one' not in intersection

Also, in my case I found it useful to have a faster count_bits function for BitVector. Drop this code into BitVector 1.5 and it should give you a more performant bit counting method:

def fast_count_bits( self, v ):     bits = (             0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,             1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,             1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,             2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,             1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,             2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,             2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,             3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,             1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,             2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,             2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,             3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,             2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,             3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,             3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,             4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8 )      return bits[v & 0xff] + bits[(v >> 8) & 0xff] + bits[(v >> 16) & 0xff] + bits[v >> 24]

161

answered Sep 17 '22 04:09

Ryan Cox

Related questions
                            
                                How does this lambda/yield/generator comprehension work?
                            
                                Explain Apache Beam python syntax
                            
                                Attributes initialization/declaration in Python class: where to place them?
                            
                                Why do some built-in Python functions only have pass?
                            
                                Accepting a dictionary as an argument with argparse and python [duplicate]
                            
                                What is the opposite of python's ord() function?
                            
                                Can I get PyCharm to suppress a particular warning on a single line?
                            
                                How does one debug NaN values in TensorFlow?
                            
                                What does `**` mean in the expression `dict(d1, **d2)`?
                            
                                What are the advantages of using numpy.identity over numpy.eye?
                            
                                Unable to get local issuer certificate when using requests in python
                            
                                Run code after flask application has started
                            
                                How to get matplotlib figure size
                            
                                Python: 'Private' module in a package
                            
                                What's the exact usage of __reduce__ in Pickler
                            
                                Import statement inside class/function definition - is it a good idea?
                            
                                Hide traceback unless a debug flag is set
                            
                                Why is using thread locals in Django bad?
                            
                                Using print() (the function version) in Python2.x
                            
                                What are the differences between Python Dictionaries vs Javascript Objects?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Modern, high performance bloom filter in Python? [closed]

Tags:

python

jython

bloom-filter

Parand

People also ask

1 Answers

Ryan Cox

Recent Activity

Donate For Us