Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient implementation of a Bloom filter in C?

Tags:

c

bloom-filter

This question has been asked previously but there was no answer for it at that time so I decided to ask it again.

I need an efficient implementation of a Bloom filter in C (not C++). If there is no such thing available, I would not mind implementing one, if given some good reference so that it doesn't take too much of my time.

I want to use this data structure for inserts and tests in a ratio (1:20k), so primarily it is test-intensive. The data to be tested is 64 bit integers.

like image 705
Aman Deep Gautam Avatar asked Jun 13 '12 02:06

Aman Deep Gautam


People also ask

How is Bloom filter implemented?

Implementation. An empty Bloom filter is a bit array of m bits, all set to 0. There are also k different hash functions, each of which maps a set element to one of the m bit positions. To add an element, feed it to the hash functions to get k bit positions, and set the bits at these positions to 1.

Why is Bloom filter space efficient?

The advantage of a Bloom filter over the established dictionary structures is space efficiency. A Bloom filter needs only a constant number of bits per (prospective) element, while keeping the FPR constant, independent of the size of the elements' universe.

What is Bloom filter explain Bloom filter process with neat diagram?

A Bloom filter is defined as a data structure designed to identify of a element's presence in a set in a rapid and memory efficient manner. A specific data structure named as probabilistic data structure is implemented as bloom filter.

How Bloom filter is useful for big data analytics explain with one example?

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. For example, checking availability of username is set membership problem, where the set is the list of all registered username.


2 Answers

I have a stand-alone plain C library here which may be of use: https://github.com/jvirkki/libbloom

like image 139
Jyri J. Virkki Avatar answered Oct 04 '22 23:10

Jyri J. Virkki


Not to do too much self-promotion, but I've written a plugin for the Geany editor/IDE that filters out duplicate text lines, it uses a Bloom filter.

The implementation is in C, and you can find it right here on GitHub. It's GPL v3, so depending on your exact needs you may or may not be able to use it.

Some notes about my implementation:

  • It's designed to filter strings, and doesn't abstract the key type. This means you're going to have to modify the key handling to suit your needs.
  • It supports un-characteristic semantics, you can actually use it for totally non-probabilistic existance-testing if you want to (see the BloomContains callback function pointer used by bloom_filter_new()). Just pass NULL to get a "pure" filter.
  • The string hash function is MurmurHash2 by Austin Appleby. I evaluated the more current MurmurHash3, but version 2 was easier to work with.
  • To fit in the Geany eco system, this code uses GLib types throughout.

It hasn't been heavily tuned for performance, but should be okay. I would appreciate any feedback you might have after testing it, of course!

like image 26
unwind Avatar answered Oct 05 '22 00:10

unwind