Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reducing with a bloom filter

I would like to get a fast approximate set membership, based on a String-valued function applied to a large Spark RDD of String Vectors (~1B records). Basically the idea would be to reduce into a Bloom filter. This bloom filter could then be broadcasted to the workers for further use.

More specifically, I currently have

rdd: RDD[Vector[String]]
f: Vector[String] => String
val uniqueVals = rdd.map(f).distinct().collect()
val uv = sc.broadcast(uniqueVals)

But uniqueVals is too large to be practical, and I would like to replace it with something of smaller (and known) size, i.e. a bloom filter.

My questions:

  • is it possible to reduce into a Bloom filter, or do I have to collect first, and then construct it in the driver?

  • is there a mature Scala/Java Bloom filter implementation available that would be suitable for this?

like image 223
mitchus Avatar asked Dec 25 '22 16:12

mitchus


1 Answers

Yes, Bloom filters can be reduced, because they have some nice properties (they are monoids). This means you can do all the aggregation operations in parallel, doing effectively just one pass over the data to construct the BloomFilter for each partition and then reduce those BloomFilters together to get a single BloomFilter that you can query for contains.

There are at least two implementations of BloomFilter in Scala and both seem mature projects (haven't actually used them in production). The first one is Breeze and the second one is Twitter's Algebird. Both contain implementations of different sketches and a lot more.

This is an example how to do that with Breeze:

import breeze.util.BloomFilter

val nums = List(1 to 20: _*).map(_.toString)
val rdd = sc.parallelize(nums, 5)

val bf = rdd.mapPartitions { iter =>
  val bf = BloomFilter.optimallySized[String](10000, 0.001)
  iter.foreach(i => bf += i)
  Iterator(bf)
}.reduce(_ | _)

println(bf.contains("5")) // true
println(bf.contains("31")) // false
like image 111
Grega Kešpret Avatar answered Jan 05 '23 09:01

Grega Kešpret