Spark with BloomFilter of billions of records causes Kryo serialization failed: Buffer overflow.

I have used the Breeze implementation of a Bloom filter in Apache Spark. My Bloom filter expects 200,000,000 keys, but I am hitting the exception below:

User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 5.0 failed 4 times, most recent failure: Lost task 1.3 in stage 5.0 (TID 161, SVDG0752.ideaconnect.com): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1 

I know that to avoid this I can increase the spark.kryoserializer.buffer.max value, but due to cluster resource limitations I can't increase it beyond 2 GB.
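For reference, raising that property programmatically looks roughly like this (a minimal sketch assuming a SparkConf-based setup; 1g is only an example value, and Spark itself rejects settings of 2048m or above, which is where the 2 GB ceiling comes from):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "1g") // example value; must stay below 2048m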

Below is the code:

import breeze.util.BloomFilter

val numOfBits = 2147483647   // Int.MaxValue bits
val numOfHashFun = 13
val bf = hierachyMatching.treeAggregate(new BloomFilter[String](numOfBits, numOfHashFun))(
  _ += _,   // add each record to the partition-local filter
  _ |= _)   // merge (bitwise OR) the partial filters

where hierachyMatching is an RDD[String] containing 200 million records.
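The numbers alone hint at why the buffer overflows; a quick back-of-the-envelope size check (plain arithmetic, not anything Spark reports):

val bits = 2147483647L               // numOfBits
val mb = bits / 8 / (1024 * 1024)    // bits -> bytes -> mebibytes, ≈ 255
println(s"≈ $mb MB per serialized partial filter, before Kryo overhead")

so every combine step in treeAggregate ships a roughly 256 MB bit set through Kryo.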

My Questions:

  • How can I tackle this exception without increasing the spark.kryoserializer.buffer.max value?
  • Is it possible to construct a Bloom filter with more than 2 billion bits on Spark with 6512 MB of driver memory, and if so, how?

Any ideas or suggestions related to this will be greatly appreciated. Thanks in advance.

asked Oct 30 '22 by mayur
1 Answer

Try setting spark.kryoserializer.buffer.max to 1g (or experiment with this property to find a better value) in spark-defaults.conf (or as an overridden property) and restart your Spark service; that should help.
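For example, the relevant lines in spark-defaults.conf would look like this (1g is just the suggested starting point; tune it for your cluster):

spark.serializer                  org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max   1g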

answered Nov 15 '22 by Artem