Spark with BloomFilter of billions of records causes Kryo serialization failed: Buffer overflow.

I have used the Breeze implementation of a Bloom filter in Apache Spark. My Bloom filter expects 200,000,000 keys, but I am hitting the exception below:

User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 5.0 failed 4 times, most recent failure: Lost task 1.3 in stage 5.0 (TID 161, SVDG0752.ideaconnect.com): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1 

I know that to avoid this I can increase the spark.kryoserializer.buffer.max value, but due to cluster resource limitations I can't increase it beyond 2 GB.
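For reference, raising that property programmatically looks roughly like this (a minimal sketch assuming a SparkConf-based setup; 1g is only an example value, and Spark itself rejects settings of 2048m or above, which is where the 2 GB ceiling comes from):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "1g") // example value; must stay below 2048m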

Below is the code:

import breeze.util.BloomFilter

val numOfBits = 2147483647   // Int.MaxValue bits
val numOfHashFun = 13
val bf = hierachyMatching.treeAggregate(new BloomFilter[String](numOfBits, numOfHashFun))(
  _ += _,   // add each record to the partition-local filter
  _ |= _)   // merge (bitwise OR) the partial filters

where hierachyMatching is an RDD[String] containing 200 million records.
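The numbers alone hint at why the buffer overflows; a quick back-of-the-envelope size check (plain arithmetic, not anything Spark reports):

val bits = 2147483647L               // numOfBits
val mb = bits / 8 / (1024 * 1024)    // bits -> bytes -> mebibytes, ≈ 255
println(s"≈ $mb MB per serialized partial filter, before Kryo overhead")

so every combine step in treeAggregate ships a roughly 256 MB bit set through Kryo.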

My Questions:

  • How can I tackle this exception without increasing the spark.kryoserializer.buffer.max value?
  • Is it possible to construct a Bloom filter with more than 2 billion bits on Spark with 6512 MB of driver memory, and if so, how?

Any ideas or suggestions related to this will be greatly appreciated. Thanks in advance.

asked Oct 30 '22 by mayur
1 Answer

Try setting spark.kryoserializer.buffer.max to 1g (or experiment with this property to find a better value) in spark-defaults.conf (or as an overridden property) and restart your Spark service; that should help.
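For example, the relevant lines in spark-defaults.conf would look like this (1g is just the suggested starting point; tune it for your cluster):

spark.serializer                  org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max   1g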

answered Nov 15 '22 by Artem