I have used the Breeze implementation of a Bloom filter in Apache Spark. My Bloom filter expects 200,000,000 keys, but I am facing the exception below:
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 5.0 failed 4 times, most recent failure: Lost task 1.3 in stage 5.0 (TID 161, SVDG0752.ideaconnect.com): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1
I know that to avoid this I can increase the spark.kryoserializer.buffer.max value, but due to cluster resource limitations I can't increase it beyond 2 GB.
Below is the code:
val numOfBits = 2147483647   // Int.MaxValue, i.e. 2^31 - 1 bits
val numOfHashFun = 13
val bf = hierachyMatching.treeAggregate(new BloomFilter[String](numOfBits, numOfHashFun))(
  _ += _, _ |= _)
where hierachyMatching is an RDD[String] containing 200M records.
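As an aside, with 2^31 - 1 bits and 200M expected keys, 13 hash functions is above the textbook optimum of (m / n) * ln 2 ≈ 7, and extra hash functions slightly raise the false-positive rate at this fill level. A minimal sketch of the standard Bloom filter sizing math (plain Scala, no Spark required; `BloomSizing` is a hypothetical helper name, not part of Breeze):

```scala
import scala.math.{exp, log, pow, round}

// Standard Bloom filter formulas (library-independent):
//   optimal hash count:      k = (m / n) * ln 2
//   false-positive estimate: p ≈ (1 - e^(-k * n / m))^k
object BloomSizing {
  def optimalK(mBits: Long, n: Long): Int =
    math.max(1, round((mBits.toDouble / n) * log(2)).toInt)

  def falsePositiveRate(mBits: Long, n: Long, k: Int): Double =
    pow(1.0 - exp(-k.toDouble * n / mBits), k)

  def main(args: Array[String]): Unit = {
    val m = 2147483647L   // numOfBits from the question
    val n = 200000000L    // 200M keys
    println(s"optimal k = ${optimalK(m, n)}")                   // 7
    println(f"p with k=13: ${falsePositiveRate(m, n, 13)}%.4f") // ~0.0101
    println(f"p with k=7:  ${falsePositiveRate(m, n, 7)}%.4f")  // ~0.0058
  }
}
```

So roughly halving the hash count here would modestly reduce both the false-positive rate and the per-record hashing cost, though it does not change the serialized size of the filter itself.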
Any ideas or suggestions related to this will be greatly appreciated. Thanks in advance.
Try setting spark.kryoserializer.buffer.max to 1g (or experiment with this property to select a better value) in spark-defaults.conf (or your overridden properties) and restart your Spark service; it should help.
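For example, assuming the default config location, the relevant entries might look like the following (1g is a starting point to tune against your workload, not a fixed recommendation):

```
# $SPARK_HOME/conf/spark-defaults.conf
spark.serializer                  org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max   1g
```

Alternatively, you can set it per job with --conf spark.kryoserializer.buffer.max=1g on spark-submit, without touching the cluster-wide configuration.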