Folks,
I am running into this issue while trying to join two large DataFrames (100GB+ each) in Spark on a single key identifier per row.
I am using Spark 1.6 on EMR, and here is what I am doing:
val df1 = sqlContext.read.json("hdfs:///df1/")
val df2 = sqlContext.read.json("hdfs:///df2/")
// clean up and filter steps later
df1.registerTempTable("df1")
df2.registerTempTable("df2")
val df3 = sqlContext.sql("select df1.*, df2.col1 from df1 left join df2 on df1.col3 = df2.col4")
df3.write.json("hdfs:///df3/")
This is the gist of what I am doing; there are additional clean-up and filter steps in between before df1 and df2 are finally joined.
The error I am seeing is:
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:860)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:127)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:115)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:129)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:136)
at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:503)
at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:420)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:625)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Configuration and references:
I am using a 13-node cluster with 60GB per node, with executor and driver memory (plus overheads) set accordingly. Things I have tried adjusting:
spark.sql.broadcastTimeout
spark.sql.shuffle.partitions
I have also tried using a bigger cluster, but it did not help. This link says this error is thrown when a shuffle partition's size exceeds 2GB, but I have tried increasing the number of partitions to a much higher value and still had no luck.
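For reference, here is roughly how I have been setting those values in code (the numbers are illustrative, not the exact ones from my job):
// Assumed sketch: tuning the shuffle partition count and broadcast timeout on the SQLContext
sqlContext.setConf("spark.sql.shuffle.partitions", "2000") // default is 200
sqlContext.setConf("spark.sql.broadcastTimeout", "1200")   // in seconds, default is 300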
I suspect this could be related to lazy evaluation: when I chain 10 operations on a DataFrame, they are only executed at the last step. I tried adding .persist() with various storage levels on the DataFrames, but it still does not succeed. I have also tried dropping the temp tables and clearing out all the earlier DataFrames to clean up.
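For example, one of the variants I tried looked roughly like this (the storage level shown is just one of several I experimented with):
import org.apache.spark.storage.StorageLevel
// Persist the cleaned DataFrame and force materialization before the join
val df1Persisted = df1.persist(StorageLevel.MEMORY_AND_DISK_SER)
df1Persisted.count() // triggers the computation so the persisted blocks actually get written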
However, the code works if I break it into two parts: write the final temp data (the two DataFrames) to disk and exit, then restart and only join the two DataFrames, as in the sketch below.
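Roughly, the working two-part version looks like this (paths are placeholders for my actual intermediate locations):
// Job 1: write the cleaned/filtered intermediates to disk, then exit
df1.write.json("hdfs:///tmp/df1_clean/")
df2.write.json("hdfs:///tmp/df2_clean/")

// Job 2 (a separate spark-submit): read the intermediates back and do only the join
val a = sqlContext.read.json("hdfs:///tmp/df1_clean/")
val b = sqlContext.read.json("hdfs:///tmp/df2_clean/")
a.registerTempTable("df1")
b.registerTempTable("df2")
val df3 = sqlContext.sql("select df1.*, df2.col1 from df1 left join df2 on df1.col3 = df2.col4")
df3.write.json("hdfs:///df3/")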
Earlier, I was getting this error:
Exception in thread "main" java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.sql.execution.joins.BroadcastHashOuterJoin.doExecute(BroadcastHashOuterJoin.scala:113)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:46)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrame.toJSON(DataFrame.scala:1724)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
But when I adjusted spark.sql.broadcastTimeout, I started getting the first error.
I would appreciate any help with this; I can add more info if needed.
In Spark, you can't have a shuffle block larger than 2GB. This is because Spark stores shuffle blocks as a ByteBuffer, which is allocated like this:
ByteBuffer.allocate(int capacity)
Since ByteBuffers are limited to Integer.MAX_VALUE bytes (2GB), so are shuffle blocks. The solution is to increase the number of partitions, either via spark.sql.shuffle.partitions
in Spark SQL, or via rdd.repartition() or rdd.coalesce()
for RDDs, so that each partition stays below 2GB.
You mentioned that you tried to increase the number of partitions and it still failed. Can you check whether the partition size was still > 2GB? Just make sure the number of partitions you specify is large enough to keep each block under 2GB.
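For example, something along these lines (the partition count is illustrative; derive it from your data size so each partition stays well under 2GB):
// Repartition both sides before the join so no single shuffle block exceeds 2GB
val df1Repart = df1.repartition(4000)
val df2Repart = df2.repartition(4000)
// For plain RDDs, the equivalent is rdd.repartition(4000) or rdd.coalesce(4000)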