I get an error when using mllib RandomForest to train on my data. My dataset is huge and the default number of partitions is relatively small, so an exception is thrown indicating "Size exceeds Integer.MAX_VALUE". The original stack trace is as follows:
15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in stage 6.0 (TID 120, 10.215.149.47): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
    at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
    at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
    at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:146)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
Integer.MAX_VALUE is about 2 GB, so it seems that some partition grew beyond that size. I repartitioned my RDD into 1000 partitions so that each partition holds far less data than before. Finally, the problem was solved!
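For reference, the fix boiled down to a repartition before training; here is a rough sketch (the input path and RandomForest hyperparameters are illustrative, not my exact job):

    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.util.MLUtils

    // Load the training data (path is illustrative).
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/train.libsvm")

    // Spread the data over many more partitions so each stays well under 2 GB.
    val repartitioned = data.repartition(1000)

    // Train as before; these hyperparameters are placeholders.
    val model = RandomForest.trainClassifier(
      repartitioned,
      2,                 // numClasses
      Map[Int, Int](),   // categoricalFeaturesInfo
      100,               // numTrees
      "auto",            // featureSubsetStrategy
      "gini",            // impurity
      10,                // maxDepth
      32)                // maxBins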
So, my question is: why does a partition have this 2 GB size limit? It seems that there is no configuration setting for the limit in Spark.
Spark can run one concurrent task for every partition of an RDD (up to the number of cores in the cluster). If your cluster has 20 cores, you should have at least 20 partitions (in practice 2–3x more).
The ideal size of each partition is around 100–200 MB. Smaller partitions increase the number of tasks that can run in parallel, which can improve performance, but partitions that are too small cause scheduling overhead and increase GC time.
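As a rough sizing rule (the dataset size and the 128 MB target below are illustrative assumptions), you can derive a partition count from the data volume and the core count:

    // Aim for roughly 128 MB per partition (illustrative target).
    val totalSizeBytes = 500L * 1024 * 1024 * 1024   // assume a ~500 GB dataset
    val targetPartitionBytes = 128L * 1024 * 1024

    // Keep at least 2-3x the cluster's core count so every core stays busy.
    val numPartitions = math.max(
      (totalSizeBytes / targetPartitionBytes).toInt,
      sc.defaultParallelism * 3)

    val resized = rdd.repartition(numPartitions)   // rdd is the dataset being tuned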
Yes, while repartitioning, the data gets shuffled into the 1000 partitions.
Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
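The storage level chosen when caching controls that behaviour. A minimal sketch (the RDD name is illustrative):

    import org.apache.spark.storage.StorageLevel

    // MEMORY_ONLY (the default for cache()) recomputes partitions that don't fit in memory;
    // MEMORY_AND_DISK spills them to local disk instead of recomputing.
    val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)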
An RDD does not provide a schema view of the data and has no provision for handling structured data. Dataset and DataFrame do provide a schema view: a distributed collection of data organized into named columns. These are among the limitations of the RDD API in Apache Spark.
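For instance (a minimal sketch; sqlContext is the shell-provided SQLContext and the column names are illustrative), converting an RDD of tuples to a DataFrame attaches the named columns that the plain RDD lacks:

    // An RDD of tuples carries no column names or schema of its own.
    val peopleRdd = sc.parallelize(Seq(("Alice", 34), ("Bob", 45)))

    // The DataFrame view adds a schema with named, typed columns.
    import sqlContext.implicits._
    val peopleDf = peopleRdd.toDF("name", "age")
    peopleDf.printSchema()   // name: string, age: integer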
In the example below we limit our partitions to 100. A Spark DataFrame that originally has 1000 partitions will be reduced to 100 partitions without shuffling. By no shuffling we mean that each of the 100 new partitions is assembled from 10 of the existing partitions.
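A minimal sketch of that example (the DataFrame name and partition counts are illustrative):

    // df starts with 1000 partitions (assumed).
    println(df.rdd.getNumPartitions)         // 1000

    // coalesce() narrows the partitioning without a shuffle:
    // each of the 100 resulting partitions is built from 10 existing ones.
    val narrowed = df.coalesce(100)
    println(narrowed.rdd.getNumPartitions)   // 100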
Understanding Spark partitioning:
1. By default, Spark/PySpark creates partitions that are equal to the number of CPU cores in the machine.
2. The data of each partition resides on a single machine.
3. Spark/PySpark creates a task for each partition.
4. Spark shuffle operations move data from one partition to other partitions.
When you run Spark jobs on a Hadoop cluster, the default number of partitions is based on the following: on an HDFS cluster, by default, Spark creates one partition for each block of the file. In Hadoop version 1 the HDFS block size is 64 MB, and in Hadoop version 2 it is 128 MB.
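As a quick sanity check (the path is illustrative), the partition count of a freshly read HDFS file mirrors its block count:

    // A ~1 GB file stored with 128 MB blocks yields about 8 partitions.
    val lines = sc.textFile("hdfs:///data/big-file.txt")
    println(lines.getNumPartitions)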
The basic abstraction for blocks in Spark is a ByteBuffer, which unfortunately has a limit of Integer.MAX_VALUE (~2 GB).
It is a critical issue that prevents the use of Spark with very large datasets. Increasing the number of partitions can resolve it (as in the OP's case), but that is not always feasible, for instance when there is a long chain of transformations, part of which can grow the data (flatMap etc.), or in cases where the data is skewed.
The proposed solution is to come up with an abstraction like LargeByteBuffer, which can back a block with a list of ByteBuffers. This impacts the overall Spark architecture, so it has remained unresolved for quite a while.
The problem is that when using datastores like Cassandra, HBase, or Accumulo, the block size is based on the datastore splits (which can be over 10 GB). When loading data from these datastores you have to repartition immediately into thousands of partitions so you can operate on the data without blowing the 2 GB limit.
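A sketch of that pattern for HBase (the configuration object and partition count are illustrative; the same idea applies to any loader that produces an RDD):

    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    // hbaseConf is an illustrative, pre-built HBase/Hadoop Configuration.
    val raw = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    // Repartition immediately so no single partition exceeds the 2 GB block limit.
    val workable = raw.repartition(2000)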
Most people who use Spark are not really working with large data; to them, anything bigger than Excel or Tableau can hold counts as big data. They are mostly data scientists who work with clean data, or who use a sample size small enough to stay within the limit.
When processing large volumes of data, I end up having to go back to MapReduce and only use Spark once the data has been cleaned up. This is unfortunate; however, the majority of the Spark community has no interest in addressing the issue.
A simple solution would be to create an abstraction that uses a byte array by default, but allows a Spark job to be overloaded with a 64-bit data pointer to handle large jobs.