Consider a pair RDD with, say, 10 partitions, where the keys are not evenly distributed: nine of the partitions hold data belonging to a single key, say a, while the remaining keys, say b and c, live in the last partition only.
Now if I do a groupByKey on this RDD, my understanding is that all data for the same key will eventually end up in a single partition; in other words, data for one key will never be spread across multiple partitions. Please correct me if I am wrong.
If that is the case, there is a chance that the partition for key a grows to a size that does not fit in a worker's RAM. What will Spark do then? My assumption is that it will spill the data to the worker's disk. Is that correct, or does Spark handle such a situation differently?
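Roughly, the setup I have in mind looks like this (a minimal sketch; the key names and record counts are just placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("skewed-group").setMaster("local[*]"))

// A heavily skewed pair RDD: key "a" dominates, "b" and "c" are tiny.
val skewed = sc.parallelize(
  (1 to 1000000).map(i => ("a", i)) ++ Seq(("b", 1), ("c", 2)),
  numSlices = 10
)

// The question: what happens here when the group for "a" is huge?
val grouped = skewed.groupByKey()
```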
The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports in-memory computation: data is kept in memory as objects across jobs, and those objects can be shared between jobs. Sharing data in memory is 10 to 100 times faster than going through the network or disk.
Storage levels control where that data lives. MEMORY_ONLY, the default level, stores the RDD as deserialized Java objects in the JVM. MEMORY_AND_DISK additionally stores the partitions that don't fit in memory on disk and reads them from there when they're needed. MEMORY_ONLY_SER stores the RDD as serialized Java objects (one byte array per partition).
We can persist an RDD through the cache() and persist() methods. cache() stores the RDD in memory, while persist() lets us choose a storage level, so the RDD can be reused efficiently across parallel operations.
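As a small illustration (a minimal sketch with toy data, not tied to the question's RDD), persisting with an explicit storage level looks like this:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("persist-demo").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 1000, numSlices = 10).map(i => (i % 3, i))

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
val cached = rdd.cache()

// persist() lets you pick a level; MEMORY_AND_DISK keeps the partitions
// that don't fit in memory on disk and reads them back when needed.
val persisted = rdd.map(identity).persist(StorageLevel.MEMORY_AND_DISK)

persisted.count()  // the first action materializes and stores the RDD
```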
RDD APIs are available in Scala, Java, Python and R. Spark itself is implemented in Scala and internally uses the Akka actor framework to handle distributed state and Netty to handle networking. For Python, Spark uses Py4J, which allows Python programs to access Java objects in a remote JVM.
Does spark keep all elements (...) for a particular key in a single partition after groupByKey

Yes, it does. That is the whole point of the shuffle.
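One way to see this for yourself (a minimal sketch; the data and partition count are arbitrary) is to check which partition each key lands in after the shuffle:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("group-check").setMaster("local[*]"))

// Skewed input: key "a" dominates, "b" and "c" are tiny.
val pairs = sc.parallelize(Seq.fill(100)(("a", 1)) ++ Seq(("b", 2), ("c", 3)), numSlices = 10)

// After groupByKey each key shows up in exactly one output partition,
// with all of its values collected there.
pairs.groupByKey()
  .mapPartitionsWithIndex { (idx, iter) =>
    iter.map { case (key, values) => s"partition=$idx key=$key count=${values.size}" }
  }
  .collect()
  .foreach(println)
```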
the partition for key a can be of a size that may not fit in a worker's RAM. In that case what will Spark do
The size of a particular partition is not the biggest issue here. Partitions are represented by lazy Iterators and can easily hold data that exceeds the amount of available memory. The main problem is the non-lazy local data structure generated in the process of grouping: all values for a particular key are stored in memory as a CompactBuffer, so a single large group can result in an OOM. Even if each record separately fits in memory, you may still encounter serious GC issues.
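If you want to see that materialization directly, a small sketch like the following (toy data again; CompactBuffer is an internal Spark class, so we only inspect its name) prints the runtime type backing each group:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("compact-buffer").setMaster("local[*]"))
val pairs = sc.parallelize(Seq.fill(100)(("a", 1)) ++ Seq(("b", 2), ("c", 3)), numSlices = 10)

// The Iterable handed back for each key is backed by a fully materialized
// in-memory buffer, not a lazy iterator.
pairs.groupByKey()
  .mapValues(_.getClass.getName)   // org.apache.spark.util.collection.CompactBuffer
  .take(3)
  .foreach(println)
```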
In general, it is not safe to use PairRDDFunctions.groupByKey in this situation.

Note: you shouldn't extrapolate this to different implementations of groupByKey, though. In particular, both Spark Dataset and PySpark RDD.groupByKey use more sophisticated mechanisms.
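For comparison, here is a minimal Dataset sketch (names and data are arbitrary). With this API the grouping function receives a lazy Iterator, and the underlying sort-based execution can spill to disk, so a large group does not have to be held in memory all at once:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dataset-group").master("local[*]").getOrCreate()
import spark.implicits._

val ds = spark.createDataset(Seq.fill(100)(("a", 1)) ++ Seq(("b", 2), ("c", 3)))

// mapGroups hands each key's rows to the function as an Iterator
// rather than a pre-materialized buffer.
val counts = ds
  .groupByKey { case (key, _) => key }
  .mapGroups { (key, rows) => (key, rows.size) }

counts.show()
```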