Does Spark keep all elements of an RDD[K,V] for a particular key in a single partition after "groupByKey", even if the data for a key is very large?

Consider I have a pair RDD of, say, 10 partitions, but the keys are not evenly distributed: the data in 9 of the partitions belongs to a single key, say a, and the remaining keys, say b and c, are in the last partition only. This is represented by the figure below.

(figure: 10 partitions, nine holding only key a, the last holding keys b and c)
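To make the setup concrete, here is a minimal Scala sketch that builds a pair RDD skewed the way the figure describes. All names and sizes are illustrative, not taken from a real workload:

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch, assuming local mode; keys and counts are made up.
    val sc = new SparkContext(new SparkConf().setAppName("skew-demo").setMaster("local[4]"))

    val hot  = sc.parallelize(1 to 9000).map(i => ("a", i))                   // dominant key
    val cold = sc.parallelize(1 to 100).flatMap(i => Seq(("b", i), ("c", i))) // minority keys

    val pairs = (hot union cold).repartition(10)   // 10 partitions, skewed toward "a"

    // Show how many records of each key land in each partition.
    pairs.mapPartitionsWithIndex { (idx, it) =>
      it.toSeq.groupBy(_._1).map { case (k, vs) => (idx, k, vs.size) }.iterator
    }.collect().sortBy(t => (t._1, t._2)).foreach(println)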

Now if I do a groupByKey on this RDD, my understanding is that all data for the same key will eventually end up in a single partition, i.e., no key's data will ever be spread across multiple partitions. Please correct me if I am wrong.

If that is the case, then there is a chance that the partition for key a will be of a size that does not fit in a worker's RAM. What will Spark do in that case? My assumption is that it will spill the data to the worker's disk. Is that correct, or does Spark handle such situations differently?

asked Sep 18 '16 by Tom Sebastian


People also ask

How does Spark work with RDD?

The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports in-memory computation. Spark stores state in memory as an object across jobs, and that object is shareable between those jobs. Data sharing in memory is 10 to 100 times faster than sharing over the network or via disk.

When using the default storage level for persist(), what happens when an RDD does not fit in the cluster's memory?

MEMORY_ONLY is the default level: the RDD is stored as deserialized Java objects in the JVM, and partitions that do not fit in memory are simply not cached and are recomputed on the fly when they are needed. With MEMORY_AND_DISK, the partitions that don't fit in memory are instead stored on disk and read from there when needed. MEMORY_ONLY_SER stores the RDD as serialized Java objects (one byte array per partition).
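As a sketch of the difference, assuming an existing SparkContext sc and a hypothetical input path:

    import org.apache.spark.storage.StorageLevel

    // MEMORY_ONLY, the default, silently skips caching partitions that don't
    // fit and recomputes them from lineage; MEMORY_AND_DISK spills the
    // overflow partitions to local disk instead.
    val events = sc.textFile("hdfs:///data/events.log")   // hypothetical path
      .persist(StorageLevel.MEMORY_AND_DISK)

    events.count()   // first action materializes the cache, spilling what doesn't fit
    events.count()   // later actions read partitions back from memory or disk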

Which methods does an RDD use in order to persist data in memory?

We can persist an RDD through the cache() and persist() methods. With cache() the RDD is stored in memory; a persisted RDD can then be reused efficiently across parallel operations.
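A small sketch of how the two methods relate, again assuming an existing SparkContext sc:

    import org.apache.spark.storage.StorageLevel

    val nums = sc.parallelize(1 to 1000000)

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
    val cached = nums.cache()

    // persist() accepts any level, e.g. serialized storage for a smaller footprint:
    val serialized = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY_SER)

    // Free the storage once the RDD is no longer reused:
    cached.unpersist()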

In which languages does Spark provide a full implementation of key-value based RDDs?

RDDs are implemented in Scala, Java, Python and R. Spark itself is implemented in Scala, internally using the Akka actor framework to handle distributed state and Netty to handle networking. For Python, Spark uses Py4J, which allows Python programs to access Java objects in a remote JVM.


1 Answer

Does Spark keep all elements (...) for a particular key in a single partition after groupByKey

Yes, it does. That is the whole point of the shuffle.
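One way to check this empirically (reusing the pairs RDD sketched in the question) is to count how many distinct partitions each key appears in after the shuffle; every count should be exactly 1:

    val grouped = pairs.groupByKey()

    val partitionsPerKey = grouped
      .mapPartitionsWithIndex { (idx, it) => it.map { case (k, _) => (k, idx) } }
      .distinct()             // one (key, partition) pair per combination
      .mapValues(_ => 1)
      .reduceByKey(_ + _)     // number of distinct partitions holding each key

    partitionsPerKey.collect().foreach(println)   // expect (a,1), (b,1), (c,1)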

the partition for key a will be of a size that does not fit in a worker's RAM. What will Spark do in that case

The size of a particular partition is not the biggest issue here. Partitions are represented by lazy Iterators and can easily stream data that exceeds the amount of available memory. The main problem is the non-lazy local data structure generated in the process of grouping.

All values for a particular key are stored in memory as a CompactBuffer, so a single large group can result in an OOM error. Even if each record separately fits in memory, you may still encounter serious GC issues.
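To illustrate, a sketch that again reuses pairs: even when only an aggregate is needed, groupByKey first materializes the full group, while a map-side combining operator such as reduceByKey never builds that buffer:

    // Buffers every value of a key in one CompactBuffer before .size runs:
    val sizesViaGroup = pairs.groupByKey().mapValues(_.size)

    // Combines partial counts map-side; no per-key buffer is ever materialized:
    val sizesViaReduce = pairs.mapValues(_ => 1L).reduceByKey(_ + _)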

In general:

  • It is safe, although not optimal performance-wise, to repartition data where the amount of data assigned to a partition exceeds the amount of available memory.
  • It is not safe to use PairRDDFunctions.groupByKey in the same situation (see the sketch after this list).
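A sketch of that contrast, under the same assumptions as the earlier snippets:

    // Safe: downstream operators consume each partition as a lazy Iterator,
    // so an oversized partition streams through without being held in memory.
    pairs
      .repartition(10)
      .mapPartitions(it => Iterator.single(it.length))   // counts lazily
      .collect()

    // Not safe: the partition receiving key "a" must hold all of a's values
    // in a single in-memory CompactBuffer.
    pairs.groupByKey().count()   // can OOM if one key's values exceed executor memory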

Note: You shouldn't extrapolate this to other implementations of groupByKey, though. In particular, both Spark Dataset and PySpark RDD.groupByKey use more sophisticated mechanisms.
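For example, a sketch of the Dataset variant (the SparkSession setup is illustrative, and pairs is reused from above): the values handed to mapGroups arrive as an Iterator rather than a pre-built in-memory buffer, which is what leaves room for the more sophisticated behaviour mentioned above:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ds-demo").master("local[4]").getOrCreate()
    import spark.implicits._

    val ds = pairs.toDS()   // Dataset[(String, Int)]

    val counts = ds
      .groupByKey { case (k, _) => k }
      .mapGroups { (key, values) => (key, values.length) }   // values is a lazy Iterator

    counts.show()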

answered Nov 10 '22 by zero323