How does Spark schedule a join?

I am joining two RDDs rddA and rddB.

rddA has 100 partitions and rddB has 500 partitions.

I am trying to understand the mechanics of the join operation. By default, regardless of the order of the join, I end up with the same partition structure; i.e. rddA.join(rddB) and rddB.join(rddA) yield the same number of partitions, and from observation it uses the smaller partition count, 100. I am aware that I can increase the number of partitions by using rddA.join(rddB, 500), but I am more interested in what takes place under the hood and why the lower count is chosen. From observation, even if I re-partition the small RDD, its partitioning will still be used; does Spark do any heuristic analysis regarding the key size?
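
To make this concrete, here is a minimal local sketch of the kind of setup I am describing (the toy data, the partitioner placement and the local master are illustrative, not my real job; only the partition counts match):

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("join-test").setMaster("local[*]"))

    // rddA carries a HashPartitioner(100); rddB has 500 partitions but no partitioner
    val rddA = sc.parallelize(1 to 100000).map(i => (i % 1000, i))
                 .partitionBy(new HashPartitioner(100))
    val rddB = sc.parallelize(1 to 100000, 500).map(i => (i % 1000, i))

    println(rddA.join(rddB).partitions.length)      // 100: rddA's existing partitioner is reused
    println(rddB.join(rddA).partitions.length)      // 100: order makes no difference
    println(rddA.join(rddB, 500).partitions.length) // 500: only an explicit count overrides it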

Another problem I have is the level of skew I get. My smallest partition ends up with 3,314 entries and the largest ends up with 1,139,207, out of a total of 599,911,729 keys. Both RDDs are using the default partitioner, so how is the data shuffle decided? I vaguely recall reading that if one RDD has a partitioner set, then it is that partitioner that will be used. Is this the case? Is it "recommended" to do this?

Finally, note that both of my RDDs are relatively big (~90GB each), so a broadcast join would not help. Instead, any insight into what the join operation does under the hood would probably be the way to go.

PS. Any details on the mechanics of left and right joins would be an added bonus :)

asked May 23 '15 by Ioannis Deligiannis
People also ask

How does Apache spark work internally?

The entire resource allocation and the tracking of jobs and tasks are performed by the cluster manager. As soon as you do a Spark submit, your user program and the other configuration you specified are copied onto all the available nodes in the cluster, so that the program can be read locally on every worker node.

Does Spark require shuffling?

As a result, data rows can move between worker nodes when their source partition and target partition reside on different machines. Spark doesn't move data between nodes randomly: shuffling is a time-consuming operation, so it happens only when there is no other option.

Is Union a shuffle operation in Spark?

Since union doesn't move any data around, it is considered an efficient method. If rdd1 has 10 partitions and rdd2 has 20 partitions, then rdd1.union(rdd2) will have 30 partitions: the partitions of the two RDDs placed one after the other. This is simply a bookkeeping change; no shuffling is involved, as the sketch below illustrates.
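
A minimal sketch of that bookkeeping (the toy RDDs and local master are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("union-test").setMaster("local[*]"))

    val rdd1 = sc.parallelize(1 to 100, 10)      // 10 partitions
    val rdd2 = sc.parallelize(1 to 200, 20)      // 20 partitions

    // union concatenates the two partition lists; no shuffle is planned
    println(rdd1.union(rdd2).partitions.length)  // 30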


1 Answer

Although I have not yet managed to explain how partitioning is derived, I did find out how data are shuffled (which was my initial problem). A join has a few side-effects:

Shuffling/Partitioning: Spark will hash-partition the 'RDD' keys and move/distribute them among the 'Workers'. All values for a given key (e.g. 5) will end up in a single 'Worker'/JVM. This means that if your 'join' has a 1..N relationship and N is heavily skewed, you will end up with skewed partitions and JVM heaps (i.e. one 'Partition' might hold Max(N) values and another Min(N)). The only way to avoid this is to use a 'Broadcast' if possible, or to endure this behaviour. Since your data will initially be evenly distributed, the amount of shuffling will depend on the key hash.
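
A small sketch of the partitioner side of this (the partition count is arbitrary; a HashPartitioner is what a pair-RDD join falls back to when neither side already has a partitioner):

    import org.apache.spark.HashPartitioner

    val p = new HashPartitioner(100)

    // Every occurrence of a key, on either side of the join, maps to the
    // same partition index, hence the same 'Worker'/JVM after the shuffle.
    println(p.getPartition(5))  // 5: Int keys hash to themselves, taken mod 100
    // So a key with 1,000,000 values puts all 1,000,000 rows in one partition.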

Re-partitioning: following a "skewed" join, calling 'repartition' seems to re-distribute the data evenly among partitions. So this is a good thing to do if you have unavoidable skew. Note though that this transformation triggers a heavy shuffle, but the operations that follow will be much faster. The downside to this is uncontrollable Object creation (see below).
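
As a sketch (reusing the question's rddA/rddB; the partition count is an assumption):

    // joined may be heavily skewed if a few keys carry most of the values
    val joined = rddA.join(rddB)

    // repartition does a full shuffle that ignores keys, so the 500 resulting
    // partitions come out roughly equal in size; note the result no longer has
    // a partitioner, i.e. values for one key may span several partitions
    val rebalanced = joined.repartition(500)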

Object creation/Heap pollution: you managed to join your data and think that repartitioning would be a good idea to re-balance your cluster, but for some reason 'repartition' triggers an 'OOME'. What happens is that the originally joined data re-uses joined Objects. When you trigger 'repartition', or any other 'Action' that involves shuffling (e.g. an extra join, or a 'groupBy' followed by an 'Action'), data gets serialized, so you lose the Object re-use. Once Objects are de-serialized, they are new instances. Also note that since the re-use is lost during serialization, the shuffle will be quite heavy. So, in my case, a 1..1000000 join (where 1 is my 'heavy' object) will fail following any action that triggers a shuffle.

Workarounds/Debug:

  1. I used 'mapPartitionsWithIndex' to debug partition sizes by returning a single-item 'Iterable' with the count of each partition (see the first sketch after this list). This is very useful, as you can see the effect of 'repartition' and the state of your partitions after an 'Action'.
  2. You can use 'countByKeyApprox' or 'countByKey' on your join RDDs to get a feel for the cardinality, and then apply the join in two steps: use a 'Broadcast' for your high-cardinality keys and a 'join' for the low-cardinality keys (see the second sketch after this list). Wrapping these operations in an 'rdd.cache()' & 'rdd.unpersist()' block will speed this process up significantly. Though this might complicate your code a little, it will provide much better performance, especially if you do subsequent operations. Also note that if you use the 'Broadcast' in every 'map' to do a lookup, you will also significantly reduce the shuffle size.
  3. Calling 'repartition' or other operations that affect the number of partitions can be very useful, but be aware that a (randomly chosen) large number of partitions will cause more skew, as your large sets for a given key will create large partitions while the other partitions will have a small size or 0. Creating a debug method to get the size of each partition will help you pick a good size.
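
A sketch of the debug helper from point 1 (the method name is mine, not from the original answer):

    import org.apache.spark.rdd.RDD

    // Returns one (partitionIndex, recordCount) pair per partition.
    // collect() is safe here: the result is one tiny record per partition.
    def partitionSizes[T](rdd: RDD[T]): Array[(Int, Long)] =
      rdd.mapPartitionsWithIndex { (idx, it) =>
        var n = 0L
        it.foreach(_ => n += 1)
        Iterator((idx, n))
      }.collect()

    // e.g. inspect skew before and after a repartition
    partitionSizes(rddA.join(rddB)).foreach(println)
    partitionSizes(rddA.join(rddB).repartition(500)).foreach(println)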
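
And a sketch of the two-step join from point 2. The 'threshold' value is an assumption to tune per dataset, and this version assumes the non-skewed side (rddB here) has a single value per hot key, so its hot slice fits in a broadcast map:

    val threshold = 100000L  // assumption: keys above this count are "hot"

    rddA.cache()  // rddA is scanned several times below

    // Step 0: find the hot keys on the skewed side (driver-side Map[K, Long])
    val hotKeys = rddA.countByKey()
      .filter { case (_, n) => n > threshold }
      .keySet

    // Step 1: broadcast rddB's rows for the hot keys, then "join" them map-side
    val hotB = sc.broadcast(
      rddB.filter { case (k, _) => hotKeys.contains(k) }.collect().toMap)

    val hotJoined = rddA
      .filter { case (k, _) => hotKeys.contains(k) }
      .flatMap { case (k, a) => hotB.value.get(k).map(b => (k, (a, b))) }

    // Step 2: ordinary shuffle join for the well-behaved long tail
    val coldJoined = rddA
      .filter { case (k, _) => !hotKeys.contains(k) }
      .join(rddB.filter { case (k, _) => !hotKeys.contains(k) })

    val result = hotJoined.union(coldJoined)
    rddA.unpersist()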
answered Sep 25 '22 by Ioannis Deligiannis