I'm getting the same error as in Missing an output location for shuffle when joining big DataFrames in Spark SQL. The recommendation there is to set MEMORY_AND_DISK and/or spark.shuffle.memoryFraction to 0. However, spark.shuffle.memoryFraction is deprecated in Spark >= 1.6.0, and setting MEMORY_AND_DISK shouldn't help if I'm not caching any RDD or DataFrame, right? Also, I'm getting lots of other WARN logs and task retries that lead me to think the job is not stable.
Therefore, my question is: how do you properly join big DataFrames in Spark SQL? More specific questions are:
So far I'm using this answer and this chapter as a starting point, and there are a few more Stack Overflow pages related to this subject. Yet I haven't found a comprehensive answer to this popular issue.
Thanks in advance.
Try to use broadcast joins wherever possible, and filter out rows that are irrelevant to the join key to avoid unnecessary data shuffling. For cases where you are confident that a shuffle hash join will outperform a sort-merge join, disable sort-merge join for those scenarios.
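A minimal Scala sketch of these two knobs, assuming Spark 2.x and hypothetical table names (facts, dims) and column names (active, dim_id):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("join-tuning").getOrCreate()
import spark.implicits._

// Hypothetical tables; substitute your own. Filter the small side before joining.
val facts = spark.table("facts")
val dims  = spark.table("dims").filter($"active" === true)

// Force a broadcast join when you know the right side fits in executor memory.
val joined = facts.join(broadcast(dims), Seq("dim_id"))

// Or raise the automatic broadcast threshold (in bytes); -1 disables it.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (100L * 1024 * 1024).toString)

// Let the planner pick shuffle hash join over sort-merge where applicable.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
```

Note that broadcast() is only a hint; Spark will still fall back to a shuffled join if the broadcast side is too large to materialize on the driver.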
Spark uses sort-merge joins to join large tables. This consists of hashing each row of both tables and shuffling rows with the same hash into the same partition. There the keys are sorted on both sides and the sort-merge algorithm is applied.
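You can see which physical join strategy the planner picked by inspecting the plan. A small sketch, assuming Spark 2.x (with broadcasting disabled so the join cannot be turned into a broadcast join):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("explain-join").getOrCreate()
import spark.implicits._

val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
val right = Seq((1, "x"), (2, "y")).toDF("id", "r")

// Disable automatic broadcasting so both sides are treated as "large".
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// The plan typically shows SortMergeJoin, with an Exchange (shuffle)
// and a Sort on each side, exactly the steps described above.
left.join(right, "id").explain()
```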
Those are a lot of questions. Allow me to answer them one by one:
Your number of executors is, most of the time, variable in a production environment; it depends on the available resources. The number of partitions is important when you are performing shuffles. Assuming that your data is not skewed, you can lower the load per task by increasing the number of partitions. A task should ideally take a couple of minutes: if a task takes too long, your container may get pre-empted and the work is lost; if it takes only a few milliseconds, the overhead of starting the task becomes dominant.
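As a rough sketch of sizing the shuffle partition count: the relevant setting is spark.sql.shuffle.partitions (default 200), and one common rule of thumb is to target on the order of 100–200 MB of shuffle input per task. The data sizes below are hypothetical placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-sizing").getOrCreate()

// Hypothetical: 500 GB of data gets shuffled, targeting ~150 MB per task.
val shuffleInputBytes = 500L * 1024 * 1024 * 1024
val targetPerTask     = 150L * 1024 * 1024
val partitions        = (shuffleInputBytes / targetPerTask).toInt  // ~3400 tasks

// Applies to all subsequent shuffles (joins, aggregations) in this session.
spark.conf.set("spark.sql.shuffle.partitions", partitions.toString)
```

The 100–200 MB target is only a heuristic; skewed keys or wide rows can still make individual tasks much heavier than this average.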
For the level of parallelism and tuning your executor sizes, I would like to refer you to the excellent guide by Cloudera: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
ORC and Parquet only encode the data at rest; when doing the actual join, the data is in Spark's in-memory format. Parquet has been gaining popularity since Netflix and Facebook adopted it and put a lot of effort into it. Parquet lets you store the data more efficiently and has some optimisations (such as predicate pushdown) that Spark uses.
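A short sketch of predicate pushdown with Parquet, assuming Spark 2.x; the path, table name, and year column are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("parquet-pushdown").getOrCreate()

// Write once in a columnar format...
spark.table("events").write.mode("overwrite").parquet("/tmp/events.parquet")

// ...then filters on the scan can be pushed into the Parquet reader, so row
// groups whose min/max statistics exclude the predicate are skipped entirely.
val recent = spark.read.parquet("/tmp/events.parquet").filter(col("year") >= 2020)

// Look for "PushedFilters" on the file scan node in the printed plan.
recent.explain()
```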
You should use the SQLContext instead of the HiveContext, since the HiveContext is deprecated. The SQLContext is more general and doesn't only work with Hive. (As of Spark 2.0, both are superseded by SparkSession.)
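In Spark 2.x the usual entry point looks like this sketch; enableHiveSupport() gives you the Hive metastore without a HiveContext (the table name is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("joins")
  .enableHiveSupport()  // optional: only needed for Hive metastore tables
  .getOrCreate()

val df = spark.sql("SELECT * FROM some_table")
```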
When performing registerTempTable, the data is stored within the SparkSession. This doesn't affect the execution of the join: what is stored is only the execution plan, which gets invoked when an action is performed (for example saveAsTable). When performing a saveAsTable, the data gets stored on the distributed file system.
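The lazy-plan-versus-action distinction can be sketched as follows, assuming Spark 2.x (where registerTempTable has become createOrReplaceTempView); the data and table names are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tempview-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, 10L), (2, 0L)).toDF("id", "amount")

// Only registers the lazy execution plan under a name; nothing runs yet.
df.createOrReplaceTempView("joined")

// The action triggers execution; saveAsTable writes the result to the
// distributed file system via the metastore.
spark.sql("SELECT * FROM joined WHERE amount > 0").write.saveAsTable("joined_filtered")
```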
Hope this helps. I would also suggest watching our Spark Summit talk about doing joins: https://www.youtube.com/watch?v=6zg7NTw-kTQ. It might give you some insights.
Cheers, Fokko