I need to join
many DataFrames together based on some shared key columns. For a key-value RDD, one can specify a partitioner so that data points with the same key are shuffled to the same executor, which makes joining more efficient (if one has shuffle-related operations before the join). Can the same thing be done on Spark DataFrames or Datasets?
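For reference, here is a minimal sketch of the RDD-level approach the question refers to; the pair RDDs, their contents, and the partition count of 100 are hypothetical:
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// hypothetical pair RDDs keyed by userId
val left: RDD[(Int, String)]  = spark.sparkContext.parallelize(Seq((1, "alice"), (2, "bob")))
val right: RDD[(Int, Double)] = spark.sparkContext.parallelize(Seq((1, 1000.0), (2, 2000.0)))

// co-partition both sides so records with the same key land in the same partition
val partitioner = new HashPartitioner(100)
val leftPart  = left.partitionBy(partitioner).persist()
val rightPart = right.partitionBy(partitioner).persist()

// join() sees matching partitioners on both sides and avoids re-shuffling them
val joinedRdd = leftPart.join(rightPart)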
The general recommendation for Spark is to have about 4x as many partitions as there are cores available to the application in the cluster; for the upper bound, each task should still take at least 100 ms to execute (if tasks finish faster than that, the partitions are too small).
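As a rough illustration of that sizing rule (the core count and partition number here are hypothetical; tune them by measuring actual task durations):
// hypothetical application with 8 cores: start around 4 x 8 = 32 partitions
spark.conf.set("spark.sql.shuffle.partitions", "32")

// an existing DataFrame can also be repartitioned explicitly
val df = spark.read.load("/path/to/data").repartition(32)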
Step 1, Shuffle: the data from both join tables is partitioned on the join key, shuffling records across partitions so that rows with the same join key end up in the same partition. Step 2, Hash Join: a classic single-node hash join algorithm is then performed on the data within each partition.
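As an illustration, Spark 3.x lets you nudge the planner toward this strategy with a join hint; the DataFrames and the join column below are hypothetical:
// hypothetical inputs joined on userId
val orders    = spark.read.load("/path/to/orders")
val customers = spark.read.load("/path/to/customers")

// request a shuffle hash join instead of the default sort-merge join
val joined = orders.join(customers.hint("shuffle_hash"), "userId")

// the physical plan should show an Exchange (the shuffle) feeding a ShuffledHashJoin
joined.explain()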
Spark/PySpark partitioning is a way to split the data into multiple partitions so that transformations can run on those partitions in parallel, completing the job faster. You can also write partitioned data out to a file system (as multiple sub-directories) for faster reads by downstream systems.
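For example, a minimal sketch of writing partitioned output (the path and the eventDate column are hypothetical):
val events = spark.read.load("/path/to/events")

// writes one sub-directory per distinct value of eventDate,
// e.g. /path/to/events_by_date/eventDate=2021-01-01/
events.write.partitionBy("eventDate").parquet("/path/to/events_by_date")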
To make the computation faster, reduce the number of partitions of the input DataFrames before the cross join, so that the resulting cross joined DataFrame doesn't have too many partitions.
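A sketch of that idea, assuming two hypothetical DataFrames (the output of a cross join can have a partition count on the order of left partitions times right partitions, so shrinking the inputs first keeps the result manageable):
val left  = spark.read.load("/path/to/left")
val right = spark.read.load("/path/to/right")

// reduce each side to a handful of partitions before the cartesian product
val crossed = left.coalesce(8).crossJoin(right.coalesce(8))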
You can repartition
a DataFrame after loading it if you know you'll be joining it multiple times
import spark.implicits._ // needed for the 'userId Symbol-to-Column syntax

// addresses and salary are assumed to be DataFrames loaded the same way as users
val users = spark.read.load("/path/to/users").repartition('userId)
val joined1 = users.join(addresses, "userId")
joined1.show() // <-- 1st shuffle for repartition
val joined2 = users.join(salary, "userId")
joined2.show() // <-- skips shuffle for users since it's already been repartitioned
So it'll shuffle the data once and then reuse the shuffle files when joining subsequent times.
However, if you know you'll be repeatedly shuffling data on certain keys, your best bet would be to save the data as bucketed tables. This will write the data out already pre-hash partitioned, so when you read the tables in and join them you avoid the shuffle. You can do so as follows:
// you need to pick a number of buckets that makes sense for your data
users.write.bucketBy(50, "userId").saveAsTable("users")
addresses.write.bucketBy(50, "userId").saveAsTable("addresses")
val users = spark.read.table("users")
val addresses = spark.read.table("addresses")
val joined = users.join(addresses, "userId")
joined.show() // <-- no shuffle since tables are co-partitioned
In order to avoid a shuffle, the tables have to use the same bucketing (e.g. same number of buckets and joining on the bucket columns).
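One way to confirm the buckets actually line up is to inspect the physical plan; if bucketing is being used, no Exchange (shuffle) should appear on either side of the join:
joined.explain() // look for the absence of Exchange operators above the join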