Cleanest, most efficient syntax to perform DataFrame self-join in Spark

In standard SQL, when you join a table to itself, you can create aliases for the tables to keep track of which columns you are referring to:

SELECT a.column_name, b.column_name... FROM table1 a, table1 b WHERE a.common_field = b.common_field;

There are two ways I can think of to achieve the same thing using the Spark DataFrame API:

Solution #1: Rename the columns

There are a couple of different methods for this in the answers to this question. This one just renames all the columns with a specific suffix:

df.toDF(df.columns.map(_ + "_R"):_*)

For example you can do:

df.join(df.toDF(df.columns.map(_ + "_R"):_*), $"common_field" === $"common_field_R")

Solution #2: Copy the reference to the DataFrame

Another simple solution is to just do this:

val df: DataFrame = ....
val df_right = df

df.join(df_right, df("common_field") === df_right("common_field"))

Both of these solutions work, and I could see each being useful in certain situations. Are there any internal differences between the two I should be aware of?

825

asked Mar 27 '16 14:03

David Griffin

1 Answer

There are at least two different ways you can approach this: either by aliasing:

df.as("df1").join(df.as("df2"), $"df1.foo" === $"df2.foo")

or using name-based equality joins:

// Note that it will result in ambiguous column names
// so using aliases here could be a good idea as well.
// df.as("df1").join(df.as("df2"), Seq("foo"))

df.join(df, Seq("foo"))
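To make the aliasing approach concrete, here is a minimal sketch that joins a small DataFrame to itself and then selects columns from each side through the aliases. It assumes Spark 2.x or later (`SparkSession`); the object name, the helper `joinOnFoo`, the sample data, and the `bar` column are hypothetical and only there for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object AliasSelfJoin {
  // Hypothetical helper: self-join on "foo", using aliases to tell the two sides apart.
  def joinOnFoo(df: DataFrame): DataFrame =
    df.as("df1")
      .join(df.as("df2"), col("df1.foo") === col("df2.foo"))
      // Alias-qualified names pick the side each column comes from.
      .select(col("df1.foo"), col("df1.bar").as("bar_left"), col("df2.bar").as("bar_right"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("alias-self-join").getOrCreate()
    import spark.implicits._

    // Made-up sample data with unique join keys.
    val df = Seq((1, "a"), (2, "b")).toDF("foo", "bar")
    joinOnFoo(df).show()

    spark.stop()
  }
}
```

Renaming the selected columns at the end avoids carrying two columns called `bar` into later steps.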

In general, column renaming, while the ugliest, is the safest practice across all versions. There have been a few bugs related to column resolution (we found one on SO not so long ago), and some details may differ between parsers (HiveContext / standard SQLContext) if you use raw expressions.

Personally, I prefer using aliases because of their resemblance to idiomatic SQL and because they can be used outside the scope of a specific DataFrame object.

Regarding performance, unless you're interested in close-to-real-time processing, there should be no performance difference whatsoever. All of these should generate the same execution plan.
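The renaming approach can be wrapped in a small helper so the suffix logic lives in one place. This is only a sketch under the same assumptions as the question's snippet (a join key named `foo`, suffix `_R`); `withSuffix` and `joinOnFoo` are hypothetical names, not part of the Spark API, and `SparkSession` assumes Spark 2.x or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object RenameSelfJoin {
  // Hypothetical helper: append a suffix to every column name.
  def withSuffix(df: DataFrame, suffix: String): DataFrame =
    df.toDF(df.columns.map(_ + suffix): _*)

  // Self-join on "foo"; after renaming, every column name in the result is unique.
  def joinOnFoo(df: DataFrame): DataFrame =
    df.join(withSuffix(df, "_R"), col("foo") === col("foo_R"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("rename-self-join").getOrCreate()
    import spark.implicits._

    // Made-up sample data with unique join keys.
    val df = Seq((1, "a"), (2, "b")).toDF("foo", "bar")
    joinOnFoo(df).show()

    spark.stop()
  }
}
```

Because no two columns share a name after the rename, later expressions can refer to them by plain name without aliases.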
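One way to check the execution-plan claim on your own Spark version is to print the plans with `explain()`. The sketch below assumes Spark 2.x or later and uses made-up data; `byAlias` and `byRename` are hypothetical helper names. The printed plans may differ in small details, such as an extra projection for the renamed columns, so treat this as a way to inspect rather than as proof of identical plans.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ComparePlans {
  // Self-join via aliases.
  def byAlias(df: DataFrame): DataFrame =
    df.as("df1").join(df.as("df2"), col("df1.foo") === col("df2.foo"))

  // Self-join via renamed columns on the right-hand side.
  def byRename(df: DataFrame): DataFrame =
    df.join(df.toDF(df.columns.map(_ + "_R"): _*), col("foo") === col("foo_R"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("compare-plans").getOrCreate()
    import spark.implicits._

    // Made-up sample data with unique join keys.
    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("foo", "bar")

    // Print the physical plan of each variant for side-by-side comparison.
    byAlias(df).explain()
    byRename(df).explain()

    spark.stop()
  }
}
```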

172

answered Sep 20 '22 08:09

zero323

Tags: dataframe, apache-spark, apache-spark-sql
