Outer join Spark dataframe with non-identical join column and then merge join column

Suppose I have the following dataframes in PySpark:

from pyspark.sql import Row

df1 = sqlContext.createDataFrame([Row(name='john', age=50), Row(name='james', age=25)])
df2 = sqlContext.createDataFrame([Row(name='john', weight=150), Row(name='mike', weight=115)])
df3 = sqlContext.createDataFrame([Row(name='john', age=50, weight=150), Row(name='james', age=25, weight=None), Row(name='mike', age=None, weight=115)])

Now suppose I want to create df3 by joining/merging df1 and df2.

I tried:

df1.join(df2, df1.name == df2.name, 'outer')

This doesn't quite work because the result contains two name columns. I then need to combine them so that a name missing from one column is filled in from the other.

How would I do that? Or is there a better way to create df3 from df1 and df2?

asked Aug 23 '15 09:08

plam

2 Answers

You can use the coalesce function, which returns the first non-null argument.

from pyspark.sql.functions import coalesce

df1 = df1.alias("df1")
df2 = df2.alias("df2")

(df1.join(df2, df1.name == df2.name, 'outer')
  .withColumn("name_", coalesce("df1.name", "df2.name"))
  .drop("name")
  .withColumnRenamed("name_", "name"))
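
To see what the outer join plus coalesce does to the sample data without needing a Spark session, here is a minimal plain-Python sketch. It is only an illustration, not Spark code: the helper names coalesce and full_outer_join_on_name and the list-of-dicts representation are made up for this example, and it assumes each name appears at most once per side.

```python
# Plain-Python illustration of the idea; no Spark involved.
# Both helpers below are hypothetical and exist only for this sketch.

def coalesce(*values):
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


def full_outer_join_on_name(left, right):
    """Full outer join of two lists of dicts on 'name', with one merged name.

    Assumes each name occurs at most once in each input list.
    """
    left_by_name = {row["name"]: row for row in left}
    right_by_name = {row["name"]: row for row in right}
    # Names from the left side first, then names that occur only on the right.
    names = list(left_by_name) + [n for n in right_by_name if n not in left_by_name]

    result = []
    for n in names:
        l = left_by_name.get(n, {})
        r = right_by_name.get(n, {})
        result.append({
            # One side's name may be missing (None); take whichever exists.
            "name": coalesce(l.get("name"), r.get("name")),
            "age": l.get("age"),
            "weight": r.get("weight"),
        })
    return result


df1_rows = [{"name": "john", "age": 50}, {"name": "james", "age": 25}]
df2_rows = [{"name": "john", "weight": 150}, {"name": "mike", "weight": 115}]

merged = full_outer_join_on_name(df1_rows, df2_rows)
for row in merged:
    print(row)
```

The merged rows match the contents of df3 from the question: john has both values, james has no weight, and mike has no age.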

answered Dec 11 '22 14:12

zero323

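This is a little late, but there is a simpler solution if anyone needs it. It is a small change to the original poster's attempt: pass the join column as a string instead of a column expression. Spark then performs an equi-join on name and keeps a single name column in the result: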

df1.join(df2, 'name', 'outer')

answered Dec 11 '22 14:12

V. Samma

Tags: python, join, apache-spark, apache-spark-sql
