
Full outer join in pyspark data frames

I have created two DataFrames in PySpark, as shown below. Both contain a column `id`, and I want to perform a full outer join on them.

valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
a = sqlContext.createDataFrame(valuesA,['name','id'])

a.show()
+---------+---+
|     name| id|
+---------+---+
|   Pirate|  1|
|   Monkey|  2|
|    Ninja|  3|
|Spaghetti|  4|
+---------+---+


valuesB = [('dave',1),('Thor',2),('face',3), ('test',5)]
b = sqlContext.createDataFrame(valuesB,['Movie','id'])

b.show()
+-----+---+
|Movie| id|
+-----+---+
| dave|  1|
| Thor|  2|
| face|  3|
| test|  5|
+-----+---+


full_outer_join = a.join(b, a.id == b.id, how='full')
full_outer_join.show()

+---------+----+-----+----+
|     name|  id|Movie|  id|
+---------+----+-----+----+
|   Pirate|   1| dave|   1|
|   Monkey|   2| Thor|   2|
|    Ninja|   3| face|   3|
|Spaghetti|   4| null|null|
|     null|null| test|   5|
+---------+----+-----+----+

I want the result of the full outer join to look like this:

+---------+-----+----+
|     name|Movie|  id|
+---------+-----+----+
|   Pirate| dave|   1|
|   Monkey| Thor|   2|
|    Ninja| face|   3|
|Spaghetti| null|   4|
|     null| test|   5|
+---------+-----+----+

I tried the following, but I am getting a different result:

full_outer_join = a.join(b, a.id == b.id, how='full').select(a.name, a.id, b.Movie)
full_outer_join.show()
+---------+----+-----+
|     name|  id|Movie|
+---------+----+-----+
|   Pirate|   1| dave|
|   Monkey|   2| Thor|
|    Ninja|   3| face|
|Spaghetti|   4| null|
|     null|null| test|
+---------+----+-----+

As you can see, I am missing id 5 in my result DataFrame.

How can I achieve what I want?

asked May 08 '18 by User12345


2 Answers

Since the join column has the same name in both DataFrames, you can specify it as a list:

a.join(b, ['id'], how='full').show()
+---+---------+-----+
| id|     name|Movie|
+---+---------+-----+
|  5|     null| test|
|  1|   Pirate| dave|
|  3|    Ninja| face|
|  2|   Monkey| Thor|
|  4|Spaghetti| null|
+---+---------+-----+
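When you join on a list of column names like `['id']`, Spark emits a single `id` column, filled from whichever side has the key, with nulls for the missing side's columns. A plain-Python sketch of those semantics, using simple dicts as rows (this only illustrates the behavior, not how Spark implements it):

```python
def full_outer_join_on_id(left, right):
    # Emit one row per id seen on either side; a missing side contributes None.
    ids = {row['id'] for row in left} | {row['id'] for row in right}
    lmap = {row['id']: row for row in left}
    rmap = {row['id']: row for row in right}
    return [
        {'id': i,
         'name': lmap.get(i, {}).get('name'),
         'Movie': rmap.get(i, {}).get('Movie')}
        for i in sorted(ids)
    ]

a_rows = [{'name': 'Pirate', 'id': 1}, {'name': 'Spaghetti', 'id': 4}]
b_rows = [{'Movie': 'dave', 'id': 1}, {'Movie': 'test', 'id': 5}]
rows = full_outer_join_on_id(a_rows, b_rows)
# ids 1, 4, and 5 each appear exactly once, with None where a side has no match
```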

Or coalesce the two id columns:

import pyspark.sql.functions as F
a.join(b, a.id == b.id, how='full').select(
    F.coalesce(a.id, b.id).alias('id'), a.name, b.Movie
).show()
+---+---------+-----+
| id|     name|Movie|
+---+---------+-----+
|  5|     null| test|
|  1|   Pirate| dave|
|  3|    Ninja| face|
|  2|   Monkey| Thor|
|  4|Spaghetti| null|
+---+---------+-----+
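The coalesce works because `F.coalesce` returns its first non-null argument: for matched rows it returns `a.id`, and for the unmatched row from `b` it falls back to `b.id`. A minimal plain-Python sketch of that null-skipping behavior (the PySpark function operates on columns rather than scalars):

```python
def coalesce(*values):
    # Return the first value that is not None, mirroring F.coalesce's
    # treatment of nulls; returns None if every argument is None.
    return next((v for v in values if v is not None), None)

coalesce(4, None)     # matched left row: a.id wins -> 4
coalesce(None, 5)     # unmatched left row: fall back to b.id -> 5
```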
answered Sep 20 '22 by Psidom


You can either rename the id column in DataFrame b (and drop the extra column after the join), or pass the join column as a list in the join condition:

a.join(b, ['id'], how='full')

Output:

+---+---------+-----+
|id |name     |Movie|
+---+---------+-----+
|1  |Pirate   |dave |
|3  |Ninja    |face |
|5  |null     |test |
|4  |Spaghetti|null |
|2  |Monkey   |Thor |
+---+---------+-----+
answered Sep 22 '22 by koiralo