
Full outer join in pyspark data frames

I have created two DataFrames in PySpark, as shown below. Both contain a column `id`, and I want to perform a full outer join on them.

valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
a = sqlContext.createDataFrame(valuesA,['name','id'])

a.show()
+---------+---+
|     name| id|
+---------+---+
|   Pirate|  1|
|   Monkey|  2|
|    Ninja|  3|
|Spaghetti|  4|
+---------+---+


valuesB = [('dave',1),('Thor',2),('face',3), ('test',5)]
b = sqlContext.createDataFrame(valuesB,['Movie','id'])

b.show()
+-----+---+
|Movie| id|
+-----+---+
| dave|  1|
| Thor|  2|
| face|  3|
| test|  5|
+-----+---+


full_outer_join = a.join(b, a.id == b.id, how='full')
full_outer_join.show()

+---------+----+-----+----+
|     name|  id|Movie|  id|
+---------+----+-----+----+
|   Pirate|   1| dave|   1|
|   Monkey|   2| Thor|   2|
|    Ninja|   3| face|   3|
|Spaghetti|   4| null|null|
|     null|null| test|   5|
+---------+----+-----+----+

I want the result of the full outer join to look like this:

+---------+-----+----+
|     name|Movie|  id|
+---------+-----+----+
|   Pirate| dave|   1|
|   Monkey| Thor|   2|
|    Ninja| face|   3|
|Spaghetti| null|   4|
|     null| test|   5|
+---------+-----+----+

I tried the following, but I am getting a different result:

full_outer_join = a.join(b, a.id == b.id, how='full').select(a.name, a.id, b.Movie)
full_outer_join.show()
+---------+----+-----+
|     name|  id|Movie|
+---------+----+-----+
|   Pirate|   1| dave|
|   Monkey|   2| Thor|
|    Ninja|   3| face|
|Spaghetti|   4| null|
|     null|null| test|
+---------+----+-----+

As you can see, I am missing id 5 in my result DataFrame.

How can I achieve what I want?

asked May 08 '18 by User12345


2 Answers

Since the join column has the same name in both DataFrames, you can specify it as a list:

a.join(b, ['id'], how='full').show()
+---+---------+-----+
| id|     name|Movie|
+---+---------+-----+
|  5|     null| test|
|  1|   Pirate| dave|
|  3|    Ninja| face|
|  2|   Monkey| Thor|
|  4|Spaghetti| null|
+---+---------+-----+
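When you join on a list of column names like `['id']`, Spark emits a single `id` column, filled from whichever side has the key, with nulls for the missing side's columns. A plain-Python sketch of those semantics, using simple dicts as rows (this only illustrates the behavior, not how Spark implements it):

```python
def full_outer_join_on_id(left, right):
    # Emit one row per id seen on either side; a missing side contributes None.
    ids = {row['id'] for row in left} | {row['id'] for row in right}
    lmap = {row['id']: row for row in left}
    rmap = {row['id']: row for row in right}
    return [
        {'id': i,
         'name': lmap.get(i, {}).get('name'),
         'Movie': rmap.get(i, {}).get('Movie')}
        for i in sorted(ids)
    ]

a_rows = [{'name': 'Pirate', 'id': 1}, {'name': 'Spaghetti', 'id': 4}]
b_rows = [{'Movie': 'dave', 'id': 1}, {'Movie': 'test', 'id': 5}]
rows = full_outer_join_on_id(a_rows, b_rows)
# ids 1, 4, and 5 each appear exactly once, with None where a side has no match
```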

Or coalesce the two id columns:

import pyspark.sql.functions as F
a.join(b, a.id == b.id, how='full').select(
    F.coalesce(a.id, b.id).alias('id'), a.name, b.Movie
).show()
+---+---------+-----+
| id|     name|Movie|
+---+---------+-----+
|  5|     null| test|
|  1|   Pirate| dave|
|  3|    Ninja| face|
|  2|   Monkey| Thor|
|  4|Spaghetti| null|
+---+---------+-----+
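The coalesce works because `F.coalesce` returns its first non-null argument: for matched rows it returns `a.id`, and for the unmatched row from `b` it falls back to `b.id`. A minimal plain-Python sketch of that null-skipping behavior (the PySpark function operates on columns rather than scalars):

```python
def coalesce(*values):
    # Return the first value that is not None, mirroring F.coalesce's
    # treatment of nulls; returns None if every argument is None.
    return next((v for v in values if v is not None), None)

coalesce(4, None)     # matched left row: a.id wins -> 4
coalesce(None, 5)     # unmatched left row: fall back to b.id -> 5
```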
answered Sep 20 '22 by Psidom


You can either rename the id column in DataFrame b (and drop the extra column after the join), or pass the join column as a list in the join condition:

a.join(b, ['id'], how='full')

Output:

+---+---------+-----+
|id |name     |Movie|
+---+---------+-----+
|1  |Pirate   |dave |
|3  |Ninja    |face |
|5  |null     |test |
|4  |Spaghetti|null |
|2  |Monkey   |Thor |
+---+---------+-----+
answered Sep 22 '22 by koiralo