What I would like to do is:
Join two DataFrames A
and B
using their respective id
columns a_id
and b_id
. I want to select all columns from A
and two specific columns from B
I tried something like what I put below with different quotation marks but still not working. I feel in pyspark, there should have a simple way to do this.
A_B = A.join(B, A.id == B.id).select(A.*, B.b1, B.b2)
I know you could write
A_B = sqlContext.sql("SELECT A.*, B.b1, B.b2 FROM A JOIN B ON A.a_id = B.b_id")
to do this but I would like to do it more like the pseudo code above.
Here In first dataframe (dataframe1) , the columns ['ID', 'NAME', 'Address'] and second dataframe (dataframe2 ) columns are ['ID','Age']. Now we have to add the Age column to the first dataframe and NAME and Address in the second dataframe, we can do this by using lit() function. This function is available in pyspark.
Outer Join Using merge() Using merge() you can do merging by columns, merging by index, merging on multiple columns, and different join types. By default, it joins on all common columns that exist on both DataFrames and performs an inner join, to do an outer join use how param with outer value.
Your pseudocode is basically correct. This slightly modified version would work if the id
column existed in both DataFrames:
A_B = A.join(B, on="id").select("A.*", "B.b1", "B.b2")
From the docs for pyspark.sql.DataFrame.join()
:
If
on
is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
Since the keys are different, you can just use withColumn()
(or withColumnRenamed()
) to created a column with the same name in both DataFrames:
A_B = A.withColumn("id", col("a_id")).join(B.withColumn("id", col("b_id")), on="id")\
.select("A.*", "B.b1", "B.b2")
If your DataFrames have long complicated names, you could also use alias()
to make things easier:
A_B = long_data_frame_name1.alias("A").withColumn("id", col("a_id"))\
.join(long_data_frame_name2.alias("B").withColumn("id", col("b_id")), on="id")\
.select("A.*", "B.b1", "B.b2")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With