
Join two DataFrames where the join key is different and only select some columns

What I would like to do is:

Join two DataFrames A and B using their respective id columns, a_id and b_id. I want to select all columns from A and two specific columns from B.

I tried something like the code below, with different quotation marks, but it still does not work. I feel that in PySpark there should be a simple way to do this.

A_B = A.join(B, A.id == B.id).select(A.*, B.b1, B.b2)

I know you could write

A_B = sqlContext.sql("SELECT A.*, B.b1, B.b2 FROM A JOIN B ON A.a_id = B.b_id")

to do this, but I would prefer something closer to the pseudocode above.

asked Apr 06 '18 04:04 by ASU_TY




1 Answer

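Your pseudocode is basically correct. This slightly modified version would work if the id column existed in both DataFrames. The qualifiers in the select() strings refer to DataFrame aliases rather than Python variable names, so each DataFrame is aliased first:

A_B = A.alias("A").join(B.alias("B"), on="id").select("A.*", "B.b1", "B.b2")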

From the docs for pyspark.sql.DataFrame.join():

If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.

Since the keys are different, you can just use withColumn() (or withColumnRenamed()) to create a column with the same name in both DataFrames:

from pyspark.sql.functions import col
A_B = A.alias("A").withColumn("id", col("a_id"))\
    .join(B.alias("B").withColumn("id", col("b_id")), on="id")\
    .select("A.*", "B.b1", "B.b2")

Because the qualifiers in select() are aliases, alias() also keeps things readable when your DataFrames have long, complicated names:

A_B = long_data_frame_name1.alias("A").withColumn("id", col("a_id"))\
    .join(long_data_frame_name2.alias("B").withColumn("id", col("b_id")), on="id")\
    .select("A.*", "B.b1", "B.b2")
answered Sep 28 '22 08:09 by pault