
Multiple consecutive join with pyspark

I'm trying to join multiple DataFrames together. Because of how join works, I end up with the same column name duplicated across the result.

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.

# Join Min and Max to S1
joinned_s1 = (minTime.join(maxTime, minTime["UserId"] == maxTime["UserId"]))

# Join S1 and sum to s2
joinned_s2 = (joinned_s1.join(sumTime, joinned_s1["UserId"] == sumTime["UserId"]))

I got this error: "Reference 'UserId' is ambiguous, could be: UserId#1578, UserId#3014."

What is the proper way of removing W from my dataset once successfully joined?

Ahmet asked Jul 19 '16 21:07


1 Answer

You can use an equi-join:

 minTime.join(maxTime, ["UserId"]).join(sumTime, ["UserId"])

or use aliases:

from pyspark.sql.functions import col

minTime.alias("minTime").join(
    maxTime.alias("maxTime"),
    col("minTime.UserId") == col("maxTime.UserId")
)

or reference the parent table:

(minTime
  .join(maxTime, minTime["UserId"] == maxTime["UserId"])
  .join(sumTime, minTime["UserId"] == sumTime["UserId"]))

As a side note, you're quoting the RDD docs, not the DataFrame ones. These are different data structures and don't operate the same way.

Also, it looks like you're doing something odd here. Assuming you have a single parent table, min, max and sum can be computed as simple aggregations, without any join.

zero323 answered Sep 17 '22 18:09