I'm trying to join multiple DataFrames together. Because of how joins work, I end up with the same column name duplicated all over the result.
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
# Join min and max into s1
joined_s1 = minTime.join(maxTime, minTime["UserId"] == maxTime["UserId"])
# Join s1 and sum into s2
joined_s2 = joined_s1.join(sumTime, joined_s1["UserId"] == sumTime["UserId"])
I got this error: "Reference 'UserId' is ambiguous, could be: UserId#1578, UserId#3014.;"
What is the proper way of removing W from my dataset once successfully joined?
You can use an equi-join, which takes a list of column names and keeps only one copy of each join column:
minTime.join(maxTime, ["UserId"]).join(sumTime, ["UserId"])
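For example, a minimal runnable sketch (the sample data here is invented for illustration; only the join mechanics match the snippet above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for the question's DataFrames
minTime = spark.createDataFrame([(1, 10), (2, 5)], ["UserId", "min_time"])
maxTime = spark.createDataFrame([(1, 99), (2, 42)], ["UserId", "max_time"])
sumTime = spark.createDataFrame([(1, 300), (2, 120)], ["UserId", "sum_time"])

# Joining on a list of column names keeps a single UserId column,
# so later references to "UserId" are no longer ambiguous
joined = minTime.join(maxTime, ["UserId"]).join(sumTime, ["UserId"])
joined.show()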
or use aliases:

from pyspark.sql.functions import col

minTime.alias("minTime").join(
    maxTime.alias("maxTime"),
    col("minTime.UserId") == col("maxTime.UserId")
)
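After an alias join the result still contains both UserId columns, but each one can now be referenced unambiguously through its alias. A sketch of selecting from the aliased result (the value column names min_time and max_time come from the toy data above, not from the question):

result = (minTime.alias("minTime")
    .join(maxTime.alias("maxTime"),
          col("minTime.UserId") == col("maxTime.UserId"))
    .select(col("minTime.UserId").alias("UserId"),
            "minTime.min_time",
            "maxTime.max_time"))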
or reference the parent table:
(minTime
.join(maxTime, minTime["UserId"] == maxTime["UserId"])
.join(sumTime, minTime["UserId"] == sumTime["UserId"]))
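With this form both copies of UserId survive each join, which is where the "removing W" part of your question comes in: you can drop the duplicates by referencing them through their parent DataFrames. A sketch under the same toy data as above (DataFrame.drop also accepts a Column, not just a name):

joined = (minTime
    .join(maxTime, minTime["UserId"] == maxTime["UserId"])
    .join(sumTime, minTime["UserId"] == sumTime["UserId"])
    .drop(maxTime["UserId"])
    .drop(sumTime["UserId"]))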
On a side note, you're quoting the RDD docs, not the DataFrame ones. These are different data structures and don't operate in the same way.
Also, it looks like you're doing something strange here. Assuming you have a single parent table, min, max and sum can be computed as simple aggregations without any join at all, as sketched below.
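A minimal sketch of that aggregation approach (the parent table events and its columns UserId and time are assumptions for illustration, not taken from the question):

from pyspark.sql import functions as F

# Toy parent table; the real one is whatever minTime/maxTime/sumTime were derived from
events = spark.createDataFrame([(1, 10), (1, 99), (2, 5), (2, 42)], ["UserId", "time"])

# One groupBy produces all three statistics with a single UserId column
stats = events.groupBy("UserId").agg(
    F.min("time").alias("minTime"),
    F.max("time").alias("maxTime"),
    F.sum("time").alias("sumTime"),
)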