I tried to follow this answer but my question is slightly different.
I have two PySpark DataFrames, df2 and bears2. Both have an integer column, and I want to create a boolean like this pseudocode:
df3 = df2.withColumn("game", (df2.week_id.isin(bears2.week_if), 1,0))
Basically, if a row's week_id value in df2 exists in the corresponding column of bears2, I want a 1, else a 0.
I tried the expr() approach from the other question, but wasn't able to get it to work. It looked like this:

new_column_1 = F.expr(
    """IF(df2.week_id IN(bears2.week_if), 1, 0)"""
)
You can't currently use IN like that with pyspark-sql. An expr() can only reference columns of the DataFrame it is evaluated against, so it can't test membership in a column of a second DataFrame. Instead, you will have to join the DataFrames.

Try something like:
from pyspark.sql.functions import col, when

# Give both DataFrames a join key with the same name ("id") and alias them
# so their columns can be referenced unambiguously after the join.
df3 = df2.withColumn("id", col("week_id")).alias("df2")\
    .join(bears2.withColumn("id", col("week_if")).alias("bears2"), on="id", how="left")\
    .select("df2.*", when(col("bears2.id").isNotNull(), 1).otherwise(0).alias("game"))
For the join to work, the join key column has to exist in both DataFrames. For that reason, I first call withColumn("id", ...) on each side to copy its key into a column with the same name for the join.
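As an aside, if you'd rather not add the helper column, you can join on an explicit condition instead. A rough sketch of the same idea, using the column names from your question:

from pyspark.sql.functions import col, when

# Same result without the helper "id" column: join on an explicit condition.
df3 = df2.alias("df2")\
    .join(bears2.alias("bears2"), col("df2.week_id") == col("bears2.week_if"), "left")\
    .select("df2.*", when(col("bears2.week_if").isNotNull(), 1).otherwise(0).alias("game"))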
Back in the original snippet: next we do a LEFT join to keep all of the rows in df2. Finally we select all of the columns in df2 and use pyspark.sql.functions.when() to create the boolean column.
The first argument to when() is a condition. If it is True, the second argument is returned. If not, the value in otherwise() is used.
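A quick standalone illustration of that when()/otherwise() behavior, with invented toy data and column names:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()

# Toy data with an invented column "n", just to show when()/otherwise().
df = spark.createDataFrame([(5,), (15,)], ["n"])

# Rows matching the condition get "big", all others fall through to "small".
df.select("n", when(col("n") > 10, "big").otherwise("small").alias("size")).show()
# +---+-----+
# |  n| size|
# +---+-----+
# |  5|small|
# | 15|  big|
# +---+-----+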