
Pyspark sql: Create a new column based on whether a value exists in a different DataFrame's column

I tried to follow this answer but my question is slightly different.

I have two pyspark DataFrames, df2 and bears2. Both have an integer column, and I want to create a boolean column, as in this pseudocode:

df3 = df2.withColumn("game", (df2.week_id.isin(bears2.week_if), 1,0))

Basically, if a value of df2.week_id exists in bears2.week_if, I want a 1, else a 0.

I tried the expr() approach from the other question, but wasn't able to get it to work. It looked like this:

new_column_1 = F.expr(
    """IF(df2.week_id IN(bears2.week_if), 1, 0))"""
    )
asked May 30 '18 by mlewis


1 Answer

You can't currently use IN with a column from a different DataFrame like that in pyspark-sql. Instead, you will have to join the DataFrames.

Try something like:

from pyspark.sql.functions import col, when

df3 = df2.withColumn("id", col("week_id")).alias("df2")\
    .join(bears2.withColumn("id", col("week_if")).alias("bears2"), on="id", how="left")\
    .select("df2.*", when(col("bears2.id").isNotNull(), 1).otherwise(0).alias("game"))

For the join to work, the join key has to exist under the same name in both DataFrames. For that reason, I first call withColumn("id", ...) on each DataFrame to add a copy of its key column under the common name "id".

Next we do a LEFT join to keep all of the rows in df2 (non-matching rows get nulls in the bears2 columns). Finally we select all of the columns in df2 and use pyspark.sql.functions.when() to create the boolean column.

The first argument to when() is a condition. If it is True, the second argument is returned. If not, the value in otherwise() is used.

answered Oct 05 '22 by pault