I tried to follow this answer but my question is slightly different.
I have two PySpark DataFrames, df2 and bears2. Both have an integer column, and I want to create a boolean like this pseudocode:
df3 = df2.withColumn("game", (df2.week_id.isin(bears2.week_if), 1,0))
Basically, if a row's week_id value in df2 exists in the corresponding column of bears2, I want a 1, else a 0.
I tried the expr() approach from the other question, but wasn't able to get it to work. It looked like this:

new_column_1 = F.expr(
    """IF(df2.week_id IN(bears2.week_if), 1, 0)"""
)
You can't currently use IN like that with pyspark-sql. An expr() can only reference columns of the DataFrame it is evaluated against, so it can't test membership in a column of a second DataFrame. Instead, you will have to join the DataFrames.

Try something like:
from pyspark.sql.functions import col, when

# Give both DataFrames a join key with the same name ("id") and alias them
# so their columns can be referenced unambiguously after the join.
df3 = df2.withColumn("id", col("week_id")).alias("df2")\
    .join(bears2.withColumn("id", col("week_if")).alias("bears2"), on="id", how="left")\
    .select("df2.*", when(col("bears2.id").isNotNull(), 1).otherwise(0).alias("game"))
For the join to work, the join key column has to exist in both DataFrames. For that reason, I first call withColumn("id", ...) on each side to copy its key into a column with the same name for the join.
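As an aside, if you'd rather not add the helper column, you can join on an explicit condition instead. A rough sketch of the same idea, using the column names from your question:

from pyspark.sql.functions import col, when

# Same result without the helper "id" column: join on an explicit condition.
df3 = df2.alias("df2")\
    .join(bears2.alias("bears2"), col("df2.week_id") == col("bears2.week_if"), "left")\
    .select("df2.*", when(col("bears2.week_if").isNotNull(), 1).otherwise(0).alias("game"))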
Back in the original snippet: next we do a LEFT join to keep all of the rows in df2. Finally we select all of the columns in df2 and use pyspark.sql.functions.when() to create the boolean column.
The first argument to when() is a condition. If it is True, the second argument is returned. If not, the value in otherwise() is used.
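A quick standalone illustration of that when()/otherwise() behavior, with invented toy data and column names:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()

# Toy data with an invented column "n", just to show when()/otherwise().
df = spark.createDataFrame([(5,), (15,)], ["n"])

# Rows matching the condition get "big", all others fall through to "small".
df.select("n", when(col("n") > 10, "big").otherwise("small").alias("size")).show()
# +---+-----+
# |  n| size|
# +---+-----+
# |  5|small|
# | 15|  big|
# +---+-----+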