
Compare two columns to create a new column in Spark DataFrame

I have a Spark DataFrame with two columns, and I am trying to create a new column from them using the when/otherwise operation.

df_newcol = df.withColumn("Flag", when(col("a") <= lit(ratio1) | col("b") <= lit(ratio1), 1).otherwise(2))

But this throws an error

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

I have used when and otherwise previously with one column. When using them with multiple columns, do we have to write the logic differently?

Thanks.

Pramod Sripada asked Jan 28 '23 19:01



1 Answer

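You have an operator precedence issue. In Python, & and | bind more tightly than comparison operators such as <=, so each comparison must be wrapped in parentheses when it is combined with & or |. Without them, lit(ratio1) | col("b") is evaluated first, and the result is a chained comparison that forces Python to convert a Column to bool, which raises the ValueError. Once that is fixed, you don't even need lit; a plain scalar works as well: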

import pyspark.sql.functions as F
df = spark.createDataFrame([[1, 2], [2, 3], [3, 4]], ['a', 'b'])

Both of the following should work:

df.withColumn('flag', F.when((F.col("a") <= F.lit(2)) | (F.col("b") <= F.lit(2)), 1).otherwise(2)).show()
+---+---+----+
|  a|  b|flag|
+---+---+----+
|  1|  2|   1|
|  2|  3|   1|
|  3|  4|   2|
+---+---+----+

df.withColumn('flag', F.when((F.col("a") <= 2) | (F.col("b") <= 2), 1).otherwise(2)).show()
+---+---+----+
|  a|  b|flag|
+---+---+----+
|  1|  2|   1|
|  2|  3|   1|
|  3|  4|   2|
+---+---+----+
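
To see why the unparenthesized version fails, here is a small sketch that runs without Spark. It uses Python's ast module (ast.unparse needs Python 3.9+) to show how each expression is grouped, plus a hypothetical FakeColumn class that only imitates the relevant behaviour of a Spark Column: it is not the real class, and the names top_level_shape, FakeColumn and _text are made up for illustration.

```python
import ast


def top_level_shape(expr: str) -> str:
    """Describe the outermost operation Python sees in `expr`."""
    node = ast.parse(expr, mode="eval").body
    if isinstance(node, ast.Compare):
        operands = [node.left, *node.comparators]
        return "chained comparison: " + "; ".join(ast.unparse(o) for o in operands)
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.BitOr):
        return f"bitwise or: {ast.unparse(node.left)}; {ast.unparse(node.right)}"
    return type(node).__name__


class FakeColumn:
    """Hypothetical stand-in for a Spark Column (assumption, not the real class)."""

    def __init__(self, expr: str):
        self.expr = expr

    def __le__(self, other):
        # Comparisons build a new expression instead of returning True/False.
        return FakeColumn(f"({self.expr} <= {_text(other)})")

    def __or__(self, other):
        return FakeColumn(f"({self.expr} | {_text(other)})")

    def __bool__(self):
        # Imitates the refusal described by the error message in the question.
        raise ValueError("Cannot convert column into bool")


def _text(value) -> str:
    return value.expr if isinstance(value, FakeColumn) else repr(value)


a, b, r = FakeColumn("a"), FakeColumn("b"), FakeColumn("2")

# How Python groups the two spellings.
print(top_level_shape("a <= r | b <= r"))
print(top_level_shape("(a <= r) | (b <= r)"))

# The chained comparison expands to `a <= (r | b) and (r | b) <= r`,
# and `and` needs a bool, so the stand-in raises.
try:
    a <= r | b <= r
except ValueError as exc:
    print("unparenthesized:", exc)

# With parentheses, only <= and | are called, so no bool conversion happens.
print("parenthesized:", ((a <= r) | (b <= r)).expr)
```

The first call reports a chained comparison whose middle operand is r | b, which is what triggers the bool conversion; the parenthesized form is a single | between two comparisons.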
Psidom answered Mar 06 '23 22:03
