I have a DataFrame containing the following two columns, among others: 1. ID 2. list_IDs
I am trying to create a third column that holds a boolean True or False depending on whether the ID is present in the list_IDs column of the same row.
I have tried the following:
from pyspark.sql.functions import col, when
df = sqlContext.createDataFrame([(1, [1, 2, 3]), (2, [1, 3, 4])], ("ID", "list_IDs"))
df.withColumn("IDmatch", when(col("ID").isin(col("list_IDs")), True).otherwise(False)).show()
That doesn't work. However, if I provide a static list to match against, it works, of course:
df.withColumn("IDmatch", when(col("ID").isin([2, 3]), True).otherwise(False)).show()
I can use a UDF that returns a boolean type, and that works as well:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
@udf(returnType=BooleanType())
def isinlist(x, y):
    return x in y
However, I would like to avoid a UDF here if possible. Is there something native, akin to .isin(), that checks whether the ID in a row is present in the list of values in the list_IDs column of the same row?
Method 1:
If you are on Spark >= 2.4.0, you can use the built-in arrays_overlap function. It takes two arrays and checks whether they have any elements in common, so wrapping ID in a single-element array turns it into a membership test.
from pyspark.sql.functions import arrays_overlap, array
df.withColumn("IDmatch", arrays_overlap(df.list_IDs, array(df.ID))).show()
Output:
+---+---------+-------+
| ID| list_IDs|IDmatch|
+---+---------+-------+
|  1|[1, 2, 3]|   true|
|  2|[1, 3, 4]|  false|
+---+---------+-------+
You can read more about it here: https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.arrays_overlap
Method 2:
Alternatively, you can use a UDF to obtain the same output:
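from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
# Declare BooleanType explicitly; udf() defaults to a string return type.
element_check = udf(lambda elt_list, elt: elt in elt_list, BooleanType())
df.withColumn("IDmatch", element_check(df.list_IDs, df.ID)).show()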