Pyspark: Match values in one column against a list in same row in another column

Question

I have a DataFrame containing the following two columns, among others: ID and list_IDs.

I am trying to create a third column that is True if the ID is present in the list_IDs column of the same row, and False otherwise.

I have tried using the following:

from pyspark.sql import functions as F
from pyspark.sql.functions import col, when

df = sqlContext.createDataFrame([(1, [1, 2, 3]), (2, [1, 3, 4])], ("ID", "list_IDs"))

df.withColumn("IDmatch", when(col("ID").isin(F.col("list_IDs")), True).otherwise(False)).show()

That doesn't work. However, if I provide a static list to match against, it works as expected:

df.withColumn("IDmatch", when(col("ID").isin([2, 3]), True).otherwise(False)).show()

I can use a udf to return a boolean type and that works as well:

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

@udf(returnType=BooleanType())
def isinlist(x, y):
    return x in y

However, I would like to avoid a UDF here if possible. Is there something native, akin to .isin(), that checks whether the ID in a row is present in the list_IDs column of the same row?

noufel13 · Accepted Answer

Method 1:

If you are on Spark >= 2.4.0, you can use the built-in arrays_overlap function. It takes two arrays and returns true if they have at least one non-null element in common. Wrapping ID in a single-element array therefore turns it into a membership check.

from pyspark.sql.functions import arrays_overlap, array

df.withColumn("IDmatch", arrays_overlap(df.list_IDs, array(df.ID))).show()

Output:

+---+---------+-------+
| ID| list_IDs|IDmatch|
+---+---------+-------+
|  1|[1, 2, 3]|   true|
|  2|[1, 3, 4]|  false|
+---+---------+-------+

You can read more about it here: https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.arrays_overlap

Method 2:

Alternatively, you can use a UDF to obtain the same output:

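from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# udf() defaults to a string return type, so declare BooleanType explicitly
element_check = udf(lambda elt_list, elt: elt in elt_list, BooleanType())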

df.withColumn("IDmatch", element_check(df.list_IDs, df.ID)).show()
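
One case the lambda above does not cover is a null list_IDs value: inside a Python UDF it arrives as None, and testing membership in None raises a TypeError. Below is a minimal null-safe sketch of the same check. The name contains_id is hypothetical and not part of the original answer.

```python
def contains_id(id_list, id_value):
    """Membership check that tolerates None in either input."""
    # Spark passes SQL NULL into a Python UDF as None; returning None
    # yields a NULL cell instead of raising a TypeError.
    if id_list is None or id_value is None:
        return None
    return id_value in id_list


print(contains_id([1, 2, 3], 1))  # True
print(contains_id([1, 3, 4], 2))  # False
print(contains_id(None, 2))       # None
```

It could replace the lambda by wrapping it as udf(contains_id, BooleanType()).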

Tags:

python

apache-spark

pyspark

Tarun
