
PySpark: How to check if list of string values exists in dataframe and print values to a list

I have a DataFrame NAMES. When I output it via display(NAMES), I get:

NAMES

John

Sarah

Michael

Sean

I also have a list students; print(students) gives:

{John, Alan, Andy}

Question:

Given this list (students), how can I check it against the "NAMES" column of the DataFrame and collect, in another list, the names that appear both in the list and in the DataFrame?

Expected output of list: "John"

I have tried

list2 = []
for i in NAMES:
    for g in students:
        if i == g:
            list2.append(i)

but I end up with an error. How can I implement this in PySpark?

Thanks.

Techno04335 asked Oct 19 '25 05:10



1 Answer

In general, looping through data in PySpark is inefficient. Where possible, use native PySpark functions. For your specific question, you can use filter together with isin, which keeps only the rows of the DataFrame (here called df_names, with a column called name) whose value appears in the students list:

from pyspark.sql.functions import col
df_names.filter(col("name").isin(students)).select("name")

In your example, the resulting DataFrame contains a single row: John.

vielkind answered Oct 21 '25 19:10



