 

How to filter column on values in list in pyspark?

I have a DataFrame dfRawData on which I have to filter column X for the values CB, CI and CR. So I used the code below:

df = dfRawData.filter(col("X").between("CB","CI","CR"))

But I am getting the following error:

between() takes exactly 3 arguments (4 given)

Please let me know how I can resolve this issue.

LKA asked Oct 12 '17

People also ask

How do you check if a column contains a particular value in PySpark?

The contains() method checks whether a DataFrame column's string value contains the string passed as an argument (it matches on part of the string). It returns true if the substring is present and false if not.

How do I select a specific column in PySpark?

In PySpark, the select() function is used to select a single column, multiple columns, columns by index, all columns from a list, or nested columns from a DataFrame. Since select() is a transformation, it returns a new DataFrame containing only the selected columns.


1 Answer

The function between() checks whether a value lies between two values; its inputs are a lower bound and an upper bound. It cannot be used to check whether a column value is in a list. To do that, use isin:

import pyspark.sql.functions as f
df = dfRawData.where(f.col("X").isin(["CB", "CI", "CR"]))
Shaido answered Sep 22 '22