I have a DataFrame dfRawData on which I have to apply a filter condition on column X with the values CB, CI, and CR, so I used the code below:
from pyspark.sql.functions import col

df = dfRawData.filter(col("X").between("CB", "CI", "CR"))
But I am getting the following error:
between() takes exactly 3 arguments (4 given)
Please let me know how I can resolve this issue.
The contains() method checks whether a DataFrame string column contains the string specified as an argument (it matches on part of the string), returning true if the substring exists and false if not.
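For example, a minimal sketch assuming the dfRawData DataFrame from the question (the dfContains name is just for illustration):

import pyspark.sql.functions as f

# contains() performs a substring match, so values such as "CB123"
# would also pass; it is not an exact-equality check.
dfContains = dfRawData.where(f.col("X").contains("CB"))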
In PySpark, the select() function is used to select a single column, multiple columns, columns by index, all columns from a list, or nested columns from a DataFrame. select() is a transformation, so it returns a new DataFrame containing only the selected columns.
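As a quick illustration, a sketch of a few select() variants (the column Y and the alias code are assumed for the example):

import pyspark.sql.functions as f

# Each call returns a new DataFrame; dfRawData itself is unchanged.
df1 = dfRawData.select("X")                       # single column
df2 = dfRawData.select("X", "Y")                  # multiple columns
df3 = dfRawData.select(f.col("X").alias("code"))  # column expression with a rename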
The function between is used to check whether a value lies between two values: its inputs are a lower bound and an upper bound. It cannot be used to check whether a column value is in a list. To do that, use isin:
import pyspark.sql.functions as f
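# Keep only the rows whose X value is one of the three wanted codes.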
df = dfRawData.where(f.col("X").isin(["CB", "CI", "CR"]))
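For contrast, here is a sketch of what between is actually meant for, an inclusive range check (the bounds are illustrative):

import pyspark.sql.functions as f

# between(lower, upper) keeps rows where X falls in the inclusive range
# [lower, upper]; on strings the comparison is lexicographic, so it would
# also match values such as "CC" that are not in the wanted list.
dfRange = dfRawData.where(f.col("X").between("CB", "CR"))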