 

pyspark dataframe filter or include based on list

I am trying to filter a dataframe in pyspark using a list. I want to either exclude the records whose value appears in the list, or keep only the records whose value appears in it. My code below does not work:

    # define a dataframe
    rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
    df = sqlContext.createDataFrame(rdd, ["id", "score"])

    # define a list of scores
    l = [10,18,20]

    # filter out records by scores by list l
    records = df.filter(df.score in l)
    # expected: (0,1), (0,1), (0,2), (1,2)

    # include only records with these scores in list l
    records = df.where(df.score in l)
    # expected: (1,10), (1,20), (3,18), (3,18), (3,18)

This gives the following error: ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

user3133475 asked Nov 04 '16 11:11




1 Answer

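The error says that "df.score in l" cannot be evaluated. df.score is a Column, and Python's "in" operator needs a plain True or False, but comparing a Column with a value produces another Column expression that cannot be converted to bool. Use the Column method "isin" instead.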
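To make the mechanism concrete, here is a minimal pure-Python sketch. FakeColumn is a hypothetical stand-in written only for illustration; it is not PySpark's Column class and only imitates the relevant behavior (comparisons return an expression object, and converting that object to bool raises). It needs no Spark installation.

```python
# FakeColumn is a hypothetical toy class, NOT PySpark's implementation.
class FakeColumn:
    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # A comparison builds a new expression instead of returning True/False.
        return FakeColumn(f"({self.expr} = {other!r})")

    def __invert__(self):
        # "~" negates the expression.
        return FakeColumn(f"(NOT {self.expr})")

    def isin(self, values):
        # Membership is expressed as part of the expression itself.
        joined = ", ".join(repr(v) for v in values)
        return FakeColumn(f"({self.expr} IN ({joined}))")

    def __bool__(self):
        # Python's "in" calls bool() on each comparison result, which fails here.
        raise ValueError("Cannot convert column into bool")


score = FakeColumn("score")
l = [10, 18, 20]

error_message = None
try:
    score in l  # list membership compares with == and then needs a bool
except ValueError as exc:
    error_message = str(exc)

included = score.isin(l)    # expression for "score is in l"
excluded = ~score.isin(l)   # expression for "score is not in l"

print(error_message)
print(included.expr)
print(excluded.expr)
```

In this sketch the membership test fails on the very first list element, while isin() and "~" only build an expression that an engine could evaluate row by row later.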

The code should be like this:

    # define a dataframe
    rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
    df = sqlContext.createDataFrame(rdd, ["id", "score"])

    # define a list of scores
    l = [10,18,20]

    # filter out records by scores by list l
    records = df.filter(~df.score.isin(l))
    # expected: (0,1), (0,1), (0,2), (1,2)

    # include only records with these scores in list l
    df.filter(df.score.isin(l))
    # expected: (1,10), (1,20), (3,18), (3,18), (3,18)

Note that where() is an alias for filter(), so both are interchangeable.

user3133475 answered Oct 05 '22 02:10
