I am trying to filter a DataFrame in PySpark using a list. I want to either exclude the records whose value is in the list, or keep only the records whose value is in the list. My code below does not work:
# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])

# define a list of scores
l = [10, 18, 20]

# filter out records whose score is in list l
records = df.filter(df.score in l)
# expected: (0,1), (0,1), (0,2), (1,2)

# include only records whose score is in list l
records = df.where(df.score in l)
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)
This gives the following error: ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
What the error means is that df.score in l cannot be evaluated: df.score is a Column, and Python's in operator is not defined on that column type. Use isin instead.
The code should be like this:
# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])

# define a list of scores
l = [10, 18, 20]

# filter out records whose score is in list l
records = df.filter(~df.score.isin(l))
# expected: (0,1), (0,1), (0,2), (1,2)

# include only records whose score is in list l
records = df.filter(df.score.isin(l))
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)
Note that where() is an alias for filter(), so the two are interchangeable.