I am trying to filter a DataFrame in PySpark using a list. I want to either exclude the records whose value is in the list, or keep only the records whose value is in the list. My code below does not work:
# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])

# define a list of scores
l = [10, 18, 20]

# filter out records whose score is in list l
records = df.filter(df.score in l)
# expected: (0,1), (0,1), (0,2), (1,2)

# include only records whose score is in list l
records = df.where(df.score in l)
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)
This gives the following error: ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
What the error means is that df.score in l cannot be evaluated: df.score is a Column, and Python's in operator is not defined on that column type. Use isin instead.
The code should be like this:
# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])

# define a list of scores
l = [10, 18, 20]

# filter out records whose score is in list l
records = df.filter(~df.score.isin(l))
# expected: (0,1), (0,1), (0,2), (1,2)

# include only records whose score is in list l
records = df.filter(df.score.isin(l))
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)
Note that where() is an alias for filter(), so the two are interchangeable.