Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to filter a Spark dataframe by a boolean column?

I created a dataframe that has the following schema:

In [43]: yelp_df.printSchema()
root
 |-- business_id: string (nullable = true)
 |-- cool: integer (nullable = true)
 |-- date: string (nullable = true)
 |-- funny: integer (nullable = true)
 |-- id: string (nullable = true)
 |-- stars: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- type: string (nullable = true)
 |-- useful: integer (nullable = true)
 |-- user_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- full_address: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- neighborhoods: string (nullable = true)
 |-- open: boolean (nullable = true)
 |-- review_count: integer (nullable = true)
 |-- state: string (nullable = true)

I want to select only the records with the "open" column that is "true". The following command I run in PySpark returns nothing:

yelp_df.filter(yelp_df["open"] == "true").collect()
like image 497
Nasreddin Avatar asked Apr 22 '16 02:04

Nasreddin


People also ask

How do I filter specific columns in PySpark DataFrame?

Select Single & Multiple Columns From PySpark You can select the single or multiple columns of the DataFrame by passing the column names you wanted to select to the select() function. Since DataFrame is immutable, this creates a new DataFrame with selected columns.

How do you check if a column contains a particular value in PySpark?

In Spark & PySpark, contains() function is used to match a column value contains in a literal string (matches on part of the string), this is mostly used to filter rows on DataFrame.


2 Answers

You're comparing data types incorrectly. open is listed as a Boolean value, not a string, so doing yelp_df["open"] == "true" is incorrect - "true" is a string.

Instead you want to do

yelp_df.filter(yelp_df["open"] == True).collect()

This correctly compares the values of open against the Boolean primitive True, rather than the non-Boolean string "true".

like image 124
Akshat Mahajan Avatar answered Sep 28 '22 02:09

Akshat Mahajan


from pyspark.sql import functions as F

filtered_df = df.filter(F.col('my_bool_col'))

like image 40
X_Trust Avatar answered Sep 28 '22 01:09

X_Trust