I created a dataframe that has the following schema:
In [43]: yelp_df.printSchema()
root
|-- business_id: string (nullable = true)
|-- cool: integer (nullable = true)
|-- date: string (nullable = true)
|-- funny: integer (nullable = true)
|-- id: string (nullable = true)
|-- stars: integer (nullable = true)
|-- text: string (nullable = true)
|-- type: string (nullable = true)
|-- useful: integer (nullable = true)
|-- user_id: string (nullable = true)
|-- name: string (nullable = true)
|-- full_address: string (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- neighborhoods: string (nullable = true)
|-- open: boolean (nullable = true)
|-- review_count: integer (nullable = true)
|-- state: string (nullable = true)
I want to select only the records where the "open" column is true. The following command, run in PySpark, returns nothing:
yelp_df.filter(yelp_df["open"] == "true").collect()
You're comparing data types incorrectly. open is listed as a boolean, not a string, so doing yelp_df["open"] == "true" is incorrect: "true" is a string. Instead you want to do

yelp_df.filter(yelp_df["open"] == True).collect()

This correctly compares the value of open against the Boolean primitive True, rather than the non-Boolean string "true".
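For completeness, here is a minimal, self-contained sketch of the fix; the sample rows and business_id values are made up purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tiny stand-in for yelp_df with the same column types
df = spark.createDataFrame(
    [("b1", True), ("b2", False)],
    ["business_id", "open"],
)

# Comparing against the Python boolean True keeps only the open businesses
df.filter(df["open"] == True).collect()  # -> [Row(business_id='b1', open=True)]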
Alternatively, since a boolean column is itself a valid filter condition, you can pass it to filter() directly:

from pyspark.sql import functions as F

# Rows where the boolean column is True are kept; False and null rows are dropped
filtered_df = df.filter(F.col('my_bool_col'))

Applied to this question, that would be yelp_df.filter(F.col('open')).collect().