How can I return only the rows of a Spark DataFrame where the values for a column are within a specified list?
Here's my Python pandas way of doing this operation:
df_start = df[df['name'].isin(['App Opened', 'App Launched'])].copy()
I saw this SO Scala implementation and tried several permutations, but couldn't get it to work.
Here's one failed attempt to do it using pyspark:
df_start = df_spark.filter(col("name") isin ['App Opened', 'App Launched'])
Output:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6660042787423349557.py", line 253, in <module>
code = compile('\n'.join(final_code), '<stdin>', 'exec', ast.PyCF_ONLY_AST, 1)
File "<stdin>", line 18
df_start = df_spark.filter(col("name") isin ['App Opened', 'App Launched'])
^
SyntaxError: invalid syntax
Another attempt:
df_start = df_spark.filter(col("name").isin(['App Opened', 'App Launched']))
Output:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6660042787423349557.py", line 267, in <module>
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6660042787423349557.py", line 260, in <module>
exec(code)
File "<stdin>", line 18, in <module>
NameError: name 'col' is not defined
Some background: Spark's filter() (or its alias where(), which may read more naturally if you come from SQL) filters the rows of a DataFrame or Dataset based on one or more conditions or a SQL expression and returns a new DataFrame; the two functions behave identically. Separately, if you need to filter on an array-typed column, array_contains() from pyspark.sql.functions returns true when the array contains the given value and false otherwise.
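A minimal sketch of the array_contains() case (assuming a hypothetical array column named tags, which is not in the question's DataFrame):
from pyspark.sql.functions import array_contains
# keep rows whose hypothetical "tags" array contains the value 'App Opened'
df_tagged = df_spark.filter(array_contains(df_spark.tags, 'App Opened'))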
As dmdmdmdmdmd pointed out in the comments, the second method didn't work because col needed to be imported:
from pyspark.sql.functions import col
df_start = df_spark.filter(col("name").isin(['App Opened', 'App Launched']))
Here's another way of accomplishing the filter:
df_start = df_spark.filter(df_spark.name.isin(['App Opened', 'App Launched']))
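If it helps, here's a minimal self-contained sketch (the sample rows are made up purely for illustration) showing both working variants; note that isin() also accepts the values as separate arguments instead of a list:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# toy data for illustration; the column name matches the question
df_spark = spark.createDataFrame(
    [('App Opened',), ('App Launched',), ('App Closed',)],
    ['name'],
)

# equivalent ways to keep only the listed event names
df_start = df_spark.filter(col('name').isin(['App Opened', 'App Launched']))
df_start = df_spark.filter(df_spark.name.isin('App Opened', 'App Launched'))  # varargs form
df_start.show()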