
Is it possible to filter Spark DataFrames to return all rows where a column value is in a list using pyspark?

How can I return only the rows of a Spark DataFrame where the values for a column are within a specified list?

Here's my Python pandas way of doing this operation:

df_start = df[df['name'].isin(['App Opened', 'App Launched'])].copy()

I saw this Scala implementation on SO and tried several permutations, but couldn't get it to work.

Here's one failed attempt to do it using pyspark:

df_start = df_spark.filter(col("name") isin ['App Opened', 'App Launched'])

Output:

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-6660042787423349557.py", line 253, in <module>
    code = compile('\n'.join(final_code), '<stdin>', 'exec', ast.PyCF_ONLY_AST, 1)
  File "<stdin>", line 18
    df_start = df_spark.filter(col("name") isin ['App Opened', 'App Launched'])
                                               ^
SyntaxError: invalid syntax

Another attempt:

df_start = df_spark.filter(col("name").isin(['App Opened', 'App Launched']))

Output:

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-6660042787423349557.py", line 267, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-6660042787423349557.py", line 260, in <module>
    exec(code)
  File "<stdin>", line 18, in <module>
NameError: name 'col' is not defined
asked Mar 13 '17 by mgig

People also ask

How do you filter a DataFrame based on a column value PySpark?

Filter on an array column: when you want to filter rows of a DataFrame based on a value present in an array-type column, you can use array_contains() from pyspark.sql.functions, which checks whether a value is contained in an array and returns true if it is, false otherwise.
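For illustration only, here is a minimal sketch (not part of the original question; the "tags" column and sample data are made up) showing array_contains() on an array-type column:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array_contains

spark = SparkSession.builder.getOrCreate()
# hypothetical DataFrame where "tags" is an array column
df_tags = spark.createDataFrame(
    [(1, ["App Opened", "checkout"]), (2, ["App Closed"])],
    ["id", "tags"])
# keep rows whose "tags" array contains the value "App Opened"
df_tags.filter(array_contains("tags", "App Opened")).show()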

How do I filter rows in Spark DataFrame?

Spark's filter() and where() functions filter the rows of a DataFrame or Dataset based on one or more conditions or a SQL expression. If you are coming from a SQL background you can use where() instead of filter(); both operate exactly the same.
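Assuming the df_spark DataFrame from the question, the two calls below are interchangeable (a sketch, not from the original post):

from pyspark.sql.functions import col

opened = df_spark.filter(col("name") == "App Opened")
opened_same = df_spark.where(col("name") == "App Opened")  # where() is an alias for filter()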

What does PySpark filter return?

In PySpark, filter() is used to filter the rows of a DataFrame. It returns a new DataFrame containing the matching rows; the existing DataFrame is left unchanged.
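In other words, filtering is not done in place. A small sketch using the question's df_spark (the comments are illustrative, not from the original post):

df_start = df_spark.filter(df_spark.name.isin(['App Opened', 'App Launched']))
df_spark.count()  # still the full row count of the original DataFrame
df_start.count()  # only the rows whose name is in the list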


1 Answer

As dmdmdmdmdmd pointed out in the comments, the second method didn't work because col needed to be imported:

from pyspark.sql.functions import col
df_start = df_spark.filter(col("name").isin(['App Opened', 'App Launched']))

Here's another way of accomplishing the filter:

df_start = df_spark.filter(df_spark.name.isin(['App Opened', 'App Launched']))
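For completeness, here is a small self-contained sketch; the SparkSession setup and sample data are assumptions added for illustration, not part of the original answer:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
# hypothetical sample data with a single "name" column
df_spark = spark.createDataFrame(
    [("App Opened",), ("App Launched",), ("App Closed",)],
    ["name"])

df_start = df_spark.filter(col("name").isin(['App Opened', 'App Launched']))
df_start.show()  # only the "App Opened" and "App Launched" rows remain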
answered Oct 27 '22 by mgig