How can I return only the rows of a Spark DataFrame where the values for a column are within a specified list?
Here's my Python pandas way of doing this operation:
df_start = df[df['name'].isin(['App Opened', 'App Launched'])].copy()
I saw this SO Scala implementation and tried several permutations, but couldn't get it to work.
Here's one failed attempt to do it using pyspark:
df_start = df_spark.filter(col("name") isin ['App Opened', 'App Launched'])
Output:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6660042787423349557.py", line 253, in <module>
code = compile('\n'.join(final_code), '<stdin>', 'exec', ast.PyCF_ONLY_AST, 1)
File "<stdin>", line 18
df_start = df_spark.filter(col("name") isin ['App Opened', 'App Launched'])
^
SyntaxError: invalid syntax
Another attempt:
df_start = df_spark.filter(col("name").isin(['App Opened', 'App Launched']))
Output:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6660042787423349557.py", line 267, in <module>
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6660042787423349557.py", line 260, in <module>
exec(code)
File "<stdin>", line 18, in <module>
NameError: name 'col' is not defined
Some background: Spark's filter() (or its alias where(), which may read more naturally if you come from SQL) filters the rows of a DataFrame or Dataset based on one or more conditions or a SQL expression and returns a new DataFrame; the two functions behave identically. Separately, if you need to filter on an array-typed column, array_contains() from pyspark.sql.functions returns true when the array contains the given value and false otherwise.
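A minimal sketch of the array_contains() case (assuming a hypothetical array column named tags, which is not in the question's DataFrame):
from pyspark.sql.functions import array_contains
# keep rows whose hypothetical "tags" array contains the value 'App Opened'
df_tagged = df_spark.filter(array_contains(df_spark.tags, 'App Opened'))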
As dmdmdmdmdmd pointed out in the comments, the second method didn't work because col needed to be imported:
from pyspark.sql.functions import col
df_start = df_spark.filter(col("name").isin(['App Opened', 'App Launched']))
Here's another way of accomplishing the filter:
df_start = df_spark.filter(df_spark.name.isin(['App Opened', 'App Launched']))
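If it helps, here's a minimal self-contained sketch (the sample rows are made up purely for illustration) showing both working variants; note that isin() also accepts the values as separate arguments instead of a list:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# toy data for illustration; the column name matches the question
df_spark = spark.createDataFrame(
    [('App Opened',), ('App Launched',), ('App Closed',)],
    ['name'],
)

# equivalent ways to keep only the listed event names
df_start = df_spark.filter(col('name').isin(['App Opened', 'App Launched']))
df_start = df_spark.filter(df_spark.name.isin('App Opened', 'App Launched'))  # varargs form
df_start.show()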