Filter PySpark DataFrame by checking if string appears in column

I'm new to Spark and playing around with filtering. I have a pyspark.sql DataFrame created by reading in a json file. A part of the schema is shown below:

root
 |-- authors: array (nullable = true)
 |    |-- element: string (containsNull = true)

I would like to filter this DataFrame, selecting all of the rows with entries pertaining to a particular author. Whether that author is listed first in authors or nth, the row should be included if their name appears. So I want something along the lines of

df.filter(df['authors'].getItem(i)=='Some Author')

where i iterates over all of the authors in that row, and the number of authors varies from row to row.

I tried implementing the solution given in PySpark DataFrames: filter where some value is in array column, but it gives me

ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling

Is there a succinct way to implement this filter?

asked Sep 19 '17 by Dan McCabe

People also ask

How do you check if a column contains a string in PySpark?

In Spark and PySpark, the contains() function checks whether a column value contains a given literal string (it matches on part of the string). It is mostly used to filter rows of a DataFrame.
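
A minimal sketch of contains() on a plain string column (the column name and data here are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# hypothetical DataFrame with a plain string column "title"
df = spark.createDataFrame([("Spark in Action",), ("Learning SQL",)], ["title"])

# contains() matches on part of the string value
df.filter(col("title").contains("Spark")).show()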

How do I filter specific columns in PySpark DataFrame?

You can select one or more columns of a DataFrame by passing the column names you want to the select() function. Since DataFrames are immutable, this returns a new DataFrame containing only the selected columns.
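
A minimal sketch of select() (the column names and data are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical DataFrame with three columns
df = spark.createDataFrame([("author 1", "Title A", 2001)], ["author", "title", "year"])

# select() returns a new DataFrame containing only the named columns
df.select("author").show()
df.select("author", "year").show()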

How do you check if a value is in a PySpark DataFrame?

In Spark, the isin() function checks whether a DataFrame column's value exists in a list/array of values. To express IS NOT IN, negate the result of isin() with the NOT operator (~).
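
A minimal sketch of isin() and its negation (the values are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("author 1",), ("author 2",), ("author 3",)], ["author"])

# isin() keeps the rows whose value appears in the given list
df.filter(col("author").isin("author 1", "author 3")).show()

# negate with ~ to get an IS NOT IN filter
df.filter(~col("author").isin("author 1", "author 3")).show()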

What does show () do in PySpark?

DataFrame show() displays the contents of a DataFrame in a row-and-column table format. By default it shows only 20 rows, and column values are truncated at 20 characters.
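
A minimal sketch of show() and its two main parameters (the sample data is made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# by default show() prints up to 20 rows and truncates values at 20 characters
df.show()

# both limits can be adjusted explicitly
df.show(n=5, truncate=False)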


1 Answer

You can use the pyspark.sql.functions.array_contains method:

df.filter(array_contains(df['authors'], 'Some Author'))

Here is a reproducible example:

from pyspark.sql.types import *
from pyspark.sql.functions import array_contains

# sample data: each row holds an array of author names
lst = [(["author 1", "author 2"],), (["author 2"],), (["author 1"],)]
schema = StructType([StructField("authors", ArrayType(StringType()), True)])
df = spark.createDataFrame(lst, schema)
df.show()
+--------------------+
|             authors|
+--------------------+
|[author 1, author 2]|
|          [author 2]|
|          [author 1]|
+--------------------+

df.printSchema()
root
 |-- authors: array (nullable = true)
 |    |-- element: string (containsNull = true)

# keep only the rows whose authors array contains "author 1"
df.filter(array_contains(df.authors, "author 1")).show()
+--------------------+
|             authors|
+--------------------+
|[author 1, author 2]|
|          [author 1]|
+--------------------+
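
As an aside, filter() also accepts a SQL expression string, so the same condition can be written as:

df.filter("array_contains(authors, 'author 1')").show()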
answered Oct 11 '22 by Psidom