I'm new to Spark and playing around with filtering. I have a pyspark.sql DataFrame created by reading in a json file. A part of the schema is shown below:
root
|-- authors: array (nullable = true)
| |-- element: string (containsNull = true)
I would like to filter this DataFrame, selecting all of the rows with entries pertaining to a particular author. Whether this author is the first one listed in authors or the nth, the row should be included if their name appears. So I want something along the lines of
df.filter(df['authors'].getItem(i) == 'Some Author')
where i iterates over all of the authors in that row, and the number of authors is not constant across rows.
I tried implementing the solution given in PySpark DataFrames: filter where some value is in array column, but it gives me
ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling
Is there a succinct way to implement this filter?
You can use the pyspark.sql.functions.array_contains function:
df.filter(array_contains(df['authors'], 'Some Author'))
from pyspark.sql.types import ArrayType, StringType, StructField, StructType
from pyspark.sql.functions import array_contains

# Build a small demo DataFrame with an array<string> column
# (assumes an active SparkSession named `spark`, as in the pyspark shell).
lst = [(["author 1", "author 2"],), (["author 2"],), (["author 1"],)]
schema = StructType([StructField("authors", ArrayType(StringType()), True)])
df = spark.createDataFrame(lst, schema)
df.show()
+--------------------+
| authors|
+--------------------+
|[author 1, author 2]|
| [author 2]|
| [author 1]|
+--------------------+
df.printSchema()
root
|-- authors: array (nullable = true)
| |-- element: string (containsNull = true)
df.filter(array_contains(df.authors, "author 1")).show()
+--------------------+
| authors|
+--------------------+
|[author 1, author 2]|
| [author 1]|
+--------------------+
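For completeness, here are two small variants that are not in the answer above but use only standard PySpark APIs: the same predicate written as a SQL expression string (filter() also accepts a string condition), and the negated filter built with the ~ operator to get the rows that do not mention the author:

# Same filter written as a SQL expression string (no extra import needed).
df.filter("array_contains(authors, 'author 1')").show()

# Negated filter: keep the rows whose authors array does NOT contain the name.
df.filter(~array_contains(df.authors, "author 1")).show()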