Filter PySpark DataFrame by checking if string appears in column

I'm new to Spark and playing around with filtering. I have a pyspark.sql DataFrame created by reading in a json file. A part of the schema is shown below:

root
 |-- authors: array (nullable = true)
 |    |-- element: string (containsNull = true)

I would like to filter this DataFrame, selecting all of the rows with entries pertaining to a particular author. Whether that author is listed first in authors or nth, the row should be included if their name appears. So I want something along the lines of

df.filter(df['authors'].getItem(i)=='Some Author')

where i iterates over all of the authors in that row, and the number of authors varies from row to row.

I tried implementing the solution given in PySpark DataFrames: filter where some value is in array column, but it gives me

ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling

Is there a succinct way to implement this filter?

asked Sep 19 '17 by Dan McCabe

People also ask

How do you check if a column contains a string in PySpark?

In Spark and PySpark, the contains() function checks whether a column value contains a given literal string (it matches on part of the string). It is mostly used to filter rows of a DataFrame.
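
A minimal sketch of contains() on a plain string column (the column name and data here are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# hypothetical DataFrame with a plain string column "title"
df = spark.createDataFrame([("Spark in Action",), ("Learning SQL",)], ["title"])

# contains() matches on part of the string value
df.filter(col("title").contains("Spark")).show()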

How do I filter specific columns in PySpark DataFrame?

You can select one or more columns of a DataFrame by passing the column names you want to the select() function. Since DataFrames are immutable, this returns a new DataFrame containing only the selected columns.
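
A minimal sketch of select() (the column names and data are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical DataFrame with three columns
df = spark.createDataFrame([("author 1", "Title A", 2001)], ["author", "title", "year"])

# select() returns a new DataFrame containing only the named columns
df.select("author").show()
df.select("author", "year").show()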

How do you check if a value is in a PySpark DataFrame?

In Spark, the isin() function checks whether a DataFrame column's value exists in a list/array of values. To express IS NOT IN, negate the result of isin() with the NOT operator (~).
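
A minimal sketch of isin() and its negation (the values are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("author 1",), ("author 2",), ("author 3",)], ["author"])

# isin() keeps the rows whose value appears in the given list
df.filter(col("author").isin("author 1", "author 3")).show()

# negate with ~ to get an IS NOT IN filter
df.filter(~col("author").isin("author 1", "author 3")).show()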

What does show () do in PySpark?

DataFrame show() displays the contents of a DataFrame in a row-and-column table format. By default it shows only 20 rows, and column values are truncated at 20 characters.
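
A minimal sketch of show() and its two main parameters (the sample data is made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# by default show() prints up to 20 rows and truncates values at 20 characters
df.show()

# both limits can be adjusted explicitly
df.show(n=5, truncate=False)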


1 Answer

You can use the pyspark.sql.functions.array_contains method:

df.filter(array_contains(df['authors'], 'Some Author'))

Here is a reproducible example:

from pyspark.sql.types import *
from pyspark.sql.functions import array_contains

# sample data: each row holds an array of author names
lst = [(["author 1", "author 2"],), (["author 2"],), (["author 1"],)]
schema = StructType([StructField("authors", ArrayType(StringType()), True)])
df = spark.createDataFrame(lst, schema)
df.show()
+--------------------+
|             authors|
+--------------------+
|[author 1, author 2]|
|          [author 2]|
|          [author 1]|
+--------------------+

df.printSchema()
root
 |-- authors: array (nullable = true)
 |    |-- element: string (containsNull = true)

# keep only the rows whose authors array contains "author 1"
df.filter(array_contains(df.authors, "author 1")).show()
+--------------------+
|             authors|
+--------------------+
|[author 1, author 2]|
|          [author 1]|
+--------------------+
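
As an aside, filter() also accepts a SQL expression string, so the same condition can be written as:

df.filter("array_contains(authors, 'author 1')").show()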
answered Oct 11 '22 by Psidom