 

How to check if a Spark data frame struct Array contains a specific value

I have a data frame with the following schema:

My requirement is to filter the rows where a given field, such as city, matches in any of the address array elements. I can access an individual field like loyaltyMember.address[0].city, but I need to check all elements of the address array for a match. How can I achieve that in Spark SQL? I couldn't use the array_contains function since the array is of a complex (struct) type.

root
 |-- loyaltyMember: struct (nullable = true)
 |    |-- Name: string (nullable = true)
 |    |-- address: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- addressType: string (nullable = true)
 |    |    |    |-- city: string (nullable = true)
 |    |    |    |-- countryCode: string (nullable = true)
 |    |    |    |-- postalCode: string (nullable = true)
 |    |    |    |-- street: string (nullable = true)
asked Sep 05 '19 by Despicable me

People also ask

How do you check if a column contains a particular value in PySpark?

In Spark and PySpark, the contains() function is used to check whether a column value contains a literal string (it matches on part of the string); it is mostly used to filter rows of a DataFrame.
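For example, a minimal sketch, assuming a hypothetical DataFrame df with a string column named city:

from pyspark.sql.functions import col

# Keep only rows whose city value contains the substring 'Prag' (e.g. 'Prague').
df.filter(col('city').contains('Prag')).show()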

How do you filter an array in PySpark?

If you are coming from a SQL background, you can use where() instead of filter() to filter the rows of a DataFrame based on a given condition or SQL expression. Both functions behave exactly the same; where() is simply an alias for filter().
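For instance, assuming the same hypothetical df with a city column, the two forms below are interchangeable:

# filter() and where() are aliases; both accept a Column condition or a SQL expression string.
df.filter(df.city == 'Prague').show()
df.where("city = 'Prague'").show()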

How do you check if a value is in a PySpark DataFrame?

In Spark, use the isin() function of the Column class to check whether a DataFrame column value is contained in a list of values.
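A short sketch, again assuming a hypothetical df with a city column:

from pyspark.sql.functions import col

# Keep rows whose city value appears in the given list of values.
df.filter(col('city').isin('Prague', 'Berlin', 'Vienna')).show()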

What does describe () do in PySpark?

The DESCRIBE FUNCTION statement returns the basic metadata of an existing function, including the function name, implementing class, and usage details. If the optional EXTENDED keyword is specified, extended usage information is returned along with the basic metadata.
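For example, assuming an active SparkSession named spark, the statement can be run through spark.sql():

# Basic metadata for the built-in array_contains function.
spark.sql("DESCRIBE FUNCTION array_contains").show(truncate=False)

# Extended usage information, including examples.
spark.sql("DESCRIBE FUNCTION EXTENDED array_contains").show(truncate=False)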


1 Answer

I believe you can still use array_contains as follows (in PySpark):

from pyspark.sql.functions import col, array_contains

df.filter(array_contains(col('loyaltyMember.address.city'), 'Prague'))

This keeps all rows whose address array contains an element with city equal to 'Prague'. Accessing loyaltyMember.address.city on an array of structs yields an array of the city values, which is why array_contains can be applied to it directly.
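Since the question mentions Spark SQL, the same check can also be expressed in SQL; a sketch assuming the DataFrame is registered as a temporary view named members (a hypothetical name):

# Register the DataFrame as a view so it can be queried with Spark SQL.
df.createOrReplaceTempView("members")

spark.sql("""
    SELECT *
    FROM members
    WHERE array_contains(loyaltyMember.address.city, 'Prague')
""").show(truncate=False)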

answered Sep 22 '22 by David Vrba