 

How to filter based on array value in PySpark?

My Schema:

 |-- Canonical_URL: string (nullable = true)
 |-- Certifications: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Certification_Authority: string (nullable = true)
 |    |    |-- End: string (nullable = true)
 |    |    |-- License: string (nullable = true)
 |    |    |-- Start: string (nullable = true)
 |    |    |-- Title: string (nullable = true)
 |-- CompanyId: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- vendorTags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- score: double (nullable = true)
 |    |    |-- vendor: string (nullable = true)

I tried the query below to select nested fields from vendorTags:

df3 = sqlContext.sql("select vendorTags.vendor from globalcontacts")

How can I query the nested fields in a where clause in PySpark, like below?

df3 = sqlContext.sql("select vendorTags.vendor from globalcontacts where vendorTags.vendor = 'alpha'")

or

df3 = sqlContext.sql("select vendorTags.vendor from globalcontacts where vendorTags.score > 123.123456")

something like this.

I tried the above queries, only to get the error below:

df3 = sqlContext.sql("select vendorTags.vendor from globalcontacts where vendorTags.vendor = 'alpha'")
16/03/15 13:16:02 INFO ParseDriver: Parsing command: select vendorTags.vendor from globalcontacts where vendorTags.vendor = 'alpha'
16/03/15 13:16:03 INFO ParseDriver: Parse Completed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/context.py", line 583, in sql
    return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
  File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 51, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve '(vendorTags.vendor = cast(alpha as double))' due to data type mismatch: differing types in '(vendorTags.vendor = cast(alpha as double))' (array<string> and double).; line 1 pos 71"
asked Mar 15 '16 by Suhas Chandramouli

People also ask

Which method is used to filter DataFrame values in PySpark?

where() is a method used to filter the rows of a DataFrame based on a given condition. The where() method is an alias for the filter() method; both operate exactly the same.
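
For instance, a minimal sketch (assuming a DataFrame df with the Country column from the schema above; 'US' is just an illustrative value):

# Keep only rows whose Country column equals 'US'
df.where(df.Country == 'US').show()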

How do you filter out rows in PySpark?

The PySpark filter() function is used to filter rows from an RDD/DataFrame based on a given condition or SQL expression. You can also use the where() clause instead of filter() if you are coming from an SQL background; both functions operate exactly the same.
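
A small sketch (CompanyId comes from the schema above; the literal value is made up for illustration):

# filter() accepts either a Column expression or an SQL expression string
df.filter("CompanyId = '12345'").show()
df.filter(df.CompanyId == '12345').show()  # equivalent Column form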

How do I use the isin() function in PySpark?

PySpark's isin() (the IN operator) is used to check or filter whether DataFrame values are contained in a list of values. isin() is a function of the Column class; it returns True if the value of the expression is contained in the evaluated values of the arguments.
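
A minimal sketch (again using the Country column from the schema above; the country codes are illustrative):

# Keep rows whose Country is any of the listed values
df.where(df.Country.isin("US", "IN", "DE")).show()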

What is withColumn in PySpark?

In PySpark, the withColumn() function is a widely used DataFrame transformation for changing a column's values, converting the datatype of an existing column, creating a new column, and so on.
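
For example, a minimal sketch (upper-casing the Country column from the schema above; the transformation itself is just illustrative):

from pyspark.sql.functions import upper

# Replace the Country column with an upper-cased version of itself
df = df.withColumn("Country", upper(df.Country))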


2 Answers

For equality-based queries you can use array_contains:

df = sc.parallelize([(1, [1, 2, 3]), (2, [4, 5, 6])]).toDF(["k", "v"])
df.createOrReplaceTempView("df")

# With SQL
sqlContext.sql("SELECT * FROM df WHERE array_contains(v, 1)")

# With DSL
from pyspark.sql.functions import array_contains
df.where(array_contains("v", 1))
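
Applied to the schema from the question, that would look roughly like this (a sketch; vendorTags.vendor resolves to an array<string>, as the error message above also shows, and globalcontacts is the temp table from the question):

# Rows where any vendorTags element has vendor == 'alpha'
sqlContext.sql(
    "SELECT vendorTags.vendor FROM globalcontacts "
    "WHERE array_contains(vendorTags.vendor, 'alpha')")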

If you want to use more complex predicates you'll have to either explode or use a UDF, for example something like this:

from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf 

def exists(f):
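    # Wrap a plain Python predicate f in a UDF that returns True
    # when any element of the array column satisfies it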
    return udf(lambda xs: any(f(x) for x in xs), BooleanType())

df.where(exists(lambda x: x > 3)("v"))
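
For the question's schema, the same helper could be applied to the score predicate roughly like this (a sketch; df here stands for the question's globalcontacts DataFrame, and vendorTags.score resolves to an array of doubles):

# Rows where at least one vendorTags entry has score > 123.123456
df.where(exists(lambda x: x > 123.123456)("vendorTags.score"))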

In Spark 2.4 or later it is also possible to use higher-order functions:

from pyspark.sql.functions import expr

df.where(expr("""aggregate(
    transform(v, x -> x > 3),
    false, 
    (x, y) -> x or y
)"""))

or

df.where(expr("""
    exists(v, x -> x > 3)
"""))

Python wrappers should be available in 3.1 (SPARK-30681).
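
With those wrappers, the last query could be written roughly as follows (a sketch, assuming Spark 3.1+ where pyspark.sql.functions.exists is available; it is imported under an alias here only to avoid clashing with the exists helper defined above):

from pyspark.sql.functions import exists as array_exists

# exists(col, f) returns True when any element of the array satisfies the predicate
df.where(array_exists("v", lambda x: x > 3))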

answered by zero323

In Spark 2.4 you can filter array values using the filter function in the SQL API.

https://spark.apache.org/docs/2.4.0/api/sql/index.html#filter

Here's an example in PySpark. In the example we filter out all array values that are empty strings:

from pyspark.sql.functions import expr
df = df.withColumn("ArrayColumn", expr("filter(ArrayColumn, x -> x != '')"))
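
Applied to the question's schema, the same function can trim an array of structs, for example keeping only the vendorTags entries whose score exceeds the threshold from the question (a sketch; df stands for the question's globalcontacts DataFrame):

# Keep only vendorTags elements whose score field is above the threshold
df = df.withColumn(
    "vendorTags",
    expr("filter(vendorTags, x -> x.score > 123.123456)"))
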
answered by Jack