I am able to filter a Spark DataFrame (in PySpark) based on whether a particular value exists within an array column by doing the following:
from pyspark.sql.functions import array_contains
spark_df.filter(array_contains(spark_df.array_column_name, "value that I want")).show()
Is there a way to get the index of where in the array the item was found? It seems like that should exist, but I am not finding it. Thank you.
In Spark 2.4+, there's the array_position function:
df = spark.createDataFrame([(["c", "b", "a"],), ([],)], ['data'])
df.show()
#+---------+
#|     data|
#+---------+
#|[c, b, a]|
#|       []|
#+---------+
from pyspark.sql.functions import array_position
df.select(df.data, array_position(df.data, "a").alias('a_pos')).show()
#+---------+-----+
#|     data|a_pos|
#+---------+-----+
#|[c, b, a]|    3|
#|       []|    0|
#+---------+-----+
Notes from the docs:
Locates the position of the first occurrence of the given value in the given array.
The position is not zero-based, but 1-based. Returns 0 if the given value could not be found in the array.
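To tie this back to the original question, you can combine array_contains with array_position: filter to rows that contain the value, then derive the index. This is a minimal sketch building on the df above; the a_idx alias is just an illustrative name, and subtracting 1 converts the 1-based position to the zero-based index Python code usually expects:
from pyspark.sql.functions import array_contains, array_position

# Keep only rows whose array contains "a", then compute a zero-based index.
# array_position is 1-based, so subtract 1 (safe here because the filter
# guarantees the value is present and the position is never 0).
(df.filter(array_contains(df.data, "a"))
   .select(df.data, (array_position(df.data, "a") - 1).alias("a_idx"))
   .show())
#+---------+-----+
#|     data|a_idx|
#+---------+-----+
#|[c, b, a]|    2|
#+---------+-----+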