
Get index of item in array that is a column in a Spark dataframe

I am able to filter a Spark DataFrame (in PySpark) based on whether a particular value exists within an array field by doing the following:

from pyspark.sql.functions import array_contains
spark_df.filter(array_contains(spark_df.array_column_name, "value that I want")).show() 

Is there a way to get the index of where in the array the item was found? It seems like that should exist, but I am not finding it. Thank you.

asked Dec 12 '18 19:12 by user1624577




1 Answer

In Spark 2.4+, there is the array_position function:

df = spark.createDataFrame([(["c", "b", "a"],), ([],)], ['data'])
df.show()
#+---------+
#|     data|
#+---------+
#|[c, b, a]|
#|       []|
#+---------+

from pyspark.sql.functions import array_position
df.select(df.data, array_position(df.data, "a").alias('a_pos')).show()
#+---------+-----+
#|     data|a_pos|
#+---------+-----+
#|[c, b, a]|    3|
#|       []|    0|
#+---------+-----+

Notes from the docs:

  1. Locates the position of only the first occurrence of the given value in the given array.

  2. The position is 1-based, not 0-based. Returns 0 if the given value cannot be found in the array.
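If you need a conventional 0-based index afterwards, the two notes above can be modeled in plain Python as a rough sketch. The helper names `array_position_model` and `to_zero_based` are made up for illustration and are not part of PySpark; they only mirror the documented behavior (first occurrence, 1-based, 0 when missing) so the conversion is easy to reason about without a Spark session.

```python
from typing import Optional, Sequence


def array_position_model(values: Sequence[str], target: str) -> int:
    """Hypothetical plain-Python model of the documented semantics:
    1-based position of the first occurrence, or 0 if not found."""
    for position, value in enumerate(values, start=1):
        if value == target:
            return position
    return 0


def to_zero_based(position: int) -> Optional[int]:
    """Convert a 1-based position to a 0-based index; 0 (not found) becomes None."""
    return position - 1 if position > 0 else None


print(array_position_model(["c", "b", "a"], "a"))  # 3
print(array_position_model([], "a"))               # 0
print(array_position_model(["a", "b", "a"], "a"))  # 1 (first occurrence only)
print(to_zero_based(3))                            # 2
print(to_zero_based(0))                            # None
```

On the Spark side, subtracting 1 from the array_position column would give a 0-based index, with -1 for rows where the value is missing, so you may want to handle that case explicitly.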

answered Oct 11 '22 20:10 by Psidom