I am able to filter a Spark DataFrame (in PySpark) based on whether a particular value exists within an array column by doing the following:
from pyspark.sql.functions import array_contains
spark_df.filter(array_contains(spark_df.array_column_name, "value that I want")).show()
Is there a way to get the index of where in the array the item was found? It seems like that should exist, but I am not finding it. Thank you.
In Spark 2.4+, there's the array_position function:
df = spark.createDataFrame([(["c", "b", "a"],), ([],)], ['data'])
df.show()
#+---------+
#|     data|
#+---------+
#|[c, b, a]|
#|       []|
#+---------+
from pyspark.sql.functions import array_position
df.select(df.data, array_position(df.data, "a").alias('a_pos')).show()
#+---------+-----+
#|     data|a_pos|
#+---------+-----+
#|[c, b, a]|    3|
#|       []|    0|
#+---------+-----+
Notes from the docs:
Locates the position of the first occurrence of the given value in the given array.
The position is not zero-based, but 1-based. Returns 0 if the given value could not be found in the array.
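To tie this back to the original question, you can combine array_contains with array_position: filter to rows that contain the value, then derive the index. This is a minimal sketch building on the df above; the a_idx alias is just an illustrative name, and subtracting 1 converts the 1-based position to the zero-based index Python code usually expects:
from pyspark.sql.functions import array_contains, array_position

# Keep only rows whose array contains "a", then compute a zero-based index.
# array_position is 1-based, so subtract 1 (safe here because the filter
# guarantees the value is present and the position is never 0).
(df.filter(array_contains(df.data, "a"))
   .select(df.data, (array_position(df.data, "a") - 1).alias("a_idx"))
   .show())
#+---------+-----+
#|     data|a_idx|
#+---------+-----+
#|[c, b, a]|    2|
#+---------+-----+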