I have a DataFrame with a single column. Each row of that column holds an array of String values.
Values in my Spark 2.2 DataFrame:
["123", "abc", "2017", "ABC"]
["456", "def", "2001", "ABC"]
["789", "ghi", "2017", "DEF"]

org.apache.spark.sql.DataFrame = [col: array<string>]

root
 |-- col: array (nullable = true)
 |    |-- element: string (containsNull = true)
What is the best way to access elements in the array? For example, I would like to extract the distinct values of the fourth element for rows where the year is 2017 (expected result: "ABC", "DEF").
You can use an array subscript (or index) to access any element stored in an array. Subscripts start at 0, so arr[0] is the first element of the array arr and, in general, arr[n-1] is the nth element.
Accessing an element of an array: to access an individual element, use the array name followed by the index of the element in square brackets. Array indices start at 0 and end at size-1, so array_name[index] accesses the element at position index, counting from zero.
In MongoDB, to query whether an array field contains at least one element with a specified value, use the filter { <field>: <value> }, where <value> is the element value. To specify conditions on the elements in the array field, use query operators in the query filter document: { <array field>: { <operator1>: <value1>, ... } }
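Since Spark 2.4.0, there is a new function, element_at($array_column, $index). Note that its index is 1-based, unlike getItem, and a negative index counts from the end of the array.
See the Spark docs.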
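As a minimal sketch of how element_at could be applied to the data in the question, assuming Spark 2.4 or later (so it would not run on the asker's Spark 2.2): the session setup, the rebuilt sample data, and the names viaElementAt and value are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.element_at

// In spark-shell a session already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("element-at-sketch").getOrCreate()
import spark.implicits._

// Rebuild the sample rows from the question as one array<string> column named "col".
val df = Seq(
  Seq("123", "abc", "2017", "ABC"),
  Seq("456", "def", "2001", "ABC"),
  Seq("789", "ghi", "2017", "DEF")
).toDF("col")

// element_at is 1-based: position 3 is the year, position 4 is the fourth element.
val viaElementAt = df
  .where(element_at($"col", 3) === "2017")
  .select(element_at($"col", 4).as("value"))
  .distinct()

viaElementAt.show()
```

The distinct() call is what collapses repeated values; the order of the resulting rows is not guaranteed.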
df.where($"col".getItem(2) === lit("2017")).select($"col".getItem(3))
See getItem in the Column API docs: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column
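The getItem expression can be tried end to end with a sketch along these lines. It rebuilds the question's sample data and adds distinct(), since the question asks for distinct values; the session setup and the names viaGetItem and value are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// In spark-shell a session already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("get-item-sketch").getOrCreate()
import spark.implicits._

// Rebuild the sample rows from the question as one array<string> column named "col".
val df = Seq(
  Seq("123", "abc", "2017", "ABC"),
  Seq("456", "def", "2001", "ABC"),
  Seq("789", "ghi", "2017", "DEF")
).toDF("col")

// getItem is 0-based: index 2 is the year, index 3 is the fourth element.
val viaGetItem = df
  .where($"col".getItem(2) === lit("2017"))
  .select($"col".getItem(3).as("value"))
  .distinct()

viaGetItem.show()
```

Because getItem is available on Column in Spark 2.2, this variant should suit the version mentioned in the question. Row order in the result is not guaranteed.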