I am using Spark with Java and I have a dataframe like this:
id | array_column
-------------------
12 | [a:123, b:125, c:456]
13 | [a:443, b:225, c:126]
I want to explode array_column while keeping the same id, but a plain explode doesn't give me what I need, because I want the dataframe to become:
id | a | b | c
-------------------
12 |123 |125 | 456
13 |443 |225 | 126
The following approach works for variable-length lists in array_column. It uses explode to expand the list of string elements in array_column, then splits each string element on ":" into two columns, col_name and col_val. Finally, a pivot combined with a group-by transposes the data into the desired format.
The following example uses the PySpark API, but it translates easily to the Java/Scala APIs since they are similar. I assume your dataset is in a dataframe named input_df:
from pyspark.sql import functions as F

output_df = (
    # one row per array element, keeping the id
    input_df.select("id", F.explode("array_column").alias("acol"))
    .select(
        "id",
        # split "a:123" on ":" into a column name and an integer value
        F.split("acol", ":")[0].alias("col_name"),
        F.split("acol", ":")[1].cast("integer").alias("col_val"),
    )
    # transpose: one output column per distinct col_name
    .groupBy("id")
    .pivot("col_name")
    .max("col_val")
)
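If you want to sanity-check what the explode → split → pivot pipeline computes without spinning up a Spark session, here is a plain-Python sketch of the same logic on the example rows (the dict output stands in for the pivoted dataframe; the row data is taken from the question):

```python
# Rows mirroring the example dataframe: (id, array_column)
rows = [
    (12, ["a:123", "b:125", "c:456"]),
    (13, ["a:443", "b:225", "c:126"]),
]

result = {}
for row_id, array_column in rows:
    # "explode": one record per array element
    for element in array_column:
        # "split" on ":" into col_name and col_val
        col_name, col_val = element.split(":")
        # "groupBy + pivot + max": keep the max value per (id, col_name)
        cell = result.setdefault(row_id, {})
        cell[col_name] = max(cell.get(col_name, int(col_val)), int(col_val))

print(result)
# {12: {'a': 123, 'b': 125, 'c': 456}, 13: {'a': 443, 'b': 225, 'c': 126}}
```

Since each (id, col_name) pair occurs only once in the example data, max is just a way to satisfy pivot's requirement for an aggregate; any of max/min/first would give the same result here.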
Let me know if this works for you.