
Spark explode array column to columns

I am using Spark with Java and I have a dataframe like this:

id  | array_column
-------------------
12  | [a:123, b:125, c:456]
13  | [a:443, b:225, c:126]

I want to explode array_column while keeping the same id, but a plain explode alone doesn't do what I need, because I want the dataframe to become:

id  | a  | b  | c
-------------------
12  |123 |125 | 456 
13  |443 |225 | 126
Ofir asked Mar 01 '23 10:03


1 Answer

The following approach will work on variable-length lists in array_column. It uses explode to expand the list of string elements in array_column, then splits each string element on : into two columns, col_name and col_val. Finally, a groupBy with pivot transposes the data into the desired format.

The following example uses the PySpark API, but it can easily be translated to the Java/Scala APIs, as they are similar. I have assumed your dataset is in a dataframe named input_df.

from pyspark.sql import functions as F

output_df = (
    input_df.select("id",F.explode("array_column").alias("acol"))
            .select(
                "id",
                F.split("acol",":")[0].alias("col_name"),
                F.split("acol",":")[1].cast("integer").alias("col_val")
            )
            .groupBy("id")
            .pivot("col_name")
            .max("col_val")
)
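
Since you are using Spark with Java, here is a sketch of the same pipeline in the Java API. It assumes your data is already in a Dataset&lt;Row&gt; named inputDf with the columns id and array_column (an array of "key:value" strings); the class and method names are just placeholders.

```java
// Hypothetical Java translation of the PySpark pipeline above.
// Assumes inputDf has columns "id" and "array_column" (array of "key:value" strings).
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class ExplodeAndPivot {
    public static Dataset<Row> transform(Dataset<Row> inputDf) {
        return inputDf
            // one row per array element, keeping the id
            .select(col("id"), explode(col("array_column")).alias("acol"))
            // split "key:value" into a name column and an integer value column
            .select(
                col("id"),
                split(col("acol"), ":").getItem(0).alias("col_name"),
                split(col("acol"), ":").getItem(1).cast("integer").alias("col_val"))
            // transpose: one output column per distinct col_name
            .groupBy("id")
            .pivot("col_name")
            .max("col_val");
    }
}
```

As in the PySpark version, max is only an aggregation placeholder: each (id, col_name) pair occurs once, so max simply picks that single value.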

Let me know if this works for you.

ggordon answered Mar 10 '23 09:03