Spark agg to collect a single list for multiple columns

Here is my current code:

pipe_exec_df_final_grouped = pipe_exec_df_final.groupBy("application_id").agg(collect_list("table_name").alias("tables"))

However, I would like multiple column values in my collected list, so the aggregated column would be an array of arrays. Currently the result looks like this:

1|[a,b,c,d]
2|[e,f,g,h]

However, I would also like to keep another column attached to the aggregation (let's call it the 'status' column). So my new output would be:

1|[[a,pass],[b,fail],[c,fail],[d,pass]]
...

I tried collect_list("table_name, status"), but collect_list only takes one column name. How can I accomplish what I am trying to do?

test acc asked Sep 10 '25 13:09
1 Answer

Use array to combine the columns into a single array column first, then apply collect_list to that:

df.groupBy(...).agg(collect_list(array("table_name", "status")))
Psidom answered Sep 13 '25 04:09