Spark agg to collect a single list for multiple columns

Here is my current code:

pipe_exec_df_final_grouped = pipe_exec_df_final.groupBy("application_id").agg(collect_list("table_name").alias("tables"))

However, I would like multiple column values in my collected list, so the aggregated column would be an array of arrays. Currently the result looks like this:

1|[a,b,c,d]
2|[e,f,g,h]

However, I would also like to keep another column attached to the aggregation (let's call it the 'status' column). So my new output would be:

1|[[a,pass],[b,fail],[c,fail],[d,pass]]
...

I tried collect_list("table_name, status"), but collect_list only takes one column name. How can I accomplish what I am trying to do?

test acc asked Sep 10 '25 13:09
1 Answer

Use array to combine the columns into a single array column first, then apply collect_list to that:

df.groupBy(...).agg(collect_list(array("table_name", "status")))
Psidom answered Sep 13 '25 04:09