Here is my current code:
pipe_exec_df_final_grouped = pipe_exec_df_final.groupBy("application_id").agg(collect_list("table_name").alias("tables"))
However, in my collected list, I would like multiple column values, so the aggregated column would be an array of arrays. Currently the result look like this:
1|[a,b,c,d]
2|[e,f,g,h]
However, I would also like to keep another column attached to the aggragation (lets call it 'status' column name). So my new output would be:
1|[[a,pass],[b,fail],[c,fail],[d,pass]]
...
I tried collect_list("table_name, status")
however collect_list
only takes one column name. How can I accomplish what I am trying to do?
Use array
to collect columns into an array column first, then apply collect_list
:
df.groupBy(...).agg(collect_list(array("table_name", "status")))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With