Spark GroupBy agg collect_list multiple columns

I have a question similar to this one, but the columns to pass to collect_list are given by a list of names. For example:

scala> w.show
+---+-----+----+-----+
|iid|event|date|place|
+---+-----+----+-----+
|  A|   D1|  T0|   P1|
|  A|   D0|  T1|   P2|
|  B|   Y1|  T0|   P3|
|  B|   Y2|  T2|   P3|
|  C|   H1|  T0|   P5|
|  C|   H0|  T9|   P5|
|  B|   Y0|  T1|   P2|
|  B|   H1|  T3|   P6|
|  D|   H1|  T2|   P4|
+---+-----+----+-----+


scala> val combList = List("event", "date", "place")
combList: List[String] = List(event, date, place)

scala> val v = w.groupBy("iid").agg(collect_list(combList(0)), collect_list(combList(1)), collect_list(combList(2)))
v: org.apache.spark.sql.DataFrame = [iid: string, collect_list(event): array<string> ... 2 more fields]

scala> v.show
+---+-------------------+------------------+-------------------+
|iid|collect_list(event)|collect_list(date)|collect_list(place)|
+---+-------------------+------------------+-------------------+
|  B|   [Y1, Y2, Y0, H1]|  [T0, T2, T1, T3]|   [P3, P3, P2, P6]|
|  D|               [H1]|              [T2]|               [P4]|
|  C|           [H1, H0]|          [T0, T9]|           [P5, P5]|
|  A|           [D1, D0]|          [T0, T1]|           [P1, P2]|
+---+-------------------+------------------+-------------------+

Is there any way to apply collect_list to multiple columns inside agg without knowing the number of elements in combList in advance?

asked Feb 13 '18 04:02

Jonathan

1 Answer

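You can wrap the columns in a struct and collect that instead: collect_list(struct(col1, col2)) AS elements. This gives one array of structs per group rather than one array per column.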

Example:

// The query below assumes df is registered as a temporary view named "teste"
df.createOrReplaceTempView("teste")
df.select("cd_issuer", "cd_doc", "cd_item", "nm_item").printSchema
val outputDf = spark.sql("SELECT cd_issuer, cd_doc, collect_list(struct(cd_item, nm_item)) AS item FROM teste GROUP BY cd_issuer, cd_doc")
outputDf.printSchema

df
 |-- cd_issuer: string (nullable = true)
 |-- cd_doc: string (nullable = true)
 |-- cd_item: string (nullable = true)
 |-- nm_item: string (nullable = true)

outputDf
 |-- cd_issuer: string (nullable = true)
 |-- cd_doc: string (nullable = true)
 |-- item: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- cd_item: string (nullable = true)
 |    |    |-- nm_item: string (nullable = true)
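
For the DataFrame API case in the question, where the column names arrive as a list of unknown length, the following is a minimal sketch rather than a definitive solution. It shows two variants: one collect_list per name, passed to agg as varargs, and a single collect_list over a struct built from the whole list. The object name, the helper names perColumnLists and structList, the "_list" and "elements" aliases, and the local SparkSession are assumptions made for illustration. The name list is assumed to be non-empty.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Hypothetical helper object; the names here are illustrative, not from the original post.
object CollectListSketch {

  // Variant 1: one array column per name. agg takes (Column, Column*),
  // so the list is split into head and tail. Assumes names is non-empty.
  def perColumnLists(df: DataFrame, names: List[String]): DataFrame = {
    val aggs = names.map(c => collect_list(col(c)).as(s"${c}_list"))
    df.groupBy("iid").agg(aggs.head, aggs.tail: _*)
  }

  // Variant 2: a single array of structs, as in the SQL answer above,
  // with the struct fields built from the name list.
  def structList(df: DataFrame, names: List[String]): DataFrame =
    df.groupBy("iid").agg(collect_list(struct(names.map(col): _*)).as("elements"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("collect-list-sketch")
      .getOrCreate()
    import spark.implicits._

    // A few rows in the shape of the question's DataFrame w.
    val w = Seq(
      ("A", "D1", "T0", "P1"),
      ("A", "D0", "T1", "P2"),
      ("D", "H1", "T2", "P4")
    ).toDF("iid", "event", "date", "place")

    val combList = List("event", "date", "place")

    perColumnLists(w, combList).printSchema()
    structList(w, combList).printSchema()

    spark.stop()
  }
}
```

The first variant keeps the output shape from the question (one array per column). The second keeps the values of each row together in a single struct element, which matters if the arrays need to stay aligned row by row.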

answered Sep 16 '22 11:09

Rodrigo Fritsch

Tags: group-by, aggregate, spark-dataframe
