
Spark DataFrame aggregate column values by key into List

I have a DataFrame that looks like this:

+-----------------+-------+
|Id               | value |
+-----------------+-------+
|             1622| 139685|
|             1622| 182118|
|             1622| 127955|
|             3837|3224815|
|             1622| 727761|
|             1622| 155875|
|             3837|1504923|
|             1622| 139684|
+-----------------+-------+

And I want to turn it into:

+-----------------+-------------------------------------------+
|Id               | value                                     |
+-----------------+-------------------------------------------+
|             1622|139685,182118,127955,727761,155875,139684  |
|             3837|3224815,1504923                            |
+-----------------+-------------------------------------------+

Is this possible with DataFrame functions only, or do I need to convert it to an RDD?

C.A asked May 25 '16 14:05

1 Answer

It is possible with the DataFrame API. Try:

import org.apache.spark.sql.functions.{col, collect_list}

df.groupBy(col("Id"))
  .agg(collect_list(col("value")) as "value")

If you want a comma-separated String instead of an Array, try this:

import org.apache.spark.sql.functions.{col, collect_list, concat_ws}

df.groupBy(col("Id"))
  .agg(collect_list(col("value")) as "value")
  .withColumn("value", concat_ws(",", col("value")))
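To see what the group-and-join semantics look like on the sample data without a Spark cluster, here is a plain-Scala sketch of the same transformation on an in-memory collection (this is illustrative only, not Spark code):

```scala
// Sketch: group (Id, value) pairs by Id, then join each group's
// values with commas -- the same result collect_list + concat_ws gives.
val rows = Seq(
  (1622, 139685), (1622, 182118), (1622, 127955),
  (3837, 3224815), (1622, 727761), (1622, 155875),
  (3837, 1504923), (1622, 139684)
)

// groupBy on a Seq keeps each group's elements in encounter order
val grouped: Map[Int, String] =
  rows.groupBy(_._1)
      .map { case (id, pairs) => id -> pairs.map(_._2).mkString(",") }

// grouped(1622) == "139685,182118,127955,727761,155875,139684"
// grouped(3837) == "3224815,1504923"
```

Note that in Spark the order of elements produced by `collect_list` is not guaranteed after a shuffle, whereas the local sketch above preserves input order.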
David Griffin answered Sep 29 '22 04:09