Spark DataFrame groupBy multiple times

Tags:

scala

apache-spark

// col3 holds numeric values stored as strings
val df = (Seq((1, "a", "10"), (1, "b", "12"), (1, "c", "13"), (2, "a", "14"),
              (2, "c", "11"), (1, "b", "12"), (2, "c", "12"), (3, "r", "11")).
          toDF("col1", "col2", "col3"))

I have a Spark DataFrame with three columns:


+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   a|  10|
|   1|   b|  12|
|   1|   c|  13|
|   2|   a|  14|
|   2|   c|  11|
|   1|   b|  12|
|   2|   c|  12|
|   3|   r|  11|
+----+----+----+

I need to perform two levels of groupBy, as explained below.

Level 1: If I group by col1 and sum col3, I get two columns: (1) col1 and (2) sum(col3). I lose col2 here.

Level 2: If I then group by col1 and col2 and sum col3, I get three columns: (1) col1, (2) col2 and (3) sum(col3).

What I want is both sums (the level 1 sum(col3) and the level 2 sum(col3)) together in one final DataFrame.

How can I do this? Can anyone explain?

Spark: 1.6.2, Scala: 2.10


asked Jan 20 '17 19:01

Ramesh

1 Answer

One option is to compute the two sums separately and then join them back on col1:


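import org.apache.spark.sql.functions.sum

// level 2 sum per (col1, col2), joined with the level 1 sum per col1
(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
    join(df.groupBy("col1").agg(sum($"col3").as("sum_level1")), Seq("col1")).show)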

+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
|   2|   c|      23.0|      37.0|
|   2|   a|      14.0|      37.0|
|   1|   c|      13.0|      47.0|
|   1|   b|      24.0|      47.0|
|   3|   r|      11.0|      11.0|
|   1|   a|      10.0|      47.0|
+----+----+----------+----------+
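To sanity-check these numbers without a cluster, the same two-level aggregation can be sketched on plain Scala collections. This is only an illustration, not Spark code: the names rows, level1, level2, level1FromLevel2 and joined are made up for this sketch, and the sums are kept as Int instead of the doubles Spark prints. It also checks that the per-col1 total equals the sum of the per-(col1, col2) totals.

```scala
// Plain Scala, no Spark: `rows` mirrors the data in the question (hypothetical name).
val rows = Seq((1, "a", "10"), (1, "b", "12"), (1, "c", "13"), (2, "a", "14"),
               (2, "c", "11"), (1, "b", "12"), (2, "c", "12"), (3, "r", "11"))

// Level 2: sum of col3 per (col1, col2) pair.
val level2: Map[(Int, String), Int] =
  rows.groupBy(r => (r._1, r._2)).map { case (key, group) => key -> group.map(_._3.toInt).sum }

// Level 1: sum of col3 per col1, computed directly from the raw rows.
val level1: Map[Int, Int] =
  rows.groupBy(_._1).map { case (c1, group) => c1 -> group.map(_._3.toInt).sum }

// Level 1 again, this time derived from the level 2 sums.
val level1FromLevel2: Map[Int, Int] =
  level2.groupBy { case ((c1, _), _) => c1 }.map { case (c1, sums) => c1 -> sums.values.sum }

// "Join" both levels on col1, mimicking the DataFrame join, sorted for stable printing.
val joined: Seq[(Int, String, Int, Int)] =
  level2.toSeq.map { case ((c1, c2), s2) => (c1, c2, s2, level1(c1)) }.sorted

joined.foreach(println)
```

Each printed tuple is (col1, col2, sum_level2, sum_level1), so it can be compared row by row with the table above.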

Another option is to use a window function, since sum_level1 is simply the sum of sum_level2 within each col1 group:


import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"col1")

(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
    withColumn("sum_level1", sum($"sum_level2").over(w)).show)

+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
|   1|   c|      13.0|      47.0|
|   1|   b|      24.0|      47.0|
|   1|   a|      10.0|      47.0|
|   3|   r|      11.0|      11.0|
|   2|   c|      23.0|      37.0|
|   2|   a|      14.0|      37.0|
+----+----+----------+----------+
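The sums above show up as 24.0, 47.0 and so on because col3 is a string column, which Spark implicitly casts to double before summing. If integer sums are preferred, a minimal variant of the window approach could cast col3 first. This sketch assumes df is the DataFrame from the question in a spark-shell session; byCol1 and result are names chosen here for illustration. Also note that on Spark 1.x, as far as I know, window functions require the SQL context to be a HiveContext.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

// Assumption: `df` is the DataFrame defined in the question (e.g. in spark-shell).
val byCol1 = Window.partitionBy("col1")

// Wrapped in parentheses so the chain can be pasted into the REPL as one expression.
val result = (df
  .withColumn("col3", col("col3").cast("int"))                      // string -> int
  .groupBy("col1", "col2")
  .agg(sum(col("col3")).as("sum_level2"))                           // level 2: per (col1, col2)
  .withColumn("sum_level1", sum(col("sum_level2")).over(byCol1)))   // level 1: per col1

result.show()
```

With the cast in place, both sum columns come back as integral values (the sum of an integer column is a long in Spark) rather than doubles.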


answered Sep 30 '22 23:09

Psidom
