
How to calculate sum and count in a single groupBy?

Based on the following DataFrame:

val client = Seq((1,"A",10),(2,"A",5),(3,"B",56)).toDF("ID","Categ","Amnt")

+---+-----+----+
| ID|Categ|Amnt|
+---+-----+----+
|  1|    A|  10|
|  2|    A|   5|
|  3|    B|  56|
+---+-----+----+

I would like to obtain the number of IDs and the total amount per category:

+-----+-----+---------+
|Categ|count|sum(Amnt)|
+-----+-----+---------+
|    B|    1|       56|
|    A|    2|       15|
+-----+-----+---------+

Is it possible to do the count and the sum without having to do a join?

client.groupBy("Categ").count
  .join(client.withColumnRenamed("Categ","cat")
    .groupBy("cat")
    .sum("Amnt"), 'Categ === 'cat)
  .drop("cat")

Maybe something like this:

client.createOrReplaceTempView("client")
spark.sql("SELECT Categ, count(Categ), sum(Amnt) FROM client GROUP BY Categ").show()
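As a side note, the aggregates in that SQL can also be aliased so the output headers match the desired table; a minimal sketch (the alias names are my own choice, not part of the original attempt):

client.createOrReplaceTempView("client")
// One pass over the data: count rows and sum amounts per category.
spark.sql("SELECT Categ, count(ID) AS `count`, sum(Amnt) FROM client GROUP BY Categ").show()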
ulrich asked Nov 06 '16


1 Answer

I'm giving a different example than yours.

Multiple aggregate functions can be applied in a single groupBy like this; adapt it to your case:

// In 1.3.x, in order for the grouping column "department" to show up,
// it must be included explicitly as part of the agg function call.
df.groupBy("department").agg($"department", max("age"), sum("expense"))

// In 1.4+, grouping column "department" is included automatically.
df.groupBy("department").agg(max("age"), sum("expense"))
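For comparison (this spelling is my own addition, not from the answer), the same aggregation can also be written with the string-pair form of agg, which maps a column name to an aggregate function name:

// Map-style form: column name -> aggregate function name.
df.groupBy("department").agg("age" -> "max", "expense" -> "sum")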

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark: SparkSession = SparkSession
  .builder.master("local")
  .appName("MyGroup")
  .getOrCreate()
import spark.implicits._

val client: DataFrame = spark.sparkContext.parallelize(
  Seq((1,"A",10),(2,"A",5),(3,"B",56))
).toDF("ID","Categ","Amnt")

client.groupBy("Categ").agg(sum("Amnt"),count("ID")).show()

+-----+---------+---------+
|Categ|sum(Amnt)|count(ID)|
+-----+---------+---------+
|    B|       56|        1|
|    A|       15|        2|
+-----+---------+---------+
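If the headers should match the layout in the question exactly, the aggregates can be aliased; a small variation on the answer's code (row order of show() is not guaranteed):

client.groupBy("Categ")
  .agg(count("ID").as("count"), sum("Amnt"))
  .show()

// +-----+-----+---------+
// |Categ|count|sum(Amnt)|
// +-----+-----+---------+
// |    B|    1|       56|
// |    A|    2|       15|
// +-----+-----+---------+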
Ram Ghadiyaram answered Sep 17 '22