Based on the following DataFrame:
val client = Seq((1,"A",10),(2,"A",5),(3,"B",56)).toDF("ID","Categ","Amnt")

+---+-----+----+
| ID|Categ|Amnt|
+---+-----+----+
|  1|    A|  10|
|  2|    A|   5|
|  3|    B|  56|
+---+-----+----+
I would like to obtain the number of IDs and the total amount by category:
+-----+-----+---------+
|Categ|count|sum(Amnt)|
+-----+-----+---------+
|    B|    1|       56|
|    A|    2|       15|
+-----+-----+---------+
Is it possible to do the count and the sum without having to do a join? This is what I am currently doing:

client.groupBy("Categ").count
  .join(client.withColumnRenamed("Categ","cat")
    .groupBy("cat")
    .sum("Amnt"), 'Categ === 'cat)
  .drop("cat")
Maybe something like this:
client.createOrReplaceTempView("client")
spark.sql("SELECT Categ, count(Categ), sum(Amnt) FROM client GROUP BY Categ").show()
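To reproduce the exact header from the desired output (Categ, count, sum(Amnt)), an alias can be added to the count; a minimal sketch:

spark.sql("SELECT Categ, count(Categ) AS count, sum(Amnt) FROM client GROUP BY Categ").show()

Spark typically names the unaliased sum column sum(Amnt) after the expression itself, so only the count needs an alias.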
I'm giving a different example than yours. Multiple aggregate functions are possible like this; try it accordingly:
// In 1.3.x, in order for the grouping column "department" to show up,
// it must be included explicitly as part of the agg function call.
df.groupBy("department").agg($"department", max("age"), sum("expense"))

// In 1.4+, grouping column "department" is included automatically.
df.groupBy("department").agg(max("age"), sum("expense"))
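Applied to the client DataFrame from the question, the same single-pass pattern could look like this (a sketch, assuming Spark 1.4+; the as("count") alias is only there to mirror the header of the desired output):

// requires import org.apache.spark.sql.functions.{count, sum}
client.groupBy("Categ")
  .agg(count("ID").as("count"), sum("Amnt"))
  .show()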
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark: SparkSession = SparkSession
  .builder.master("local")
  .appName("MyGroup")
  .getOrCreate()

import spark.implicits._

val client: DataFrame = spark.sparkContext.parallelize(
  Seq((1,"A",10),(2,"A",5),(3,"B",56))
).toDF("ID","Categ","Amnt")

client.groupBy("Categ").agg(sum("Amnt"), count("ID")).show()
+-----+---------+---------+
|Categ|sum(Amnt)|count(ID)|
+-----+---------+---------+
|    B|       56|        1|
|    A|       15|        2|
+-----+---------+---------+
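Since both aggregates are computed inside a single groupBy, the data is aggregated in one pass; the self-join from the question, which scans and aggregates client twice, is not needed.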