
How to do count(*) within a spark dataframe groupBy

My intention is to do the equivalent of this basic SQL:

select shipgrp, shipstatus, count(*) cnt 
from shipstatus group by shipgrp, shipstatus
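
As an aside, that exact SQL can also be run unchanged by registering the DataFrame as a temporary view; a minimal sketch, assuming a SparkSession named spark and a DataFrame df with the shipgrp and shipstatus columns:

// register the DataFrame under a view name so the SQL's FROM clause resolves
df.createOrReplaceTempView("shipstatus")
val cnt = spark.sql(
  "select shipgrp, shipstatus, count(*) cnt from shipstatus group by shipgrp, shipstatus")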

The examples that I have seen for Spark DataFrames include aggregations over other columns, e.g.:

df.groupBy($"shipgrp", $"shipstatus").agg(sum($"quantity"))

But no other column is needed in my case shown above. So what is the syntax and/or method call combination here?

Update: A reader has suggested this question is a duplicate of "dataframe: how to groupBy/count then filter on count in Scala", but that one is about filtering by count; there is no filtering here.

asked Sep 26 '17 by WestCoastProjects

People also ask

How do you do a groupBy count in PySpark DataFrame?

PySpark's groupBy count is used to get the number of records for each group. To perform the count, first call groupBy() on the DataFrame, which groups the records based on one or more column values, and then call count() to get the number of records in each group.
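
The same two-step pattern applies in the Scala API used in this question; a minimal sketch, reusing the question's column names:

// groupBy, then count(): yields one row per group with a `count` column
df.groupBy("shipgrp", "shipstatus").count()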

How do you count on groupBy?

Use groupby() to group the rows by a column, then call count() to get the count for each group; count() ignores None and NaN values and works with non-floating-point data as well. For example, grouping on a Courses column and counting shows how many times each value is present.
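
For comparison, Spark's count aggregate makes the same distinction: counting a specific column ignores nulls, while count("*") counts every row. A small Scala sketch (the DataFrame and column names here are illustrative, not from the original page):

import org.apache.spark.sql.functions.count
import spark.implicits._  // assumes a SparkSession named `spark`

// count("v") skips the null in group "a"; count("*") counts all rows
val dfNulls = Seq(("a", Some(1)), ("a", None), ("b", Some(2))).toDF("k", "v")
dfNulls.groupBy("k").agg(count("v").as("nonNull"), count("*").as("rows")).show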

How do you count a groupBy in pandas?

The simplest method for a pandas groupby count is the built-in size() method. It returns a pandas Series containing the total row count for each group. size() works much like len() and is therefore not affected by NaN values in the dataset.

How do I get other columns with Spark DataFrame groupBy?

Suppose you have a df that includes columns "name" and "age", and you want to group by these two columns. To get the other columns back after a groupBy, you can join the grouped counts onto the original DataFrame; the joined result will then have all columns, including the count values (see the sketch below).
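
A minimal Scala sketch of that join approach, using the hypothetical df with "name" and "age" columns described above:

// one row per (name, age) group, with a `count` column
val counts = df.groupBy("name", "age").count()
// join the counts back so every original column is retained
val dataJoined = df.join(counts, Seq("name", "age"))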


1 Answer

You can similarly use count("*") inside Spark's agg function:

import org.apache.spark.sql.functions.count

df.groupBy("shipgrp", "shipstatus").agg(count("*").as("cnt"))

For example, with a small demo DataFrame:

import spark.implicits._  // needed for toDF

val df = Seq(("a", 1), ("a", 1), ("b", 2), ("b", 3)).toDF("A", "B")

df.groupBy("A", "B").agg(count("*").as("cnt")).show
+---+---+---+
|  A|  B|cnt|
+---+---+---+
|  b|  2|  1|
|  a|  1|  2|
|  b|  3|  1|
+---+---+---+
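
As an aside, when the row count is the only aggregate you need, the grouped data also offers a count() shortcut; the resulting column is then named count rather than cnt:

df.groupBy("A", "B").count().show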
answered Sep 19 '22 by Psidom