
How to do count(*) within a spark dataframe groupBy

My intention is to do the equivalent of this basic SQL:

select shipgrp, shipstatus, count(*) cnt 
from shipstatus group by shipgrp, shipstatus
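
As an aside, that exact SQL can also be run unchanged by registering the DataFrame as a temporary view; a minimal sketch, assuming a SparkSession named spark and a DataFrame df with the shipgrp and shipstatus columns:

// register the DataFrame under a view name so the SQL's FROM clause resolves
df.createOrReplaceTempView("shipstatus")
val cnt = spark.sql(
  "select shipgrp, shipstatus, count(*) cnt from shipstatus group by shipgrp, shipstatus")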

The examples that I have seen for Spark DataFrames include aggregations over other columns, e.g.:

df.groupBy($"shipgrp", $"shipstatus").agg(sum($"quantity"))

But no other column is needed in my case shown above. So what is the syntax and/or method call combination here?

Update: A reader has suggested this question is a duplicate of "dataframe: how to groupBy/count then filter on count in Scala", but that one is about filtering by count; there is no filtering here.

asked Sep 26 '17 by WestCoastProjects

People also ask

How do you do a groupBy count in PySpark DataFrame?

PySpark's groupBy count is used to get the number of records for each group. To perform the count, first call groupBy() on the DataFrame, which groups the records based on one or more column values, and then call count() to get the number of records in each group.
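
The same two-step pattern applies in the Scala API used in this question; a minimal sketch, reusing the question's column names:

// groupBy, then count(): yields one row per group with a `count` column
df.groupBy("shipgrp", "shipstatus").count()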

How do you count on groupBy?

Use groupby() to group the rows by a column, then call count() to get the count for each group; count() ignores None and NaN values and works with non-floating-point data as well. For example, grouping on a Courses column and counting shows how many times each value is present.
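
For comparison, Spark's count aggregate makes the same distinction: counting a specific column ignores nulls, while count("*") counts every row. A small Scala sketch (the DataFrame and column names here are illustrative, not from the original page):

import org.apache.spark.sql.functions.count
import spark.implicits._  // assumes a SparkSession named `spark`

// count("v") skips the null in group "a"; count("*") counts all rows
val dfNulls = Seq(("a", Some(1)), ("a", None), ("b", Some(2))).toDF("k", "v")
dfNulls.groupBy("k").agg(count("v").as("nonNull"), count("*").as("rows")).show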

How do you count a groupBy in pandas?

The simplest method for a pandas groupby count is the built-in size() method. It returns a pandas Series containing the total row count for each group. size() works much like len() and is therefore not affected by NaN values in the dataset.

How do I get other columns with Spark DataFrame groupBy?

Suppose you have a df that includes columns "name" and "age", and you want to group by these two columns. To get the other columns back after a groupBy, you can join the grouped counts onto the original DataFrame; the joined result will then have all columns, including the count values (see the sketch below).
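
A minimal Scala sketch of that join approach, using the hypothetical df with "name" and "age" columns described above:

// one row per (name, age) group, with a `count` column
val counts = df.groupBy("name", "age").count()
// join the counts back so every original column is retained
val dataJoined = df.join(counts, Seq("name", "age"))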


1 Answer

You can similarly use count("*") inside Spark's agg function:

import org.apache.spark.sql.functions.count

df.groupBy("shipgrp", "shipstatus").agg(count("*").as("cnt"))

For example, with a small demo DataFrame:

import spark.implicits._  // needed for toDF

val df = Seq(("a", 1), ("a", 1), ("b", 2), ("b", 3)).toDF("A", "B")

df.groupBy("A", "B").agg(count("*").as("cnt")).show
+---+---+---+
|  A|  B|cnt|
+---+---+---+
|  b|  2|  1|
|  a|  1|  2|
|  b|  3|  1|
+---+---+---+
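
As an aside, when the row count is the only aggregate you need, the grouped data also offers a count() shortcut; the resulting column is then named count rather than cnt:

df.groupBy("A", "B").count().show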
answered Sep 19 '22 by Psidom