I'm trying to perform multiple operations in one line of code in PySpark, and I'm not sure whether that's possible in my case.
My intention is to avoid having to save the output as a new DataFrame.
My current code is rather simple:
encodeUDF = udf(encode_time, StringType())

new_log_df.cache().withColumn('timePeriod', encodeUDF(col('START_TIME'))) \
    .groupBy('timePeriod') \
    .agg(
        mean('DOWNSTREAM_SIZE').alias("Mean"),
        stddev('DOWNSTREAM_SIZE').alias("Stddev")
    ) \
    .show(20, False)
My intention is to add count() after the groupBy, to get the count of records matching each value of the timePeriod column, printed/shown in the output.
When I try groupBy(..).count().agg(..), I get exceptions.
Is there any way to achieve both the count() and agg().show() prints, without splitting the code into two lines of commands, e.g.:
new_log_df.withColumn(..).groupBy(..).count()
new_log_df.withColumn(..).groupBy(..).agg(..).show()
Or, better yet, getting a merged agg().show() output: an extra column that states the counted number of records matching each row's value, e.g.:
timePeriod | Mean | Stddev | Num Of Records
X          | 10   | 20     | 315
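For reference, here is a minimal stand-in for my setup, so the snippets are reproducible (the encode_time logic and the data are made-up placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, mean, stddev
from pyspark.sql.types import StringType

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Made-up encoder: bucket an HH:MM string into a coarse time period.
def encode_time(ts):
    return "AM" if int(ts[:2]) < 12 else "PM"

encodeUDF = udf(encode_time, StringType())

# Made-up sample of the log data.
new_log_df = spark.createDataFrame(
    [("08:15", 100), ("09:30", 200), ("14:45", 300)],
    ["START_TIME", "DOWNSTREAM_SIZE"],
)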
count() can be used inside agg(), since the groupBy expression is the same.
import pyspark.sql.functions as func

new_log_df.cache().withColumn("timePeriod", encodeUDF(new_log_df["START_TIME"])) \
    .groupBy("timePeriod") \
    .agg(
        func.mean("DOWNSTREAM_SIZE").alias("Mean"),
        func.stddev("DOWNSTREAM_SIZE").alias("Stddev"),
        func.count(func.lit(1)).alias("Num Of Records")
    ) \
    .show(20, False)
Reference: PySpark SQL functions documentation.
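As for why the original groupBy(..).count().agg(..) attempt throws exceptions: count() on grouped data returns a plain DataFrame holding only the grouping column(s) plus a count column, so the original columns are gone by the time agg(..) runs. A minimal sketch, using the made-up setup above:

counted = new_log_df.withColumn("timePeriod", encodeUDF(col("START_TIME"))) \
    .groupBy("timePeriod") \
    .count()

print(counted.columns)  # ['timePeriod', 'count'] -- DOWNSTREAM_SIZE is gone
# counted.agg(func.mean("DOWNSTREAM_SIZE"))  # fails: the column no longer exists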
Scala version:

import org.apache.spark.sql.functions._ // for count()

new_log_df.cache().withColumn("timePeriod", encodeUDF(col("START_TIME")))
  .groupBy("timePeriod")
  .agg(
    mean("DOWNSTREAM_SIZE").alias("Mean"),
    stddev("DOWNSTREAM_SIZE").alias("Stddev"),
    count(lit(1)).alias("Num Of Records")
  )
  .show(20, false)
count(1) counts the records using the constant literal 1, i.e. every row in the group, which here gives the same result as count("timePeriod").
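A subtlety worth noting: count(lit(1)) counts every row in the group, while count on a column name skips nulls, so the two only agree when the counted column has no nulls. A minimal sketch with made-up data:

df = spark.createDataFrame(
    [("X", 10), ("X", None), ("Y", 30)],
    ["timePeriod", "DOWNSTREAM_SIZE"],
)

df.groupBy("timePeriod").agg(
    func.count(func.lit(1)).alias("rows"),           # counts every row in the group
    func.count("DOWNSTREAM_SIZE").alias("non_null")  # null values are skipped
).show()
# 'X' has rows=2 but non_null=1, because one DOWNSTREAM_SIZE is null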
Java version:

import static org.apache.spark.sql.functions.*;

new_log_df.cache().withColumn("timePeriod", encodeUDF(col("START_TIME")))
    .groupBy("timePeriod")
    .agg(
        mean("DOWNSTREAM_SIZE").alias("Mean"),
        stddev("DOWNSTREAM_SIZE").alias("Stddev"),
        count(lit(1)).alias("Num Of Records")
    )
    .show(20, false);