Column alias after groupBy in pyspark

I need the data frame produced by the line below to have the alias "maxDiff" for the max('diff') column after groupBy. However, the line below neither changes the column name nor throws an error.

 grpdf = joined_df.groupBy(temp1.datestamp).max('diff').alias("maxDiff") 
mhn asked Nov 04 '15 07:11



2 Answers

You can use agg instead of calling the max method directly:

    from pyspark.sql.functions import max

    joined_df.groupBy(temp1.datestamp).agg(max("diff").alias("maxDiff"))

Similarly in Scala

    import org.apache.spark.sql.functions.max

    joined_df.groupBy($"datestamp").agg(max("diff").alias("maxDiff"))

or

    joined_df.groupBy($"datestamp").agg(max("diff").as("maxDiff"))
zero323 answered Sep 27 '22 00:09


This happens because you are aliasing the whole DataFrame object, not the Column. Here's an example of how to alias only the Column:

    import pyspark.sql.functions as func

    grpdf = joined_df \
        .groupBy(temp1.datestamp) \
        .max('diff') \
        .select(func.col("max(diff)").alias("maxDiff"))
Nhor answered Sep 23 '22 00:09