I need the data frame produced by the line below to have the alias name "maxDiff" for the max('diff') column after groupBy. However, the line below neither makes any change nor throws an error.
grpdf = joined_df.groupBy(temp1.datestamp).max('diff').alias("maxDiff")
Use alias(): call an SQL aggregate function such as sum() (or max() in this case), which returns a Column, and then use alias() on that Column to rename the resulting DataFrame column.
When we perform groupBy() on a PySpark DataFrame, it returns a GroupedData object, which provides the aggregate functions below.
count() – Use groupBy().count() to return the number of rows for each group.
mean() – Returns the mean of values for each group.
max() – Returns the maximum of values for each group.
You can use agg instead of calling the max method:
from pyspark.sql.functions import max

joined_df.groupBy(temp1.datestamp).agg(max("diff").alias("maxDiff"))
Similarly in Scala:
import org.apache.spark.sql.functions.max

joined_df.groupBy($"datestamp").agg(max("diff").alias("maxDiff"))
or
joined_df.groupBy($"datestamp").agg(max("diff").as("maxDiff"))
This is because you are aliasing the whole DataFrame object, not the Column. Here's an example of how to alias only the Column:
import pyspark.sql.functions as func

grpdf = joined_df \
    .groupBy(temp1.datestamp) \
    .max('diff') \
    .select(func.col("max(diff)").alias("maxDiff"))