Is there any way to get the max value from a column in PySpark other than collect()?

I want to get the maximum value from a date-type column in a PySpark DataFrame. Currently, I am using a command like this:

df.select('col1').distinct().orderBy('col1').collect()[0]['col1']

Here "col1" is the datetime type column. It works fine but I want to avoid the use of collect() here as i am doubtful that my driver may get overflowed.

Any advice would be helpful.

asked Oct 22 '25 by Samyak Jain

1 Answer

No need to sort; you can just select the maximum (note the import, so Spark's max is used rather than Python's built-in):

from pyspark.sql.functions import col, max as spark_max

res = df.select(spark_max(col('col1')).alias('max_col1')).first().max_col1

Or you can use selectExpr:

res = df.selectExpr('max(col1) as max_col1').first().max_col1
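
Either way, the aggregation runs on the executors and first() brings back a single row, so the driver never has to hold the full column. For reference, a minimal self-contained sketch (assuming a local SparkSession; the DataFrame and sample dates here are made up for illustration):

from datetime import date
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, max as spark_max

spark = SparkSession.builder.master('local[*]').getOrCreate()

# Toy DataFrame with a date column, standing in for the asker's df.
df = spark.createDataFrame(
    [(date(2021, 1, 5),), (date(2021, 3, 17),), (date(2020, 12, 31),)],
    ['col1'],
)

# Only the single aggregated row is returned to the driver.
res = df.select(spark_max(col('col1')).alias('max_col1')).first().max_col1
print(res)  # 2021-03-17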
answered Oct 23 '25 by ernest_k

