I want to get the maximum value from a date type column in a PySpark DataFrame. Currently, I am using a command like this:
df.select('col1').distinct().orderBy('col1').collect()[0]['col1']
Here "col1" is the datetime type column. It works fine but I want to avoid the use of collect() here as i am doubtful that my driver may get overflowed.
Any advice would be helpful.
No need to sort; you can just select the maximum:
from pyspark.sql.functions import col, max
res = df.select(max(col('col1')).alias('max_col1')).first().max_col1
Or you can use selectExpr:
res = df.selectExpr('max(col1) as max_col1').first().max_col1
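For reference, here is a minimal, self-contained sketch of the same idea using agg; the SparkSession setup and the sample dates are illustrative assumptions, not part of your code. The aggregation runs on the executors, so only a single row (the maximum) is sent back to the driver:

from datetime import date
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Small illustrative DataFrame with a date column named col1
df = spark.createDataFrame(
    [(date(2020, 1, 1),), (date(2021, 6, 15),), (date(2019, 12, 31),)],
    ['col1'],
)

# Aggregate on the executors; first() pulls back just one row
max_date = df.agg(F.max('col1').alias('max_col1')).first()['max_col1']
print(max_date)  # 2021-06-15

Any of these forms avoids collecting the full column, since the reduction happens before anything reaches the driver.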