I have the data in the dataframe as below:
```
datetime             | userId | memberId | value
2016-04-06 16:36:... | 1234   | 111      | 1
2016-04-06 17:35:... | 1234   | 222      | 5
2016-04-06 17:50:... | 1234   | 111      | 8
2016-04-06 18:36:... | 1234   | 222      | 9
2016-04-05 16:36:... | 4567   | 111      | 1
2016-04-06 17:35:... | 4567   | 222      | 5
2016-04-06 18:50:... | 4567   | 111      | 8
2016-04-06 19:36:... | 4567   | 222      | 9
```
I need to find the max(datetime) grouped by userId, memberId. When I tried the following:
df2 = df.groupBy('userId','memberId').max('datetime')
I'm getting error as:
org.apache.spark.sql.AnalysisException: "datetime" is not a numeric column. Aggregation function can only be applied on a numeric column.;
The output I desired is as follows:
```
userId | memberId | datetime
1234   | 111      | 2016-04-06 17:50:...
1234   | 222      | 2016-04-06 18:36:...
4567   | 111      | 2016-04-06 18:50:...
4567   | 222      | 2016-04-06 19:36:...
```
Can someone please help me get the max date from this data using PySpark DataFrames?
For non-numeric but orderable types you can use agg with max directly:
```python
from pyspark.sql.functions import col, max as max_

df = sc.parallelize([
    ("2016-04-06 16:36", 1234, 111, 1),
    ("2016-04-06 17:35", 1234, 111, 5),
]).toDF(["datetime", "userId", "memberId", "value"])

(df
 .withColumn("datetime", col("datetime").cast("timestamp"))
 .groupBy("userId", "memberId")
 .agg(max_("datetime")))

## +------+--------+--------------------+
## |userId|memberId|       max(datetime)|
## +------+--------+--------------------+
## |  1234|     111|2016-04-06 17:35:...|
## +------+--------+--------------------+
```