 

How to get max(date) from a given set of data grouped by some fields using PySpark?

I have the following data in a DataFrame:

datetime             | userId | memberId | value
2016-04-06 16:36:... | 1234   | 111      | 1
2016-04-06 17:35:... | 1234   | 222      | 5
2016-04-06 17:50:... | 1234   | 111      | 8
2016-04-06 18:36:... | 1234   | 222      | 9
2016-04-05 16:36:... | 4567   | 111      | 1
2016-04-06 17:35:... | 4567   | 222      | 5
2016-04-06 18:50:... | 4567   | 111      | 8
2016-04-06 19:36:... | 4567   | 222      | 9

I need to find max(datetime) grouped by userId and memberId. When I tried:

df2 = df.groupBy('userId','memberId').max('datetime') 

I got this error:

org.apache.spark.sql.AnalysisException: "datetime" is not a numeric column. Aggregation function can only be applied on a numeric column.; 

The desired output is:

userId | memberId | datetime
1234   | 111      | 2016-04-06 17:50:...
1234   | 222      | 2016-04-06 18:36:...
4567   | 111      | 2016-04-06 18:50:...
4567   | 222      | 2016-04-06 19:36:...

Can someone please help me get the max date from the given data using PySpark DataFrames?

asked Jul 14 '16 15:07 by Bhuvan

People also ask

How do you find the max date in PySpark?

Use the max() function to compute the maximum value of a column, then call collect() to retrieve the result on the driver, e.g. df.select(max(column_name)).collect(), where df is the input PySpark DataFrame and column_name is the column to take the maximum of.
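A minimal sketch of that pattern; the DataFrame and the column name datetime here are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# toy DataFrame with a single timestamp column
df = spark.createDataFrame(
    [("2016-04-06 16:36:00",), ("2016-04-06 18:36:00",)], ["datetime"]
).withColumn("datetime", F.col("datetime").cast("timestamp"))

# aggregate with max(), then collect() to pull the value to the driver
max_dt = df.select(F.max("datetime")).collect()[0][0]
print(max_dt)  # datetime.datetime(2016, 4, 6, 18, 36)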

How do you get the latest date in PySpark?

Spark SQL provides the last_day() function, which returns the last day of the month for an input date in yyyy-MM-dd format. For example, 2019-01-31 is returned for the input date 2019-01-25, since 31 is the last day of January.
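A short sketch of last_day(), using the sample date from the description above (the column name d is illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2019-01-25",)], ["d"])

# last_day() maps each date to the last day of its month
df.select(F.last_day(F.col("d").cast("date")).alias("month_end")).show()
## +----------+
## | month_end|
## +----------+
## |2019-01-31|
## +----------+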

What is Spark hash?

Spark provides a few hash functions, such as md5, sha1, and sha2 (covering SHA-224, SHA-256, SHA-384, and SHA-512). These functions can be used in Spark SQL or in DataFrame transformations via PySpark, Scala, etc.
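A hedged sketch of all three in PySpark (the DataFrame and column name s are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("hello",)], ["s"])

# md5 and sha1 take a column; sha2 also takes a bit length
# (224, 256, 384, or 512; 0 is treated as 256)
df.select(
    F.md5(F.col("s")).alias("md5"),
    F.sha1(F.col("s")).alias("sha1"),
    F.sha2(F.col("s"), 256).alias("sha256"),
).show(truncate=False)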




1 Answer

For non-numeric but orderable types, you can use agg with max directly:

from pyspark.sql.functions import col, max as max_

df = sc.parallelize([
    ("2016-04-06 16:36", 1234, 111, 1),
    ("2016-04-06 17:35", 1234, 111, 5),
]).toDF(["datetime", "userId", "memberId", "value"])

(df.withColumn("datetime", col("datetime").cast("timestamp"))
    .groupBy("userId", "memberId")
    .agg(max_("datetime")))

## +------+--------+--------------------+
## |userId|memberId|       max(datetime)|
## +------+--------+--------------------+
## |  1234|     111|2016-04-06 17:35:...|
## +------+--------+--------------------+
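Applied to the full sample from the question, the same approach yields the four desired rows. This sketch swaps sc.parallelize for spark.createDataFrame and adds an alias so the aggregated column is named datetime, matching the desired output:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, max as max_

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    ("2016-04-06 16:36:00", 1234, 111, 1),
    ("2016-04-06 17:35:00", 1234, 222, 5),
    ("2016-04-06 17:50:00", 1234, 111, 8),
    ("2016-04-06 18:36:00", 1234, 222, 9),
    ("2016-04-05 16:36:00", 4567, 111, 1),
    ("2016-04-06 17:35:00", 4567, 222, 5),
    ("2016-04-06 18:50:00", 4567, 111, 8),
    ("2016-04-06 19:36:00", 4567, 222, 9),
], ["datetime", "userId", "memberId", "value"])

# cast the string column to timestamp, then take max per group
(df.withColumn("datetime", col("datetime").cast("timestamp"))
   .groupBy("userId", "memberId")
   .agg(max_("datetime").alias("datetime"))
   .show())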
answered Oct 15 '22 00:10 by zero323