I have the data in the dataframe as below:
```
datetime             | userId | memberId | value
2016-04-06 16:36:... | 1234   | 111      | 1
2016-04-06 17:35:... | 1234   | 222      | 5
2016-04-06 17:50:... | 1234   | 111      | 8
2016-04-06 18:36:... | 1234   | 222      | 9
2016-04-05 16:36:... | 4567   | 111      | 1
2016-04-06 17:35:... | 4567   | 222      | 5
2016-04-06 18:50:... | 4567   | 111      | 8
2016-04-06 19:36:... | 4567   | 222      | 9
```
I need to find the max(datetime) grouped by userId, memberId. When I tried the following:
df2 = df.groupBy('userId','memberId').max('datetime')
I'm getting error as:
org.apache.spark.sql.AnalysisException: "datetime" is not a numeric column. Aggregation function can only be applied on a numeric column.;
The output I desired is as follows:
```
userId | memberId | datetime
1234   | 111      | 2016-04-06 17:50:...
1234   | 222      | 2016-04-06 18:36:...
4567   | 111      | 2016-04-06 18:50:...
4567   | 222      | 2016-04-06 19:36:...
```
Can someone please help me get the max date from this data using PySpark DataFrames?
For non-numeric but orderable types you can use agg with max directly:
```python
from pyspark.sql.functions import col, max as max_

df = sc.parallelize([
    ("2016-04-06 16:36", 1234, 111, 1),
    ("2016-04-06 17:35", 1234, 111, 5),
]).toDF(["datetime", "userId", "memberId", "value"])

(df
 .withColumn("datetime", col("datetime").cast("timestamp"))
 .groupBy("userId", "memberId")
 .agg(max_("datetime")))

## +------+--------+--------------------+
## |userId|memberId|       max(datetime)|
## +------+--------+--------------------+
## |  1234|     111|2016-04-06 17:35:...|
## +------+--------+--------------------+
```