Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to use spark lag and lead over group by and order by

i use : `

dataset.withColumn("lead",lead(dataset.col(start_date),1).over(orderBy(start_date)));

` i just want to add group by trackId so lead work over each group by as any agg function :

+----------+---------------------------------------------+
|  trackId |  start_time    |  end_time   |      lead    |
+-----+--------------------------------------------------+
|  1       | 12:00:00       |   12:04:00  |     12:05:00 |
+----------+---------------------------------------------+
|  1       | 12:05:00       |   12:08:00  |    12:20:00  |  
+----------+---------------------------------------------+
|  1       | 12:20:00       |   12:22:00  |     null     | 
+----------+---------------------------------------------+
|  2       | 13:00:00       |   13:04:00  |    13:05:00 |
+----------+---------------------------------------------+
|  2       | 13:05:00       |   13:08:00  |    13:20:00  |  
+----------+---------------------------------------------+
|  2       | 13:20:00       |   13:22:00  |     null     | 
+----------+---------------------------------------------+

any help how to do that ?

like image 219
sandevfares Avatar asked May 01 '18 08:05

sandevfares


People also ask

How do you use lead and lag functions?

The LEAD function is used to access data from SUBSEQUENT rows along with data from the current row. The LAG function is used to access data from PREVIOUS rows along with data from the current row. An ORDER BY clause is required when working with LEAD and LAG functions, but a PARTITION BY clause is optional.

What is spark lag and lead?

Spark. Lag is used to access data from n rows prior, and allows the current row to access that data. Lead follows the same logic, but is used to access data from n rows ahead. An example use case is finding previous and next order dates for customers.

What is difference between ORDER BY and sort by in spark?

Description. The SORT BY clause is used to return the result rows sorted within each partition in the user specified order. When there is more than one partition SORT BY may return result that is partially ordered. This is different than ORDER BY clause which guarantees a total order of the output.

Can we use lead and lag without over clause?

Just like LAG() , LEAD() is a window function and requires an OVER clause. And as with LAG() , LEAD() must be accompanied by an ORDER BY in the OVER clause.


1 Answers

All you are missing is the Window keyword and partitionBy method call

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
dataset.withColumn("lead",lead(col("start_time"),1).over(Window.partitionBy("trackId").orderBy("start_time")))
like image 111
Ramesh Maharjan Avatar answered Nov 10 '22 14:11

Ramesh Maharjan