i use : `
dataset.withColumn("lead",lead(dataset.col(start_date),1).over(orderBy(start_date)));
` i just want to add group by trackId so lead work over each group by as any agg function :
+----------+---------------------------------------------+
| trackId | start_time | end_time | lead |
+-----+--------------------------------------------------+
| 1 | 12:00:00 | 12:04:00 | 12:05:00 |
+----------+---------------------------------------------+
| 1 | 12:05:00 | 12:08:00 | 12:20:00 |
+----------+---------------------------------------------+
| 1 | 12:20:00 | 12:22:00 | null |
+----------+---------------------------------------------+
| 2 | 13:00:00 | 13:04:00 | 13:05:00 |
+----------+---------------------------------------------+
| 2 | 13:05:00 | 13:08:00 | 13:20:00 |
+----------+---------------------------------------------+
| 2 | 13:20:00 | 13:22:00 | null |
+----------+---------------------------------------------+
any help how to do that ?
The LEAD function is used to access data from SUBSEQUENT rows along with data from the current row. The LAG function is used to access data from PREVIOUS rows along with data from the current row. An ORDER BY clause is required when working with LEAD and LAG functions, but a PARTITION BY clause is optional.
Spark. Lag is used to access data from n rows prior, and allows the current row to access that data. Lead follows the same logic, but is used to access data from n rows ahead. An example use case is finding previous and next order dates for customers.
Description. The SORT BY clause is used to return the result rows sorted within each partition in the user specified order. When there is more than one partition SORT BY may return result that is partially ordered. This is different than ORDER BY clause which guarantees a total order of the output.
Just like LAG() , LEAD() is a window function and requires an OVER clause. And as with LAG() , LEAD() must be accompanied by an ORDER BY in the OVER clause.
All you are missing is the Window
keyword and partitionBy
method call
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
dataset.withColumn("lead",lead(col("start_time"),1).over(Window.partitionBy("trackId").orderBy("start_time")))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With