How to use Spark lag and lead over group by and order by

Tags: apache-spark, apache-spark-sql, apache-spark-dataset

I use:

`dataset.withColumn("lead", lead(dataset.col(start_date), 1).over(orderBy(start_date)));`

I just want to add a group by on trackId, so that lead works within each group the way an aggregate function would:

+---------+------------+----------+----------+
| trackId | start_time | end_time | lead     |
+---------+------------+----------+----------+
| 1       | 12:00:00   | 12:04:00 | 12:05:00 |
| 1       | 12:05:00   | 12:08:00 | 12:20:00 |
| 1       | 12:20:00   | 12:22:00 | null     |
| 2       | 13:00:00   | 13:04:00 | 13:05:00 |
| 2       | 13:05:00   | 13:08:00 | 13:20:00 |
| 2       | 13:20:00   | 13:22:00 | null     |
+---------+------------+----------+----------+

Any help on how to do that?

219

asked May 01 '18 08:05

sandevfares

1 Answer

All you are missing is the `Window` object and a `partitionBy` call:

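import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead}

// Partition by trackId so lead() restarts for every track; order rows by start_time inside each one
val byTrack = Window.partitionBy("trackId").orderBy("start_time")
dataset.withColumn("lead", lead(col("start_time"), 1).over(byTrack))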
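
To make it clearer what a window partitioned by `trackId` and ordered by `start_time` computes, here is a rough plain-Scala emulation on the data from the question. It does not use Spark at all; `TrackRow`, `withLeadLag` and `sample` are hypothetical names made up for this sketch, and the times are compared as `HH:mm:ss` strings, which is an assumption about the column format.

```scala
object LeadLagSketch {
  // Hypothetical row type mirroring the columns in the question
  case class TrackRow(trackId: Int, startTime: String, endTime: String)

  // Sample data from the question, deliberately shuffled
  val sample: Seq[TrackRow] = Seq(
    TrackRow(2, "13:05:00", "13:08:00"),
    TrackRow(1, "12:20:00", "12:22:00"),
    TrackRow(1, "12:00:00", "12:04:00"),
    TrackRow(2, "13:20:00", "13:22:00"),
    TrackRow(1, "12:05:00", "12:08:00"),
    TrackRow(2, "13:00:00", "13:04:00")
  )

  // Returns each row with (lead, lag) of startTime inside its own track
  def withLeadLag(rows: Seq[TrackRow]): Seq[(TrackRow, Option[String], Option[String])] =
    rows
      .groupBy(_.trackId)                         // plays the role of partitionBy("trackId")
      .toSeq
      .sortBy(_._1)                               // only to make the output order deterministic
      .flatMap { case (_, group) =>
        val ordered = group.sortBy(_.startTime)   // plays the role of orderBy("start_time")
        val starts  = ordered.map(_.startTime)
        ordered.zipWithIndex.map { case (row, i) =>
          val lead = starts.lift(i + 1)           // next start in the same track, None on the last row
          val lag  = starts.lift(i - 1)           // previous start in the same track, None on the first row
          (row, lead, lag)
        }
      }

  def main(args: Array[String]): Unit =
    withLeadLag(sample).foreach { case (r, lead, lag) =>
      println(s"${r.trackId} ${r.startTime} ${r.endTime} lead=${lead.orNull} lag=${lag.orNull}")
    }
}
```

The `None` values correspond to the `null` cells in the expected table: the last row of each track has no following row, so `lead` never reaches into the next track. In Spark itself, `lag(col("start_time"), 1)` from `org.apache.spark.sql.functions` can be applied over the same window to get the previous value instead of the next one.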

111

answered Nov 10 '22 14:11

Ramesh Maharjan
