I am trying to compare each row's record with the previous row's in the DataFrame below, in order to calculate the AMOUNT column.
scala> val dataset = sc.parallelize(Seq((1, 123, 50), (2, 456, 30), (3, 456, 70), (4, 789, 80))).toDF("SL_NO","ID","AMOUNT")
scala> dataset.show
+-----+---+------+
|SL_NO| ID|AMOUNT|
+-----+---+------+
|    1|123|    50|
|    2|456|    30|
|    3|456|    70|
|    4|789|    80|
+-----+---+------+
Calculation Logic:
If the current row's ID is the same as the previous row's ID, the current row's AMOUNT should be replaced with the previous row's AMOUNT (e.g. SL_NO 3 takes 30 from SL_NO 2, since both have ID 456). The same logic needs to be followed for the other rows as well.
Expected Output:
+-----+---+------+
|SL_NO| ID|AMOUNT|
+-----+---+------+
|    1|123|    50|
|    2|456|    30|
|    3|456|    30|
|    4|789|    80|
+-----+---+------+
Please help.
You could use lag with when.otherwise; here is a demonstration:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, when}

// Order all rows by SL_NO so lag() refers to the immediately preceding row.
val w = Window.orderBy($"SL_NO")
dataset.withColumn("AMOUNT",
  // Carry the previous row's AMOUNT forward when the ID repeats; otherwise keep it.
  when($"ID" === lag($"ID", 1).over(w), lag($"AMOUNT", 1).over(w)).otherwise($"AMOUNT")
).show
+-----+---+------+
|SL_NO| ID|AMOUNT|
+-----+---+------+
|    1|123|    50|
|    2|456|    30|
|    3|456|    30|
|    4|789|    80|
+-----+---+------+
Note: since this example doesn't use any partitioning, Spark moves all rows into a single partition to compute the window, which can cause performance problems on real data. It would help if your problem can be partitioned by some column, e.g. Window.partitionBy($"ID").orderBy($"SL_NO") as in the sketch below, depending on your actual problem and on whether rows with the same ID appear consecutively.
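For example, a minimal sketch of the partitioned variant (wPart is just an illustrative name; this assumes rows sharing an ID are consecutive in SL_NO order, otherwise the result can differ from the unpartitioned version):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, when}

// Each ID becomes its own partition, so Spark can distribute the window
// computation instead of collecting every row into a single partition.
val wPart = Window.partitionBy($"ID").orderBy($"SL_NO")
dataset.withColumn("AMOUNT",
  when($"ID" === lag($"ID", 1).over(wPart), lag($"AMOUNT", 1).over(wPart))
    .otherwise($"AMOUNT") // first row of each ID has no lag, so it keeps its AMOUNT
).show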