I created a DataFrame in Spark by grouping on column1 and date and summing the amount:
val table = df1.groupBy($"column1",$"date").sum("amount")
+-------+------+------+
|Column1|Date  |Amount|
+-------+------+------+
|A      |1-jul |1000  |
|A      |1-june|2000  |
|A      |1-May |2000  |
|A      |1-dec |3000  |
|A      |1-Nov |2000  |
|B      |1-jul |100   |
|B      |1-june|300   |
|B      |1-May |400   |
|B      |1-dec |300   |
+-------+------+------+
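For reference, something like this produces a DataFrame shaped like the table above (a minimal sketch; the raw values and column names are assumptions, and a SparkSession named spark is assumed to be in scope):
import spark.implicits._

// Toy input data; (A, 1-jul) appears twice just so the groupBy sum is visible
val df1 = Seq(
  ("A", "1-jul", 500), ("A", "1-jul", 500),
  ("A", "1-june", 2000), ("A", "1-May", 2000),
  ("A", "1-dec", 3000), ("A", "1-Nov", 2000),
  ("B", "1-jul", 100), ("B", "1-june", 300),
  ("B", "1-May", 400), ("B", "1-dec", 300)
).toDF("column1", "date", "amount")

val table = df1.groupBy($"column1", $"date").sum("amount")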
Now I want to add a new column with the difference between the amounts of any two dates from the table.
You can use a Window function if the calculation is fixed, such as computing the difference with the previous month, or with the month two rows back, etc. For that you can use the lag and lead functions with a Window specification.
But for that you first need to convert the date column as below so that it can be ordered.
+-------+------+--------------+------+
|Column1|Date |Date_Converted|Amount|
+-------+------+--------------+------+
|A |1-jul |2017-07-01 |1000 |
|A |1-june|2017-06-01 |2000 |
|A |1-May |2017-05-01 |2000 |
|A |1-dec |2017-12-01 |3000 |
|A |1-Nov |2017-11-01 |2000 |
|B |1-jul |2017-07-01 |100 |
|B |1-june|2017-06-01 |300 |
|B |1-May |2017-05-01 |400 |
|B |1-dec |2017-12-01 |300 |
+-------+------+--------------+------+
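One way to derive such a Date_Converted column is sketched below (an assumption-heavy sketch, not the only way: it assumes every value looks like "1-<month>", that all dates fall in 2017, and that the month spellings are limited to the ones in the sample data):
import org.apache.spark.sql.functions._

// Map the month spellings seen in the data to month numbers (assumption)
val monthNum = Map(
  "jan" -> 1, "feb" -> 2, "mar" -> 3, "apr" -> 4, "may" -> 5, "june" -> 6,
  "jul" -> 7, "aug" -> 8, "sep" -> 9, "oct" -> 10, "nov" -> 11, "dec" -> 12
)

// Turn "1-jul" into "2017-07-01" so it can be parsed and ordered
val toIsoDate = udf { d: String =>
  val Array(day, month) = d.split("-")
  f"2017-${monthNum(month.toLowerCase)}%02d-${day.toInt}%02d"
}

// df is the grouped DataFrame with an added, orderable date column
val df = table.withColumn("Date_Converted", to_date(toIsoDate($"Date")))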
You can find the difference between the previous month and the current month by doing
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("Column1").orderBy("Date_Converted")
import org.apache.spark.sql.functions._
df.withColumn("diff_Amt_With_Prev_Month", $"Amount" - when((lag("Amount", 1).over(windowSpec)).isNull, 0).otherwise(lag("Amount", 1).over(windowSpec)))
.show(false)
You should have
+-------+------+--------------+------+------------------------+
|Column1|Date |Date_Converted|Amount|diff_Amt_With_Prev_Month|
+-------+------+--------------+------+------------------------+
|B |1-May |2017-05-01 |400 |400.0 |
|B |1-june|2017-06-01 |300 |-100.0 |
|B |1-jul |2017-07-01 |100 |-200.0 |
|B |1-dec |2017-12-01 |300 |200.0 |
|A |1-May |2017-05-01 |2000 |2000.0 |
|A |1-june|2017-06-01 |2000 |0.0 |
|A |1-jul |2017-07-01 |1000 |-1000.0 |
|A |1-Nov |2017-11-01 |2000 |1000.0 |
|A |1-dec |2017-12-01 |3000 |1000.0 |
+-------+------+--------------+------+------------------------+
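The when/isNull guard above only handles the first row of each partition, where lag returns null. The same expression can be written a bit more compactly with coalesce; this is just an equivalent variant (reusing df and windowSpec from above), not a different result:
df.withColumn(
  "diff_Amt_With_Prev_Month",
  $"Amount" - coalesce(lag("Amount", 1).over(windowSpec), lit(0))
).show(false)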
You can increase the lag offset to 2 to get the difference with the month two rows back:
df.withColumn("diff_Amt_With_Prev_two_Month", $"Amount" - when((lag("Amount", 2).over(windowSpec)).isNull, 0).otherwise(lag("Amount", 2).over(windowSpec)))
.show(false)
which will give you
+-------+------+--------------+------+----------------------------+
|Column1|Date |Date_Converted|Amount|diff_Amt_With_Prev_two_Month|
+-------+------+--------------+------+----------------------------+
|B |1-May |2017-05-01 |400 |400.0 |
|B |1-june|2017-06-01 |300 |300.0 |
|B |1-jul |2017-07-01 |100 |-300.0 |
|B |1-dec |2017-12-01 |300 |0.0 |
|A |1-May |2017-05-01 |2000 |2000.0 |
|A |1-june|2017-06-01 |2000 |2000.0 |
|A |1-jul |2017-07-01 |1000 |-1000.0 |
|A |1-Nov |2017-11-01 |2000 |0.0 |
|A |1-dec |2017-12-01 |3000 |2000.0 |
+-------+------+--------------+------+----------------------------+
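If you instead need the difference with the following month, lead works symmetrically (a sketch reusing df and windowSpec from above; this output is not shown here):
df.withColumn(
  "diff_Amt_With_Next_Month",
  $"Amount" - coalesce(lead("Amount", 1).over(windowSpec), lit(0))
).show(false)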
I hope the answer is helpful.