spark sql window function lag

I am looking at the window lag function for a Spark DataFrame in Scala.

I have a DataFrame with columns Col1, Col2, Col3, date, volume and new_col.

Col1    Col2    Col3    date     volume new_col
                        201601  100.5   
                        201602  120.6   100.5
                        201603  450.2   120.6
                        201604  200.7   450.2
                        201605  121.4   200.7

Now I want to add a new column named new_col that contains the volume shifted down by one row, as shown above.

I tried the option below, using the window function.

val windSldBrdrxNrx_df = df.withColumn("Prev_brand_rx", lag("Prev_brand_rx",1))

Do you have any suggestions?

asked Dec 15 '16 by Ramesh

People also ask

What is LAG in Spark SQL?

Spark's LAG function provides access to a row at a given offset that comes before the current row in the window. It can be used in a SELECT statement to compare values in the current row with values in a previous row.
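
For instance, here is a minimal sketch of LAG in a Spark SQL SELECT; the sales view and prev_volume column are illustrative names, not taken from the question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lag-example").master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative data: monthly volumes keyed by date
val sales = Seq((201601, 100.5), (201602, 120.6), (201603, 450.2)).toDF("date", "volume")
sales.createOrReplaceTempView("sales")

// LAG(volume, 1) returns the previous row's volume in the ORDER BY date ordering;
// the first row has no previous row and gets NULL
spark.sql(
  """SELECT date, volume,
    |       LAG(volume, 1) OVER (ORDER BY date) AS prev_volume
    |  FROM sales""".stripMargin).show()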

Is LAG a window function in SQL?

SQL Server's LAG() is a window function that provides access to a row at a specified physical offset which comes before the current row.

Does Spark SQL support window functions?

Spark SQL supports three kinds of window functions: ranking functions, analytic functions, and aggregate functions.
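
As an illustration, here is a sketch with one window function of each kind; it assumes the question's df with date and volume columns:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, lag, sum}

val w = Window.orderBy("date")

df.withColumn("rnk", rank().over(w))                   // ranking function
  .withColumn("prev_volume", lag("volume", 1).over(w)) // analytic function
  .withColumn("running_total", sum("volume").over(w))  // aggregate function over a window
  .show()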

How do window functions work in Spark?

Spark window functions are used to calculate results such as the rank or row number over a range of input rows, and they become available by importing org.apache.spark.sql.functions and org.apache.spark.sql.expressions.Window.
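
A minimal sketch of those imports together with a partitioned window spec; the Col1 grouping and the row_num column are assumptions for illustration only:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Assumed: number rows within each Col1 group, ordered by date
val byCol1 = Window.partitionBy("Col1").orderBy("date")

df.withColumn("row_num", row_number().over(byCol1)).show()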


1 Answer

You are doing it correctly; all you missed is over(window expression) on lag:

// Sample data with the question's date and volume columns
val df = sc.parallelize(Seq((201601, 100.5),
  (201602, 120.6),
  (201603, 450.2),
  (201604, 200.7),
  (201605, 121.4))).toDF("date", "volume")

// Window specification ordered by date
val w = org.apache.spark.sql.expressions.Window.orderBy("date")

import org.apache.spark.sql.functions.lag

// lag("volume", 1, 0): the previous row's volume within the window, 0 when there is none
val leadDf = df.withColumn("new_col", lag("volume", 1, 0).over(w))

leadDf.show()

+------+------+-------+
|  date|volume|new_col|
+------+------+-------+
|201601| 100.5|    0.0|
|201602| 120.6|  100.5|
|201603| 450.2|  120.6|
|201604| 200.7|  450.2|
|201605| 121.4|  200.7|
+------+------+-------+

This code was run on Spark shell 2.0.2
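
One caveat worth noting: Window.orderBy without partitionBy pulls all rows into a single partition, so Spark logs a performance warning. If the real data should be shifted independently per group (per Col1, for example, which is an assumption here), a partitioned window spec avoids that:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

// Assumption: the shift should restart within each Col1 group
val wPart = Window.partitionBy("Col1").orderBy("date")

// The third argument (0) is the default when a row has no predecessor; drop it to get null instead
df.withColumn("new_col", lag("volume", 1, 0).over(wPart)).show()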

answered Nov 10 '22 by mrsrinivas