How to calculate lag difference in Spark Structured Streaming?

I am writing a Spark Structured Streaming program. I need to create an additional column with the lag difference.

The snippet below reproduces the issue. It reads a data.json file stored in the data folder:

[
  {"id": 77,"type": "person","timestamp": 1532609003},
  {"id": 77,"type": "person","timestamp": 1532609005},
  {"id": 78,"type": "crane","timestamp": 1532609005}
]

Code:

from pyspark.sql import SparkSession
import pyspark.sql.functions as func
from pyspark.sql.window import Window
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("Test") \
    .master("local[2]") \
    .getOrCreate()

schema = StructType([
    StructField("id", IntegerType()),
    StructField("type", StringType()),
    StructField("timestamp", LongType())
])

ds = spark \
    .readStream \
    .format("json") \
    .schema(schema) \
    .load("data/")

diff_window = Window.partitionBy("id").orderBy("timestamp")
ds = ds.withColumn("prev_timestamp", func.lag(ds.timestamp).over(diff_window))

query = ds \
    .writeStream \
    .format('console') \
    .start()

query.awaitTermination()

I get this error:

pyspark.sql.utils.AnalysisException: u'Non-time-based windows are not supported on streaming DataFrames/Datasets;;\nWindow [lag(timestamp#71L, 1, null) windowspecdefinition(host_id#68, timestamp#71L ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS prev_timestamp#129L]

asked Nov 23 '18 16:11

Mozimaki

1 Answer

pyspark.sql.utils.AnalysisException: u'Non-time-based windows are not supported on streaming DataFrames/Datasets

This means the window must be based on a timestamp column. For example, if you have one data point per second and define a 30-second window with a 10-second slide, the result contains a new window column with start and end fields, whose timestamps are 30 seconds apart.
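To make the 30s/10s example concrete, here is a small plain-Python sketch of which windows a single event falls into. It assumes window starts are aligned to the Unix epoch on multiples of the slide, with an inclusive start and exclusive end, which is my understanding of Spark's default behaviour; the function name sliding_windows is made up for illustration.

```python
def sliding_windows(ts, window=30, slide=10):
    """Return the (start, end) pairs of every window that contains ts (in seconds)."""
    # Latest window start at or before ts, assuming epoch-aligned starts.
    start = ts - (ts % slide)
    windows = []
    # Walk backwards while the window [start, start + window) still covers ts.
    while start > ts - window:
        windows.append((start, start + window))
        start -= slide
    return sorted(windows)


# First timestamp from the question's data.json
for start, end in sliding_windows(1532609003):
    print(start, end)
```

With a 30-second window and a 10-second slide, each event lands in three overlapping windows, so a groupBy on the window column emits up to three rows per event.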

You should use the window in this way:

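from pyspark.sql import functions as F

# `words` is assumed to be a streaming DataFrame with `date_time` and `value` columns
words = words.withColumn('date_time', F.col('date_time').cast('timestamp'))

# 30-second windows that slide every 10 seconds
w = F.window('date_time', '30 seconds', '10 seconds')
# The watermark must be set on the same event-time column the window uses
words = words \
   .withWatermark('date_time', '1 minutes') \
   .groupBy(w).agg(F.mean('value'))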

answered Sep 21 '22 21:09

pissall

Tags:

apache-spark

apache-spark-sql

pyspark

spark-structured-streaming
