We have a Spark Streaming application running on a YARN cluster. It receives messages from Kafka topics.
Our processing time is currently longer than the batch interval:
Batch interval: 1 minute
Processing time: 5 minutes
I would like to know what happens if data arrives while a batch is still being processed: will that data stay available in memory until the processing is over, or will it be overwritten by the subsequent fetch?
We are using the direct streaming approach to fetch data from Kafka topics.
Should I go with window-based operations? For example, with a window length of 5 minutes, a sliding interval of 2 minutes, and a batch interval of 1 minute, will it work (see the sketch below)? We cannot lose any data in our application.
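Roughly, the windowed setup I am considering looks like the sketch below (the socket source and word-count logic are only placeholders; in the real job the input comes from Kafka):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

// 1-minute batch interval, as described above
val conf = new SparkConf().setAppName("window-sketch")
val ssc  = new StreamingContext(conf, Minutes(1))

// Placeholder input; the real job reads from Kafka via the direct approach
val lines = ssc.socketTextStream("localhost", 9999)

// 5-minute window, recomputed every 2 minutes, built out of the 1-minute batches
val windowedCounts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1L))
  .reduceByKeyAndWindow(_ + _, Minutes(5), Minutes(2))

windowedCounts.print()
ssc.start()
ssc.awaitTermination()
```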
The batch interval tells Spark how long to collect data before forming a batch: if it is 1 minute, each batch contains the data received during the last minute (source: spark.apache.org). So the data pours in as a continuous stream that is divided into batches; this continuous stream of data is called a DStream.
Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
The minimum batch size Spark Streaming can use is 500 milliseconds; it has proven to be a good minimum size for many applications. The best approach is to start with a larger batch size (around 10 seconds) and work your way down to a smaller one.
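For instance, a minimal sketch of how the batch interval is set (the 60-second value simply mirrors the 1-minute interval from the question):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The Duration passed here is the batch interval: with Seconds(60), Spark
// Streaming cuts the incoming stream into 1-minute batches, and each batch
// is handed to the Spark engine as an RDD of that minute's data.
val conf = new SparkConf().setAppName("batch-interval-sketch")
val ssc  = new StreamingContext(conf, Seconds(60))
```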
With Structured Streaming, you can define DataFrames and work with them just as you would in a batch job, but the data is processed differently. One important thing to note is that Structured Streaming does not process data in real time but in near real time.
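As a rough sketch of what that looks like (the topic name "events", the broker address, and the checkpoint path are assumptions, not values from the question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("structured-sketch").getOrCreate()

// The Kafka topic arrives as an unbounded DataFrame; each micro-batch of new
// rows is processed as it shows up, hence "near-real-time".
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

val query = kafkaDf
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/structured-sketch")
  .start()

query.awaitTermination()
```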
In the direct streaming approach, data isn't read by a receiver and then dispatched to other workers. Instead, the driver reads the offsets from Kafka and dispatches each partition, along with the subset of offsets to be read, to the executors.
If your workers haven't finished processing the previous job, they won't start processing the next one (unless you explicitly set spark.streaming.concurrentJobs to more than 1). This means the offsets will be read but not actually dispatched to the executors responsible for reading the data, so there won't be any data loss whatsoever.
What this does mean is that your job will fall further and further behind and accumulate massive processing delays, which isn't something you want. As a rule of thumb, any Spark job's processing time should be less than the interval set for that job.
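For reference, a hedged sketch of the direct approach described above (the topic name, group id, and broker address are placeholders): the driver computes an offset range per Kafka partition for each batch, and the executors read those ranges themselves, so unread offsets simply wait in Kafka while a slow batch finishes.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.kafka.common.serialization.StringDeserializer

val conf = new SparkConf()
  .setAppName("direct-stream-sketch")
  // Default is 1: batches queue up instead of running concurrently
  .set("spark.streaming.concurrentJobs", "1")

val ssc = new StreamingContext(conf, Minutes(1))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "spark-direct-sketch",
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// No receiver: the driver plans offset ranges, the executors pull them directly
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
)

stream.map(record => (record.key, record.value)).print()

ssc.start()
ssc.awaitTermination()
```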