Is it possible to enforce in-order processing in Spark Streaming? Our use case is reading events from Kafka, where each topic needs to be processed in order.
From what I can tell it's impossible: each stream is broken into RDDs, and RDDs are processed in parallel, so there is no way to guarantee order.
Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
It provides development APIs in Java, Scala, Python and R, and supports code reuse across multiple workloads—batch processing, interactive queries, real-time analytics, machine learning, and graph processing.
Architecture of Spark Streaming: Discretized Streams. A traditional continuous-operator system processes streaming data one record at a time. Spark Streaming instead discretizes the data into tiny micro-batches, and its receivers accept data in parallel.
Apache Spark Streaming is a scalable, fault-tolerant stream processing system that natively supports both batch and streaming workloads.
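For illustration, here is a minimal sketch of that micro-batch model; the socket source, host, and port are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch: a DStream cut into 5-second micro-batches.
// The socket source and address are placeholders.
val conf = new SparkConf().setAppName("DStreamSketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))

val lines = ssc.socketTextStream("localhost", 9999)

// Each 5-second batch becomes one RDD; the same transformation
// is applied to every batch by the Spark engine.
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```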
You could force the RDD to be a single partition, which removes any parallelism.
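A sketch of what that could look like, reusing the `lines` DStream from the example above:

```scala
// Hedged sketch: 'lines' is a DStream[String] as in the example above.
// coalesce(1) collapses each micro-batch RDD into a single partition,
// so the records of a batch are processed one after another by one task.
lines
  .transform(rdd => rdd.coalesce(1))
  .foreachRDD(rdd => rdd.foreach(record => println(record)))
```

The trade-off is that every batch then runs on a single core, so throughput drops to whatever one task can handle.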
"Our use case is reading events from Kafka, where each topic needs to be processed in order. "
As per my understanding, each topic forms a separate DStream, so you can process the DStreams one after another.
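A sketch of that, assuming the spark-streaming-kafka-0-10 integration; the broker address, group id, and topic names are placeholders:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",       // placeholder broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "ordered-consumer"      // placeholder group id
)

// One DStream per topic, so each topic's stream can be handled on its own.
val topicA = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("topicA"), kafkaParams))
val topicB = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("topicB"), kafkaParams))
```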
But most likely you mean that you want to process the events from one Kafka topic in order. In that case, you should not depend on the ordering of records within an RDD; rather, tag each record with a timestamp when you first see it (probably far upstream) and use that timestamp to order records later on.
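A minimal sketch of the timestamp approach, assuming each record arrives as "epochMillis,payload" (the field layout is an assumption for illustration):

```scala
// Hedged sketch: 'lines' is a DStream[String] whose records look like
// "epochMillis,payload"; this layout is an assumption for illustration.
case class Event(ts: Long, payload: String)

lines
  .map { line =>
    val Array(ts, payload) = line.split(",", 2)
    Event(ts.toLong, payload)
  }
  // Sort each micro-batch by the upstream timestamp instead of relying
  // on the order of records inside the RDD.
  .transform(rdd => rdd.sortBy(_.ts))
  .foreachRDD { rdd =>
    // Bring the sorted batch to the driver and handle it sequentially.
    rdd.collect().foreach(event => println(event))
  }
```

Note that this only orders records within a single micro-batch; across batches you still rely on the batches arriving in order.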
You have other choices, which are bad :)