
In-order processing in Spark Streaming

Is it possible to enforce in-order processing in Spark Streaming? Our use case is reading events from Kafka, where each topic needs to be processed in order.

From what I can tell it's impossible - each stream is broken into RDDs, and RDDs are processed in parallel, so there is no way to guarantee order.

asked Jun 04 '15 by EugeneMi



2 Answers

You could force the RDD to be a single partition, which removes any parallelism.
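In Spark itself this would be something like `rdd.repartition(1)` (or reading from a single-partition Kafka topic). As a toy illustration in plain Python (not the actual Spark API), the point is that order is only guaranteed *within* a partition, so one partition means one global order:

```python
# Toy model: a "partition" is a list processed sequentially by one task.
events = list(range(8))

# With several partitions, Spark gives no ordering guarantee *across* them:
# tasks may finish in any order, so results can interleave arbitrarily.
partitions = [events[0::2], events[1::2]]

# With a single partition, processing order is exactly arrival order.
single_partition = [events]
processed = [e for part in single_partition for e in part]
assert processed == events
```

The obvious cost is that a single partition removes all parallelism, so throughput is limited to one task.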

answered Oct 17 '22 by Holden


"Our use case is reading events from Kafka, where each topic needs to be processed in order."

As per my understanding, each topic forms a separate DStream, so you can process each DStream one after another.

But most likely you mean you want to process the events you are getting from one Kafka topic in order. In that case, you should not depend on the ordering of records in an RDD; rather, you should tag each record with a timestamp when you first see it (probably way upstream) and use this timestamp to order later on.
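A minimal sketch of the tagging idea in plain Python (the `Event` type and `ingest_ts` field are hypothetical names, not part of any Spark or Kafka API): records may come out of parallel processing in any order, but sorting on the upstream timestamp recovers arrival order.

```python
from dataclasses import dataclass

@dataclass
class Event:
    payload: str
    ingest_ts: float  # hypothetical: timestamp stamped when first seen upstream

# After parallel processing, a batch may arrive out of order...
batch = [Event("c", 3.0), Event("a", 1.0), Event("b", 2.0)]

# ...but sorting on the upstream timestamp restores the original order.
ordered = sorted(batch, key=lambda e: e.ingest_ts)
assert [e.payload for e in ordered] == ["a", "b", "c"]
```

This only works if the timestamp is assigned before any reordering can happen, i.e. as close to the source as possible.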

You have other choices, which are bad :)

  1. As Holden suggests, put everything in one partition
  2. Partition by some increasing function of receive time, so partitions fill up one after another. Then you can use zipWithIndex reliably.
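Spark's `RDD.zipWithIndex` assigns indices by partition index first, then by position within each partition. A toy emulation of that rule in plain Python shows why option 2 works: if partitions were filled in receive-time order, the resulting index matches global arrival order.

```python
# Toy emulation of RDD.zipWithIndex: indices run partition by partition
# (in partition order), then by position within each partition.
def zip_with_index(partitions):
    out, i = [], 0
    for part in partitions:
        for record in part:
            out.append((record, i))
            i += 1
    return out

# Partitions filled one after another in receive-time order:
partitions = [["a", "b"], ["c"], ["d", "e"]]
assert zip_with_index(partitions) == [
    ("a", 0), ("b", 1), ("c", 2), ("d", 3), ("e", 4)
]
```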
answered Oct 17 '22 by ayan guha