Is it possible to enforce in-order processing in Spark Streaming? Our use case is reading events from Kafka, where each topic needs to be processed in order.
From what I can tell it's impossible: each stream is broken into RDDs, and RDDs are processed in parallel, so there is no way to guarantee order.
Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
It provides development APIs in Java, Scala, Python and R, and supports code reuse across multiple workloads—batch processing, interactive queries, real-time analytics, machine learning, and graph processing.
Architecture of Spark Streaming: Discretized Streams. A traditional continuous-operator system processes streaming data one record at a time. Spark Streaming instead discretizes the data into tiny micro-batches, and its receivers accept data in parallel.
Apache Spark Streaming is a scalable, fault-tolerant stream processing system that natively supports both batch and streaming workloads.
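For illustration, here is a minimal sketch of that micro-batch model; the socket source, host, and port are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch: a DStream cut into 5-second micro-batches.
// The socket source and address are placeholders.
val conf = new SparkConf().setAppName("DStreamSketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))

val lines = ssc.socketTextStream("localhost", 9999)

// Each 5-second batch becomes one RDD; the same transformation
// is applied to every batch by the Spark engine.
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```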
You could force the RDD to be a single partition, which removes any parallelism.
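A sketch of what that could look like, reusing the `lines` DStream from the example above:

```scala
// Hedged sketch: 'lines' is a DStream[String] as in the example above.
// coalesce(1) collapses each micro-batch RDD into a single partition,
// so the records of a batch are processed one after another by one task.
lines
  .transform(rdd => rdd.coalesce(1))
  .foreachRDD(rdd => rdd.foreach(record => println(record)))
```

The trade-off is that every batch then runs on a single core, so throughput drops to whatever one task can handle.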
"Our use case is reading events from Kafka, where each topic needs to be processed in order. "
As per my understanding, each topic forms a separate DStream, so you can process the DStreams one after another.
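A sketch of that, assuming the spark-streaming-kafka-0-10 integration; the broker address, group id, and topic names are placeholders:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",       // placeholder broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "ordered-consumer"      // placeholder group id
)

// One DStream per topic, so each topic's stream can be handled on its own.
val topicA = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("topicA"), kafkaParams))
val topicB = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("topicB"), kafkaParams))
```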
But most likely you mean that you want to process the events from one Kafka topic in order. In that case, you should not depend on the ordering of records within an RDD; rather, tag each record with a timestamp when you first see it (probably far upstream) and use that timestamp to order records later on.
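A minimal sketch of the timestamp approach, assuming each record arrives as "epochMillis,payload" (the field layout is an assumption for illustration):

```scala
// Hedged sketch: 'lines' is a DStream[String] whose records look like
// "epochMillis,payload"; this layout is an assumption for illustration.
case class Event(ts: Long, payload: String)

lines
  .map { line =>
    val Array(ts, payload) = line.split(",", 2)
    Event(ts.toLong, payload)
  }
  // Sort each micro-batch by the upstream timestamp instead of relying
  // on the order of records inside the RDD.
  .transform(rdd => rdd.sortBy(_.ts))
  .foreachRDD { rdd =>
    // Bring the sorted batch to the driver and handle it sequentially.
    rdd.collect().foreach(event => println(event))
  }
```

Note that this only orders records within a single micro-batch; across batches you still rely on the batches arriving in order.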
You have other choices, which are bad :)