set spark.streaming.kafka.maxRatePerPartition for createDirectStream

Tags: apache-spark, spark-streaming

I need to increase the input rate per partition for my application, and I have used .set("spark.streaming.kafka.maxRatePerPartition",100) in the config. The stream duration is 10s, so I expect to process 5*100*10=5000 messages for this batch. However, the input rate I get is only about 500. Can you suggest any modifications to increase this rate?

767

asked Dec 07 '16 16:12

innovatism

1 Answer

"The stream duration is 10s, so I expect to process 5*100*10=5000 messages for this batch."

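That's not what the setting means, at least according to the blog post quoted below. It means "how many elements each partition can have per batch", not per second (though see the edit at the end, which shows that the code scales the limit by the batch duration). I'm going to assume you have 5 partitions, so you're getting 5 * 100 = 500. Under this reading, if you want 5000, set maxRatePerPartition to 1000.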

From "Exactly-once Spark Streaming From Apache Kafka" (written by Cody, the author of the Direct Stream approach; emphasis mine):

For rate limiting, you can use the Spark configuration variable spark.streaming.kafka.maxRatePerPartition to set the maximum number of messages per partition per batch.

Edit:

After @avrs comment, I looked inside the code which defines the max rate. As it turns out, the heuristic is a bit more complex than stated in both the blog post and the docs.

There are two branches. If backpressure is enabled alongside maxRate, the effective rate is the minimum of the current backpressure rate calculated by the RateEstimator object and the maxRate set by the user. If backpressure isn't enabled, the user-defined maxRate is taken as is.
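The two branches can be sketched as a small standalone function. This is only an illustration of the selection logic described above, not Spark's own code: the object name, the function name, and the use of an Option to model "backpressure enabled" are assumptions made for the example.

```scala
// Illustrative sketch only; the names here are hypothetical, not Spark's API.
object RateSelection {
  // maxRate: the user-configured spark.streaming.kafka.maxRatePerPartition.
  // backpressureRate: Some(rate) when backpressure is enabled and has an estimate,
  // None when backpressure is disabled.
  def effectiveRate(maxRate: Long, backpressureRate: Option[Long]): Long =
    backpressureRate match {
      case Some(estimated) => math.min(estimated, maxRate) // backpressure branch
      case None            => maxRate                      // maxRate taken as is
    }
}
```

With these definitions, RateSelection.effectiveRate(100L, Some(40L)) yields 40, while RateSelection.effectiveRate(100L, None) yields 100.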

Now, after selecting the rate, it always multiplies it by the batch duration in seconds, which effectively makes the setting a rate per second:

if (effectiveRateLimitPerPartition.values.sum > 0) {
  val secsPerBatch = context.graph.batchDuration.milliseconds.toDouble / 1000
  Some(effectiveRateLimitPerPartition.map {
    case (tp, limit) => tp -> (secsPerBatch * limit).toLong
  })
} else {
  None
}
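To check the arithmetic against the numbers in the question, here is a minimal standalone sketch that repeats the same calculation with plain values instead of Spark objects. The object name, the partition labels, and the assumption of 5 partitions are hypothetical; only the multiplication by seconds per batch is taken from the snippet above.

```scala
// Standalone sketch: plain Scala, no Spark dependency. Names are hypothetical.
object BatchLimit {
  // ratePerPartition: effective rate limit for each partition (messages per second).
  // batchDurationMs: batch duration in milliseconds.
  def maxMessagesPerPartition(
      ratePerPartition: Map[String, Long],
      batchDurationMs: Long): Option[Map[String, Long]] =
    if (ratePerPartition.values.sum > 0) {
      val secsPerBatch = batchDurationMs.toDouble / 1000
      Some(ratePerPartition.map {
        case (tp, limit) => tp -> (secsPerBatch * limit).toLong
      })
    } else {
      None // no positive limit configured, so no cap is applied
    }

  def main(args: Array[String]): Unit = {
    // Assumed setup from the question: 5 partitions, limit 100, 10s batches.
    val rates = (0 until 5).map(i => s"topic-$i" -> 100L).toMap
    val limits = maxMessagesPerPartition(rates, 10000L)
    println(limits.map(_.values.sum)) // prints Some(5000)
  }
}
```

Each partition is capped at 10 * 100 = 1000 messages per batch, so 5 partitions give an upper bound of 5000 per batch. This is a cap, not a guarantee: a batch can still contain fewer messages.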

132

answered Sep 21 '22 23:09

Yuval Itzchakov
