Spark Streaming mapWithState seems to rebuild complete state periodically

I am working on a Scala (2.11) / Spark (1.6.1) streaming project and am using mapWithState() to keep track of data seen in previous batches.

The state is distributed across 20 partitions on multiple nodes, created with StateSpec.function(trackStateFunc _).numPartitions(20). In this state we have only a few keys (~100) mapped to Sets with up to ~160,000 entries, which grow throughout the application. The entire state is up to 3 GB, which can be handled by each node in the cluster. In each batch, some data is added to the state but not deleted until the very end of the process, i.e. after ~15 minutes.
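For context, here is a minimal sketch of the setup described above. The key/value types and the body of trackStateFunc are assumptions (the question only shows the StateSpec call), and keyValueStream stands in for the real input DStream:

import org.apache.spark.streaming.{State, StateSpec, Time}

// Assumed shape: String keys, String values, a Set[String] per key as state.
def trackStateFunc(batchTime: Time, key: String, value: Option[String],
                   state: State[Set[String]]): Option[(String, Set[String])] = {
  // Add the new value to the per-key Set; the Set only grows across batches.
  val updated = state.getOption.getOrElse(Set.empty[String]) ++ value
  state.update(updated)
  Some((key, updated))
}

// keyValueStream: DStream[(String, String)] is assumed to already exist.
val stateStream = keyValueStream.mapWithState(
  StateSpec.function(trackStateFunc _).numPartitions(20))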

While watching the application in the Spark UI, I noticed that every 10th batch's processing time is very high compared to the other batches. See the images below:

The spikes show the higher processing time.

The yellow fields represent the high processing time.


A more detailed job view shows that in these batches the spike occurs at a certain point, exactly when all 20 partitions are "skipped", or at least that is what the UI says.


My understanding of "skipped" is that each state partition is one possible task which isn't executed, as it doesn't need to be recomputed. However, I don't understand why the number of skips varies in each job, or why the last job requires so much processing. The higher processing time occurs regardless of the state's size; the size only affects the duration.

Is this a bug in the mapWithState() functionality or is this intended behaviour? Does the underlying data structure require some kind of reshuffling, or does the Set in the state need to copy data? Or is it more likely a flaw in my application?

asked Mar 16 '16 by Lawrence Benson



1 Answer

Is this a bug in the mapWithState() functionality or is this intended behaviour?

This is intended behavior. The spikes you're seeing are caused by your data being checkpointed at the end of those batches. If you look at the times of the longer batches, you'll see that they happen consistently every 100 seconds. That's because the checkpoint interval is constant: it is calculated from your batchDuration (how often you talk to your data source to read a batch) multiplied by a constant, unless you explicitly set the DStream.checkpoint interval.

Here is the relevant piece of code from MapWithStateDStream:

override def initialize(time: Time): Unit = {
  if (checkpointDuration == null) {
    checkpointDuration = slideDuration * DEFAULT_CHECKPOINT_DURATION_MULTIPLIER
  }
  super.initialize(time)
}

Where DEFAULT_CHECKPOINT_DURATION_MULTIPLIER is:

private[streaming] object InternalMapWithStateDStream {
  private val DEFAULT_CHECKPOINT_DURATION_MULTIPLIER = 10
}

This lines up exactly with the behavior you're seeing: your batch duration is 10 seconds, so 10 * 10 = 100 seconds between checkpoints.
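If that default doesn't suit your workload, the interval can be set explicitly on the stateful stream, as mentioned above. A minimal sketch, assuming stateStream is the DStream returned by mapWithState and that 50 seconds is just an illustrative value:

import org.apache.spark.streaming.Seconds

// Checkpoint every 50 seconds instead of the default 10 * batchDuration.
stateStream.checkpoint(Seconds(50))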

This is normal, and it is the cost of persisting state with Spark. An optimization on your side could be to think about how to minimize the size of the state you keep in memory, so that this serialization is as quick as possible. Additionally, make sure the data is spread across enough executors so that the state is distributed uniformly between all nodes. Also, I hope you've turned on Kryo serialization instead of the default Java serialization; that can give you a meaningful performance boost.
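For reference, a minimal sketch of enabling Kryo (the app name and the registered class are placeholders, not taken from the question):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("stateful-streaming-app")  // placeholder name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering the classes held in the state avoids writing full class
  // names into every serialized record.
  .registerKryoClasses(Array(classOf[scala.collection.immutable.HashSet[String]]))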

answered by Yuval Itzchakov