I am working on a Scala (2.11) / Spark (1.6.1) streaming project and using mapWithState()
to keep track of seen data from previous batches.
The state is split into 20 partitions, created with StateSpec.function(trackStateFunc _).numPartitions(20). I had hoped to distribute the state throughout the cluster, but it seems that each node holds the complete state and execution is always performed on exactly one node.
Locality Level Summary: Node local: 50
is shown in the UI for each batch, and the complete batch appears as shuffle read. Afterwards, I write to Kafka and the partitions are spread across the cluster again. I can't seem to find out why mapWithState() needs to be run on a single node. Doesn't this defeat the purpose of partitioning the state if it is limited to one node instead of the whole cluster? Shouldn't it be possible to distribute the state by key?
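For reference, a minimal sketch of the kind of setup described above (Spark 1.6 Streaming API). Only the StateSpec.function(trackStateFunc _).numPartitions(20) call comes from the question; the socket source, the key/value types, the running-total logic, and the checkpoint path are placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

// Runs per key and batch; keeps a running total per key (placeholder logic).
def trackStateFunc(key: String, value: Option[Int], state: State[Long]): Option[(String, Long)] = {
  val sum = value.getOrElse(0).toLong + state.getOption.getOrElse(0L)
  state.update(sum)
  Some((key, sum))
}

val ssc = new StreamingContext(new SparkConf().setAppName("StatefulStream"), Seconds(10))
ssc.checkpoint("/tmp/checkpoint")            // mapWithState requires checkpointing

val pairs = ssc.socketTextStream("localhost", 9999).map(word => (word, 1))   // placeholder source

// The state is keyed by the pair's key and split into 20 partitions.
val stateStream = pairs.mapWithState(StateSpec.function(trackStateFunc _).numPartitions(20))

stateStream.print()
ssc.start()
ssc.awaitTermination()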
Spark RDDs trigger a shuffle for several operations, such as repartition(), groupByKey(), reduceByKey(), cogroup(), and join(), but not countByKey().
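A tiny illustration of that list with made-up local data; the comments note which operations repartition data by key.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ShuffleDemo").setMaster("local[2]"))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val repartitioned = pairs.repartition(10)        // explicit shuffle
val grouped       = pairs.groupByKey()           // shuffles all values per key
val reduced       = pairs.reduceByKey(_ + _)     // shuffles, but combines map-side first
val joined        = reduced.join(grouped)        // shuffles to co-locate matching keys
val counts        = pairs.countByKey()           // action: returns a Map to the driver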
The mapWithState function takes three parameters: the key (any type), the new value (wrapped in an Option), and the state (a State object). Each of them matters for the state lifecycle: if the new value is None, the state for that key has expired (provided a timeout was specified).
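A hedged sketch of such a function, extending the trackStateFunc above with the timeout branch; the concrete types and running-total logic are assumptions for illustration.

import org.apache.spark.streaming.State

def trackStateFunc(key: String, value: Option[Int], state: State[Long]): Option[(String, Long)] = {
  if (state.isTimingOut()) {
    // value is None here: the key expired because the StateSpec timeout elapsed,
    // and the state may no longer be updated.
    None
  } else {
    val sum = value.getOrElse(0).toLong + state.getOption.getOrElse(0L)
    state.update(sum)                  // keep the running total for this key
    Some((key, sum))
  }
}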
Skew happens when one value dominates the partitioning key (for example, null). All rows with the same partitioning key value must be processed by the same worker node, so if 70% of the rows have a null partitioning key, one node will receive at least 70% of the rows.
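A rough, self-contained illustration of that skew with Spark's HashPartitioner (the key values are made up): every row carrying the dominant key is routed to one and the same partition.

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(20)
val keys = Seq("user-1", "user-2", null, null, null)   // null-heavy key column (made up)

keys.foreach { k =>
  // HashPartitioner sends null (or any other single dominant value) to one fixed partition
  println(s"key=$k -> partition ${partitioner.getPartition(k)}")
}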
By default, Spark creates one partition for each block of the file (blocks are 128 MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value (for example, the minPartitions argument of sc.textFile). Note that you cannot have fewer partitions than blocks.
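For example (the HDFS path below is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("PartitionDemo").setMaster("local[2]"))

val byBlock = sc.textFile("hdfs:///data/events.log")        // roughly one partition per 128 MB block
val wider   = sc.textFile("hdfs:///data/events.log", 200)   // ask for at least 200 partitions

println(s"default: ${byBlock.partitions.length}, requested: ${wider.partitions.length}")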
"I can't seem to find out why mapWithState needs to be run on a single node"
It doesn't. By default, Spark uses a HashPartitioner to partition your keys among the different worker nodes in your cluster. If for some reason you're seeing all your data end up on a single node, check the distribution of your keys. If you're using a custom object as the key, make sure its hashCode method is implemented properly; a skewed or broken key distribution produces exactly this behavior. If you'd like to test this, try using random numbers as your keys, then look at the Spark UI and see whether the behavior changes.
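A sketch of both checks suggested here; the key classes and names below are made-up examples, not code from the question.

import scala.util.Random

// A case class derives equals/hashCode structurally, so HashPartitioner can
// spread distinct keys across partitions.
case class DeviceKey(tenant: String, deviceId: Long)

// A hand-written key with a broken (constant) hashCode sends every record to
// the same partition -- the symptom described in the question.
class BadKey(val id: Long) {
  override def hashCode(): Int = 1
  override def equals(other: Any): Boolean = other match {
    case b: BadKey => b.id == id
    case _         => false
  }
}

// Diagnostic: temporarily replace the real key with a random one and watch the
// locality / shuffle-read picture in the Spark UI change.
def randomizeKey[V](pair: (DeviceKey, V)): (Long, V) = (Random.nextLong(), pair._2)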
I'm running mapWithState myself and the data coming in is partitioned based on the key, as I also have a reduceByKey call prior to holding the state; when looking at the Storage tab in the Spark UI, I can see the different RDDs being stored on different worker nodes in the cluster.
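One way (sketched, with hypothetical type parameters) to double-check that the state really is spread by key is to inspect the partitioner and per-partition sizes of the state snapshots produced by mapWithState.

import org.apache.spark.streaming.dstream.MapWithStateDStream

// `stateStream` stands in for the result of mapWithState from the earlier sketch.
val stateStream: MapWithStateDStream[String, Int, Long, (String, Long)] = ???

stateStream.stateSnapshots().foreachRDD { rdd =>
  println(s"partitioner = ${rdd.partitioner}, partitions = ${rdd.partitions.length}")
  rdd.mapPartitionsWithIndex { (i, it) => Iterator((i, it.size)) }
     .collect()
     .foreach { case (i, n) => println(s"partition $i holds $n state entries") }
}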