Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Will storm task state be transferred to new executor after rebalance?

Tags:

apache-storm

This is a question I came up with after reading: What is the "task" in Storm parallelism

If I need to keep some information in the bolt's internal state, for example, in the classical word counting use case, keeping the count of each word seen in the bolt in a hashmap. After executing "rebalance" command, the task of the bolt many be moved to another executor, which may be in another JVM or even another machine. Will the bolt's internal state (the word count hashmap in this example) be transferred to the new environment (instance/JVM/machine)?

Of course putting the word count hashmap in a central place such as Zookeeper won't have this problem. But for performance sake, it seem we need to keep things in memory sometimes.

like image 943
user3513268 Avatar asked Apr 09 '14 01:04

user3513268


People also ask

What is Storm rebalance?

A nifty feature of Storm is that you can increase or decrease the number of worker processes and/or executors without being required to restart the cluster or the topology. The act of doing so is called rebalancing.

How does Apache Storm work?

Apache Storm – released by Twitter, is a distributed open-source framework that helps in the real-time processing of data. Apache Storm works for real-time data just as Hadoop works for batch processing of data (Batch processing is the opposite of real-time.

Why do we need Apache Storm?

Why use Apache Storm? Apache Storm is a free and open source distributed realtime computation system. Apache Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.

What is bolt in Apache Storm?

Bolt represents a node in the topology having the smallest processing logic and the output of a bolt can be emitted into another bolt as input. Storm keeps the topology always running, until you kill the topology. Apache Storm's main job is to run the topology and will run any number of topology at a given time.


1 Answers

Once you run a rebalance the following will happen

  1. It will first deactivate the current topology
  2. It will then distribute the workers evenly within the cluster
  3. The topology will then return to its previous state of activation

Here is a comment by Nathan Marz which should help clear your doubts.

Rebalance is equivalent to those workers being killed and being created from scratch on another machine. If you want "state" to be maintained, I suggest you use something like Trident and keep your state synced on a DFS

like image 199
user2720864 Avatar answered Oct 26 '22 12:10

user2720864