Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the "task" in Storm parallelism

I'm trying to learn twitter storm by following the great article "Understanding the parallelism of a Storm topology"

However I'm a bit confused by the concept of "task". Is a task an running instance of the component(spout or bolt) ? A executor having multiple tasks actually is saying the same component is executed for multiple times by the executor, am I correct ?

Moreover in a general parallelism sense, Storm will spawn a dedicated thread(executor) for a spout or bolt, but what is contributed to the parallelism by an executor(thread) having multiple tasks ? I think having multiple tasks in a thread, since a thread executes sequentially, only make the thread a kind of "cached" resource, which avoids spawning new thread for next task run. Am I correct?

I may clear those confusion by myself after taking more time to investigate, but you know, we both love stackoverflow ;-)

Thanks in advance.

like image 365
John Wang Avatar asked Jun 23 '13 03:06

John Wang


People also ask

How parallelism is achieved in Storm?

Storm offers this and additional finer-grained ways to increase the parallelism of a Storm topology: Increase the number of workers. Increase the number of executors. Increase the number of tasks.

What is Storm topology?

The Storm topology is basically a Thrift structure. TopologyBuilder class provides simple and easy methods to create complex topologies. The TopologyBuilder class has methods to set spout (setSpout) and to set bolt (setBolt). Finally, TopologyBuilder has createTopology to create topology.

Are worker nodes which process the tasks in one or numerous threads in JVM?

Worker Processes are JVM processes. They run one or multiple threads called Executors. Each Executor Thread runs one or multiple Tasks, meaning one or multiple instances of a Bolt or Spout, but those have to be of the same Bolt.

Which of the following method of Storm topology is called when a spout is going to shutdown?

close − This method is called when a spout is going to shutdown. declareOutputFields − Declares the output schema of the tuple. fail − Specifies that a specific tuple is not processed and not to be reprocessed.


1 Answers

Disclaimer: I wrote the article you referenced in your question above.

However I'm a bit confused by the concept of "task". Is a task an running instance of the component(spout or bolt) ? A executor having multiple tasks actually is saying the same component is executed for multiple times by the executor, am I correct ?

Yes, and yes.

Moreover in a general parallelism sense, Storm will spawn a dedicated thread(executor) for a spout or bolt, but what is contributed to the parallelism by an executor(thread) having multiple tasks ?

Running more than one task per executor does not increase the level of parallelism -- an executor always has one thread that it uses for all of its tasks, which means that tasks run serially on an executor.

As I wrote in the article please note that:

  • The number of executor threads can be changed after the topology has been started (see storm rebalance command).
  • The number of tasks of a topology is static.

And by definition there is the invariant of #executors <= #tasks.

So one reason for having 2+ tasks per executor thread is to give you the flexibility to expand/scale up the topology through the storm rebalance command in the future without taking the topology offline. For instance, imagine you start out with a Storm cluster of 15 machines but already know that next week another 10 boxes will be added. Here you could opt for running the topology at the anticipated parallelism level of 25 machines already on the 15 initial boxes (which is of course slower than 25 boxes). Once the additional 10 boxes are integrated you can then storm rebalance the topology to make full use of all 25 boxes without any downtime.

Another reason to run 2+ tasks per executor is for (primarily functional) testing. For instance, if your dev machine or CI server is only powerful enough to run, say, 2 executors alongside all the other stuff running on the machine, you can still run 30 tasks (here: 15 per executor) to see whether code such as your custom Storm grouping is working as expected.

In practice we normally we run 1 task per executor.

PS: Note that Storm will actually spawn a few more threads behind the scenes. For instance, each executor has its own "send thread" that is responsible for handling outgoing tuples. There are also "system-level" background threads for e.g. acking tuples that run alongside "your" threads. IIRC the Storm UI counts those acking threads in addition to "your" threads.

like image 153
Michael G. Noll Avatar answered Oct 19 '22 12:10

Michael G. Noll