Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to tune the parallelism hint in storm

"parallelism hint" is used in storm to parallelise a running storm topology. I know there are concepts like worker process, executor and tasks. Would it make sense to make the parallelism hint as big as possible so that your topologies are parallelised as much as possible?

My question is How to find a perfect parallelism hint number for my storm topologies. Is it depending on the scale of my storm cluster or it's more like a topology/job specific setting, it varies from one topology to another? or it depends on both?

like image 703
Shengjie Avatar asked Dec 04 '13 09:12

Shengjie


3 Answers

Adding to what @Chiron explained

"parallelism hint" is used in storm to parallelise a running storm topology

Actually in storm the term parallelism hint is used to specify the initial number of executor (threads) of a component (spout, bolt) e.g

    topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)

The above statement tells storm to allot 2 executor thread initially (this can be changed in the run time). Again

    topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2).setNumTasks(4) 

the setNumTasks(4) indicate to run 4 associated tasks (this will be same throughout the lifetime of a topology). So in this case each storm will be running two tasks per executor. By default, the number of tasks is set to be the same as the number of executors, i.e. Storm will run one task per thread.

Would it make sense to make the parallelism hint as big as possible so that your topologies are parallelised as much as possible

One key thing to note that if you intent to run more than one tasks per executor it does not increase the level of parallelism. Because executor uses one single thread to process all the tasks i.e tasks run serially on an executor.

enter image description here

The purpose of configuring more than 1 task per executor is it is possible to change the number of executor(thread) using the re-balancing mechanism in the runtime (remember the number of tasks are always the same through out the life cycle of a topology) while the topology is still running.

Increasing the number of workers (responsible for running one or more executors for one or more components) might also gives you a performance benefit, but this also relative as I found from this discussion where nathanmarz says

Having more workers might have better performance, depending on where your bottleneck is. Each worker has a single thread that passes tuples on to the 0mq connections for transfer to other workers, so if you're bottlenecked on CPU and each worker is dealing with lots of tuples, more workers will probably net you better throughput.

So basically there is no definite answer to this, you should try different configuration based on your environment and design.

like image 83
user2720864 Avatar answered Sep 21 '22 20:09

user2720864


A good tip to analyse the need for paralelism in your Storm topology is use the metrics from Storm UI:

The Storm UI has also been made significantly more useful. There are new stats "#executed", "execute latency", and "capacity" tracked for all bolts. The "capacity" metric is very useful and tells you what % of the time in the last 10 minutes the bolt spent executing tuples. If this value is close to 1, then the bolt is "at capacity" and is a bottleneck in your topology. The solution to at-capacity bolts is to increase the parallelism of that bolt. (...)

Source: https://storm.incubator.apache.org/2013/01/11/storm082-released.html

like image 25
Vanessa Schissato Avatar answered Sep 20 '22 20:09

Vanessa Schissato


How to find the perfect parallelism hint number? I would say your best bet is try different numbers to find your suitable configuration. Each topology is different.

For example, your topology might interacting with a REST API, RDBMS, Solr, ElasticSearch or whatever and one of those might be your bottle neck. If you increased the parallelism hint, you might bring one of them to its knees and start throwing exceptions or whatever.

Your best bet is try different configuration and tune to find your best parallelism hint.

like image 37
Chiron Avatar answered Sep 21 '22 20:09

Chiron