
How to evenly balance processing many simultaneous tasks?

PROBLEM

Our processing service serves UI, API, and internal clients and listens for commands from Kafka. A few API clients might create a lot of generation tasks (one task is N messages) in a short time. With Kafka, we can't control how commands are distributed, because each command goes to a partition that is consumed by exactly one processing instance (aka worker). As a result, UI requests can wait too long while API requests are being processed.

In an ideal implementation, we would handle all tasks evenly, regardless of their size: the capacity of the processing service is distributed among all active tasks. Even if the cluster is heavily loaded, a newly arrived task should be able to start processing almost immediately, or at least before the processing of all other tasks ends.


SOLUTION

Instead, we want an architecture that looks more like the following diagram, with a separate queue per combination of customer and endpoint. This gives us much better isolation, as well as the ability to adjust throughput dynamically on a per-customer basis.

On the side of the producer:

  • a task arrives from a client
  • immediately create a queue for that task
  • send all of the task's messages to this queue
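The producer steps above can be sketched with an in-memory stand-in for the broker (with RabbitMQ this would be `queue_declare` plus `basic_publish`; the `QueueRegistry` name and structure here are illustrative assumptions, not part of any library):

```python
from collections import deque

# In-memory stand-in for a broker: one FIFO queue per task.
class QueueRegistry:
    def __init__(self):
        self.queues = {}  # task_id -> deque of messages

    def create_queue(self, task_id):
        # Idempotent, like a broker's queue declaration: reuse if it exists.
        self.queues.setdefault(task_id, deque())

    def publish(self, task_id, message):
        self.queues[task_id].append(message)

registry = QueueRegistry()

# A task arrives from a client: create its queue, then push all N messages.
def submit_task(task_id, messages):
    registry.create_queue(task_id)
    for m in messages:
        registry.publish(task_id, m)

submit_task("api-task-1", [f"msg-{i}" for i in range(1000)])  # large API task
submit_task("ui-task-1", ["render-page"])                     # small UI task
```

The key point is that the large API task and the small UI task land in separate queues instead of interleaving in one shared partition.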

On the side of the consumer:

  • in one process, constantly refresh the list of queues
  • in the other processes, iterate over this list and consume, for example, one message from each queue
  • scale the consumers

QUESTION

Is there any common solution to such a problem, using RabbitMQ or any other tooling? Historically, we use Kafka on the project, so an approach that keeps Kafka would be great, but we can use any technology for the solution.

asked Jun 04 '20 11:06 by Aleksandr Krivolap


1 Answer

Why not use Spark to execute the messages within the task? What I'm thinking is that each worker creates a Spark context that then parallelizes the messages. The function that is mapped can depend on which Kafka topic the user is consuming. I suspect, however, that your queues might contain tasks with a mixture of message types (UI, API calls, etc.), which would result in a more complex mapping function. If you're not using a standalone cluster and are on YARN or something similar, you can change the queueing method that the Spark master uses.
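As a rough local illustration of this idea, the pattern is `parallelize` the task's messages and `map` a processing function over them. The sketch below swaps Spark for Python's standard-library thread pool, since a real Spark context needs a cluster; `process_message` is a made-up placeholder for whatever the worker actually does per message:

```python
from concurrent.futures import ThreadPoolExecutor

def process_message(message):
    # Placeholder per-message work; in the real system this could
    # dispatch on the message type (UI, API call, etc.).
    return message.upper()

task_messages = ["ui-click", "api-call", "kafka-event"]

# Roughly analogous to sc.parallelize(task_messages)
#                        .map(process_message).collect() in Spark.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_message, task_messages))
```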

answered Nov 17 '22 00:11 by VanBantam