Here is my problem: each night I have to process around 50k background jobs, each taking an average of 60 seconds. These jobs basically call the Facebook, Instagram and Twitter APIs to collect users' posts and save them in my DB. The jobs are processed by Sidekiq.
At first, my setup was:
- :concurrency: 5 in sidekiq.yml
- pool: 5 in my database.yml
- RAILS_MAX_THREADS set to 5 in my web server (Puma) configuration.
My understanding is:
- my web server (rails s) will use at most 5 threads, hence at most 5 connections to my DB, which is OK as the connection pool is set to 5.
- my Sidekiq process will use 5 threads (as the concurrency is set to 5), which is also OK as the connection pool is set to 5.
In order to process more jobs at the same time and reduce the overall time to process all my jobs, I decided to increase the Sidekiq concurrency to 25. In production, I provisioned a Heroku Postgres Standard database with a maximum of 120 connections, to be sure I would be able to use that Sidekiq concurrency.
Thus, now the setup is:
- :concurrency: 25 in sidekiq.yml
- pool: 25 in my database.yml
- RAILS_MAX_THREADS set to 5 in my web server (Puma) configuration.
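For clarity, a sketch of what that setup looks like on disk, assuming the conventional file locations (the Puma lines mirror the default config/puma.rb that recent Rails versions generate):

```yaml
# config/sidekiq.yml -- 25 worker threads in the Sidekiq process
:concurrency: 25
```

```yaml
# config/database.yml -- up to 25 connections available to each process
production:
  pool: 25
```

```ruby
# config/puma.rb -- web process threads, driven by RAILS_MAX_THREADS (5 here)
threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
threads threads_count, threads_count
```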
I can see that 25 Sidekiq workers are busy, but each job takes far longer (sometimes more than 40 minutes instead of 1 minute)!?
Actually, I've run some tests and realized that processing 50 of my jobs with a Sidekiq concurrency of 5, 10 or 25 results in the same duration, as if there were a bottleneck of 5 connections somewhere.
I have checked the Sidekiq documentation and some other posts on SO (sidekiq - Is concurrency > 50 stable?, Scaling sidekiq network archetecture: concurrency vs processes) but I haven't been able to solve my problem.
So I am wondering:
- Is my understanding of the Rails database.yml connection pool and Sidekiq concurrency right?
- What's the correct way to set up those parameters?
Sidekiq handles concurrency by using multiple threads in its process. This way, it can process multiple jobs at once, each thread processing one job at a time. By default, Sidekiq uses 10 threads per process. You can configure it to use more threads, thus increasing concurrency.
Note: Sidekiq needs 2 extra connections by default for housekeeping such as its heartbeat, Sentinel support, etc.
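A minimal sketch of how that is configured, assuming the conventional config/sidekiq.yml location (which Sidekiq loads automatically); the value can also be set per environment:

```yaml
# config/sidekiq.yml -- each worker thread processes one job at a time
# and needs its own database connection while doing so
:concurrency: 10      # default in recent Sidekiq versions
production:
  :concurrency: 25    # per-environment override
```

The value can usually also be passed at startup with sidekiq -c 25. Whichever way it is set, the ActiveRecord pool of that process should be at least as large as the concurrency, otherwise threads end up waiting for a free connection.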
The connection pool is how ActiveRecord manages access to the database. It holds a limited number of database connections; each thread (a Puma request thread or a Sidekiq worker thread) checks a connection out of the pool while it needs the database and returns it afterwards. The pool size is configured per process in database.yml, so the web process and the Sidekiq process each get their own pool.
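As an illustration (the snippet below is the pattern recent Rails versions generate, not taken from the original post), the pool size is usually driven by the same environment variable as the thread count, so each process can size its own pool:

```yaml
# config/database.yml -- one connection per thread; run the web dyno with
# RAILS_MAX_THREADS=5 and the Sidekiq dyno with RAILS_MAX_THREADS=25
production:
  adapter: postgresql
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
```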
Dropping this here in case someone else could use a quick, very general pointer:
Sometimes increasing the number of concurrent workers may not yield the expected results.
For instance, if there's a large discrepancy between the number of tasks and the number of cores, the scheduler will keep switching between your tasks and there isn't much to gain; the jobs will just take about the same time, or a bit more.
Here's a link to a rather interesting read on how job scheduling works: https://en.wikipedia.org/wiki/Scheduling_(computing)#Operating_system_process_scheduler_implementations
There are other aspects to consider as well, such as datastore access: are your workers using the same table(s)? Is it backed by a storage engine that locks the entire table, such as MyISAM? If so, it won't matter whether you have 100 workers running at the same time with enough RAM and cores; they will all be waiting in line for whichever query is running to release the lock on the table they're all working with. This can also happen with engines such as InnoDB, which doesn't lock the entire table on write, but you may have different workers accessing the same rows (InnoDB uses row-level locking), or simply large indexes that don't lock but do slow the table down.
Another issue I've encountered is that Rails (which I'm assuming you're using) can take quite a toll on RAM in some cases, so you might want to look at your memory footprint as well.
My suggestion is to turn on logging and look at the data: where do your workers spend most of their time? Is it on the network layer (unlikely)? Waiting to get access to a core? Reading/writing from your data store? Is your machine swapping?
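A minimal sketch of that kind of instrumentation (CollectPostsJob, fetch_posts and save_posts are hypothetical placeholders, not from the original post):

```ruby
# A hypothetical Sidekiq worker, instrumented to show where the time goes:
# external API calls vs. database writes vs. waiting on the ActiveRecord pool.
require "benchmark"

class CollectPostsJob
  include Sidekiq::Worker

  def perform(user_id)
    api_time = Benchmark.realtime { @posts = fetch_posts(user_id) }  # network
    db_time  = Benchmark.realtime { save_posts(@posts) }             # datastore

    # Pool stats reveal whether threads are queueing for a DB connection.
    pool = ActiveRecord::Base.connection_pool.stat # { size:, busy:, waiting:, ... }

    Sidekiq.logger.info(
      "user=#{user_id} api=#{api_time.round(2)}s db=#{db_time.round(2)}s pool=#{pool}"
    )
  end

  private

  # Placeholder: call the Facebook/Instagram/Twitter APIs here.
  def fetch_posts(user_id)
    []
  end

  # Placeholder: persist the collected posts.
  def save_posts(posts)
  end
end
```

If the waiting count in the pool stats is regularly non-zero, threads are queueing for database connections rather than doing work.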