
How to share a set of data between several processes?

We need to set up a system where multiple processes are working on the same dataset. The idea is to have a set of elements (i.e. no repeated values) that can get pulled by our worker processes (asynchronously). The processes may be distributed on several servers, so we need a distributed solution.

Currently, the pattern we are thinking of is using Redis to hold a set containing the working data. Each process would connect to Redis and pop a value from the set. The random behavior of SPOP is actually a plus for us, since we need randomized access to the elements in the set. The data would be populated from our main PostgreSQL database.
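A minimal sketch of the worker side of this pattern, assuming redis-py; the connection details and the set key name `work:urls` are placeholders:

```python
import redis

# Connection details and the set key name ("work:urls") are placeholders.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def pull_next_url():
    """Pop a random element from the shared set; returns None when the set is empty."""
    # SPOP is atomic, so two workers can never receive the same element.
    return r.spop("work:urls")

if __name__ == "__main__":
    url = pull_next_url()
    if url is None:
        print("no work available")
    else:
        print("got URL to process:", url)
```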

Like I said, we also have a PostgreSQL database available to query, which the processes could hit directly when requesting elements. However, we don't know whether that could become a bottleneck under heavy load. We do expect heavy to very heavy concurrent access (think hundreds or even thousands of processes) on this subsystem.

In case it bears any relevance to this, we are using Python with RQ to handle asynchronous tasks (jobs and workers).
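For context, an RQ job could use the same SPOP pattern. This is only a sketch: the `work:urls` key, connection settings, and file names are assumptions, and the job function has to live in a module the workers can import:

```python
# tasks.py -- a module the RQ worker processes can import
from redis import Redis

redis_conn = Redis(decode_responses=True)  # placeholder connection settings

def handle_one_url():
    """RQ job: pop a random URL from the shared set and process it."""
    url = redis_conn.spop("work:urls")
    if url is not None:
        print("processing", url)
```

```python
# enqueue.py -- queues the job; processes started with `rq worker` will run it
from redis import Redis
from rq import Queue

from tasks import handle_one_url

q = Queue(connection=Redis())
q.enqueue(handle_one_url)
```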

Edit: in terms of size, elements can be expected not to be very large - top size should be around 500-1,000 bytes. They are basically URLs, so unless something strange happens they should be well below that size. The number of elements will depend on the number of concurrent processes, so probably about 10-50K elements would be a good ballpark. Bear in mind that this is more of a staging area of sorts, so the focus should be more on speed than on size.
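For the population side (question 2 below), one rough approach is to bulk-load the URLs from PostgreSQL into the Redis set with a pipelined SADD. This is only a sketch; the table and column names, connection strings, and the `work:urls` key are assumptions:

```python
import psycopg2
import redis

r = redis.Redis(decode_responses=True)  # placeholder connection settings

def populate_work_set(batch_size=5000):
    """Copy pending URLs from PostgreSQL into the shared Redis set."""
    conn = psycopg2.connect("dbname=app user=app")  # placeholder DSN
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT url FROM pending_urls")  # hypothetical table/column
            pipe = r.pipeline()
            for i, (url,) in enumerate(cur, start=1):
                pipe.sadd("work:urls", url)  # SADD deduplicates, so re-running is safe
                if i % batch_size == 0:
                    pipe.execute()  # flush in batches to bound memory and latency
            pipe.execute()
    finally:
        conn.close()
```

At 10-50K elements of roughly 1 KB each, the set tops out at a few tens of megabytes, so memory should not be the constraint for this staging area.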

My questions, in summary, are:

  1. Is a Redis set a good idea for shared access when using multiple processes? Is there any data that will let us know how that solution will scale? If so, can you provide any pointers or advice?

  2. When populating the shared data, what would be a good update strategy?

Thank you very much!

asked Dec 31 '12 by Juan Carlos Coto


1 Answer

Not a full answer, just some thoughts: as was said, Redis keeps your set in memory, so in order to answer question 1 you need to think about, or at least estimate, a worst-case scenario for:

  • how much memory each element of the set requires
  • how many elements constitute a very heavy load

Once you have those estimates, you can do the math and see whether it is feasible to use Redis:

For instance, with elements of 100 bytes and a "very heavy" load of 1,000,000 elements, you will need at least 100 MB of memory just for Redis, which is feasible and even cheap. But if you need 500 bytes per element and your heavy load means 30,000,000 elements, then you need 15 GB of memory, which is still doable but perhaps too expensive compared to using your PostgreSQL database. That leads to the second estimate you need to make:

  • how many requests per second (in total) you will make against your Redis/PostgreSQL server, or how many processes you expect to be making requests and how many requests per second each of them will make.

Having these estimates will help you decide which solution best fits your requirements and budget; the short sketch below illustrates the memory math.
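A back-of-envelope sketch of the memory figures above (raw payload only; Redis adds per-element overhead on top of this):

```python
def estimate_set_memory(bytes_per_element, num_elements):
    """Rough lower bound on Redis memory for the set, ignoring per-element overhead."""
    return bytes_per_element * num_elements

# 100-byte elements, 1,000,000 of them -> ~100 MB: cheap to keep in Redis.
print(estimate_set_memory(100, 1_000_000) / 1e6, "MB")

# 500-byte elements, 30,000,000 of them -> ~15 GB: doable, but maybe too expensive.
print(estimate_set_memory(500, 30_000_000) / 1e9, "GB")
```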

answered by Sergio Ayestarán