With Python's multiprocessing, would it make sense to have a Pool with a bunch of ThreadPools inside it? Say I have something like:
def task(path):
    # i/o bound
    image = load(path)
    # cpu bound, but only takes up 1/10 of the time of the i/o bound stuff
    image = preprocess(image)
    # i/o bound
    save(image, path)
Then I'd want to process a list of paths, path_list. If I use a ThreadPool I still end up hitting a ceiling because of the cpu bound bit. If I use a Pool I spend too much dead time waiting for i/o. So wouldn't it be best to split path_list over multiple processes that each in turn use multiple threads?

A shorter way of restating my example: what if I have a method that should be multithreaded because it's i/o bound, but I also want to make use of many cpu cores? If I use a Pool I use up each core for a single task that is i/o bound. If I use a ThreadPool I only get to use one core.
Yes. Let's say you start with one process and one thread. Because some parts of the code block on IO, the process will utilize less than 100% CPU - so we start adding threads. As long as adding threads keeps increasing task throughput, the CPU is not yet our bottleneck. At some point we might hit 100% CPU utilization in our process: because of the GIL, a pure Python process can utilize at most 100% of a single core. At that point, as far as we know, the CPU might still be our bottleneck, and the only way to gain more CPU time is to create another process (or use subinterpreters, but let's ignore that for now).
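As a quick way to find that saturation point empirically, you could measure throughput while increasing the thread count. Here is a rough sketch of such a measurement - it assumes the task function and path_list from your question, and the sample size and thread counts are arbitrary placeholders:

import time
from multiprocessing.pool import ThreadPool

def measure_throughput(n_threads, paths):
    start = time.perf_counter()
    with ThreadPool(n_threads) as pool:
        pool.map(task, paths)
    return len(paths) / (time.perf_counter() - start)

sample = path_list[:200]  # a small representative sample keeps each run short
best = 0.0
for n in (1, 2, 4, 8, 16, 32):
    rate = measure_throughput(n, sample)
    print(f'{n} threads: {rate:.1f} tasks/s')
    if rate <= best * 1.05:  # less than a 5% gain - we're hitting a bottleneck
        break
    best = rate

Once the gains flatten out, check the CPU utilization of the process: if it sits near 100% of one core, the GIL is the limit and more processes are the next step; if not, the disk probably is.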
In summary, this is a valid approach for increasing throughput of pure-python tasks that both utilize CPU and block on IO. But, it does not mean that it is a good approach in your case. First, your bottleneck might be the disk and not the CPU, in which case you don't need more CPU time, which means you don't need more processes. Second, even if the CPU is the bottleneck, multithreading within multiprocessing is not necessarily the simplest solution, the most performant solution, or the winning solution in other resource utilization metrics such as memory usage.
For example, if simplicity is your top priority, you could get all the CPU time you need just by using processes. This solution is easier to implement, but is heavy in terms of memory usage. Or, if your goal is to achieve maximal performance and minimal memory utilization, then you probably want to replace the threads with an IO loop and use a process pool executor for your CPU-bound tasks. Squeezing maximal performance out of your hardware is not an easy task. Below is a methodology that I feel has served me well.
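To make the second option concrete, here is a minimal sketch of the IO-loop variant. It assumes load, preprocess and save are the synchronous functions from your question, defined at module level so they can be pickled, and it requires Python 3.9+ for asyncio.to_thread; since the standard library has no asynchronous file IO, the blocking reads and writes fall back to asyncio.to_thread:

import asyncio
from concurrent.futures import ProcessPoolExecutor

async def handle(path, cpu_pool, io_limit):
    loop = asyncio.get_running_loop()
    async with io_limit:  # cap the number of in-flight disk operations
        image = await asyncio.to_thread(load, path)  # i/o bound
    # cpu bound - run in another process to escape the GIL
    image = await loop.run_in_executor(cpu_pool, preprocess, image)
    async with io_limit:
        await asyncio.to_thread(save, image, path)  # i/o bound

async def main(path_list):
    io_limit = asyncio.Semaphore(16)  # placeholder - tune to your disk
    with ProcessPoolExecutor(max_workers=1) as cpu_pool:
        await asyncio.gather(*(handle(p, cpu_pool, io_limit) for p in path_list))

# asyncio.run(main(path_list))

A single event loop drives all the IO, and the process pool supplies extra CPU time only for the step that actually needs it, which keeps the memory overhead to a minimum.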
From now on, I'm assuming your goal is to make maximal use of your hardware in order to achieve a maximal throughput of "tasks". In that case, the final solution depends on your hardware, so you'll need to get to know it a little bit better. To try and reach your performance goals, I recommend to:
1. Understand which pieces of hardware are involved.
2. Note how a single task uses each of them.
3. Estimate the maximal task throughput each component allows, and identify the bottleneck.
4. Design the solution around that bottleneck.
In detail:
In this case, there are a few pieces of hardware involved: the disk (or disks) the images are stored on, the RAM, and the CPU cores.
Let's look at one "task" and note how it uses the hardware: loading the image reads from the disk and writes to RAM, preprocessing reads and writes RAM and burns CPU time, and saving reads from RAM and writes back to the disk.
To identify the bottleneck, let us calculate the maximum throughput of tasks that each hardware component can provide, assuming usage of them can be completely parallelized. I like to do that using Python (note that I'm using made-up constants; you'll have to fill in the real numbers for your setup in order to use it):
# ----------- General consts
input_image_size = 20 * 2 ** 20 # 20MB
output_image_size = 15 * 2 ** 20 # 15MB
# ----------- Disk
# If you have multiple disks and disk access is the bottleneck, you could split the images between them
amount_of_disks = 2
disk_read_rate = 3.5 * 2 ** 30 # 3.5GBps, maximum read rate for a good SSD
disk_write_rate = 2.5 * 2 ** 30 # 2.5GBps, maximum write rate for a good SSD
disk_read_throughput = amount_of_disks * disk_read_rate / input_image_size
disk_write_throughput = amount_of_disks * disk_write_rate / output_image_size
# ----------- RAM
ram_bandwidth = 30 * 2 ** 30 # Assuming here similar write and read rates of 30GBps
# assuming you are working in userspace and not using a userspace filesystem,
# data is first read into kernel space, then copied to userspace. So in total,
# two writes and one read.
userspace_ram_bandwidth = ram_bandwidth / 3
ram_read_throughput = userspace_ram_bandwidth / input_image_size
ram_write_throughput = userspace_ram_bandwidth / output_image_size
# ----------- CPU
# We subtract one core, as at least some scheduling and kernel code is going to run
core_amount = 8 - 1
# The measured amount of times a single core can run the preprocess function in a second.
# Assuming that you are not planning to optimize the preprocess function as well.
preprocess_function_rate = 1000
cpu_throughput = core_amount * preprocess_function_rate
# ----------- Conclusions
min_throughput, bottleneck_name = min([(disk_read_throughput, 'Disk read'),
(disk_write_throughput, 'Disk write'),
(ram_read_throughput, 'RAM read'),
(ram_write_throughput, 'RAM write'),
(cpu_throughput, 'CPU')])
cpu_cores_needed = min_throughput / preprocess_function_rate
print(f'Throughput: {min_throughput:.1f} tasks per second\n'
      f'Bottleneck: {bottleneck_name}\n'
      f'Cores needed for preprocess: {cpu_cores_needed:.1f}')
This code outputs:
Throughput: 341.3 tasks per second
Bottleneck: Disk write
Cores needed for preprocess: 0.3
That means:
- The disk writes are the bottleneck here. Since that's the case, it could pay off to use ramfs or a similar solution that avoids using the disk altogether.
- Even if all steps of task are executed in parallel, you won't need to dedicate more than one core for running preprocess. (In Python that means you'll probably need only one process, and threads or asyncio would suffice to achieve concurrency with the other steps.)

This kind of estimation is very hard to get right. It's hard not to forget things in the calculation itself, and hard to achieve good measurements for the constants. For example, there is a big issue with the current calculation: reads and writes are not orthogonal. We assume in our calculation that everything happens in parallel, so constants like disk_read_rate have to account for writes occurring simultaneously with the reads. The RAM rates should probably be decreased by at least 50%.
Similar to what you suggested in your question, my initial design would be along those lines: split path_list between a few worker processes, each of which uses a thread pool to hide the IO latency (see the sketch below).
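A rough, hedged sketch of that hybrid design - task is assumed to be the function from your question, and both pool sizes are placeholders to tune with measurements like the ones above:

from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

THREADS_PER_PROCESS = 8  # placeholder - tune empirically

def worker(paths):
    # Runs inside a worker process; the threads overlap the IO waits.
    with ThreadPool(THREADS_PER_PROCESS) as threads:
        threads.map(task, paths)

def run(path_list, n_processes=2):
    # One round-robin chunk of paths per process.
    chunks = [path_list[i::n_processes] for i in range(n_processes)]
    with Pool(n_processes) as processes:
        processes.map(worker, chunks)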
The actual implementation details will vary according to different technical constraints and overheads you will run into while implementing the solution. Without further details and measurements it is hard to guess what they will be exactly.
Good luck, and be warned that even if you did a good job of estimating the maximal throughput, it might be very hard to get there. Comparing the maximum rate to your speed requirements should give you a good idea of the amount of effort ahead. For example, if the rate you need is 10x lower than the maximum rate, you might be done pretty quickly. But if it is only 2x lower, you might want to consider doubling your hardware and start preparing for some hard work :)