 

Does it make sense to multi-thread within multiprocessing?

With Python's multiprocessing, would it make sense to have a Pool with a bunch of ThreadPools inside it? Say I have something like:

def task(path):
    # I/O bound
    image = load(path)
    # CPU bound, but takes only ~1/10 of the time of the I/O bound parts
    image = preprocess(image)
    # I/O bound
    save(image, path)

Then I'd want to process a list of paths path_list. If I use a ThreadPool I still end up hitting a ceiling because of the CPU-bound bit. If I use a Pool I spend too much dead time waiting on I/O. So wouldn't it be best to split path_list over multiple processes that each in turn use multiple threads?

Another, shorter way of restating my example: what if I have a method that should be multithreaded because it's I/O bound, but I also want to make use of many CPU cores? If I use a Pool, each core is taken up by a single task that is mostly I/O bound. If I use a ThreadPool, I only get to use one core.
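To make this concrete, here's roughly the combination I have in mind - a sketch only, with arbitrary pool sizes, assuming task and path_list are defined as above:

from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

def process_chunk(chunk):
    # Each worker process runs its own thread pool over its share of the paths
    with ThreadPool(8) as thread_pool:
        thread_pool.map(task, chunk)

if __name__ == '__main__':
    n_procs = 4
    chunks = [path_list[i::n_procs] for i in range(n_procs)]
    with Pool(n_procs) as process_pool:
        process_pool.map(process_chunk, chunks)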

Asked Mar 10 '21 by Alexander Soare

1 Answer

Does it make sense

Yes. Let's say you start with one process and one thread. Because some parts of the code block on IO, the process will utilize less than 100% CPU - so we start adding threads. As long as adding threads keeps increasing task throughput, the CPU is not yet our bottleneck. At some point, we might hit 100% CPU utilization in our process: because of the GIL, a pure Python process can utilize at most 100% CPU. At that point, as far as we know, the CPU might still be our bottleneck, and the only way to gain more CPU time is to create another process (or use subinterpreters, but let's ignore that for now).
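You can see that GIL ceiling for yourself with a quick experiment (made-up workload): pure-Python CPU work doesn't speed up with more threads, because only one thread holds the GIL at a time.

import time
from multiprocessing.pool import ThreadPool

def burn(n):
    # Pure-Python CPU work; the GIL lets only one thread run this at a time
    while n:
        n -= 1

start = time.perf_counter()
with ThreadPool(4) as pool:
    pool.map(burn, [10_000_000] * 4)
# Takes roughly as long as running the four calls back to back,
# and the process stays pinned at ~100% of a single core
print(f'{time.perf_counter() - start:.1f}s')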

In summary, this is a valid approach for increasing the throughput of pure-Python tasks that both utilize CPU and block on IO. But it does not mean that it is a good approach in your case. First, your bottleneck might be the disk and not the CPU, in which case you don't need more CPU time, which means you don't need more processes. Second, even if the CPU is the bottleneck, multithreading within multiprocessing is not necessarily the simplest solution, the most performant solution, or the best solution by other resource metrics such as memory usage.

For example, if simplicity is your top priority, you could get all the CPU time you need just by using processes. This solution is easier to implement, but is heavy in terms of memory usage. Or, for example, if your goal is to achieve maximal performance and minimal memory utilization, then you probably want to replace the threads with an IO loop and use a process pool executor for your CPU-bound tasks. Squeezing maximal performance from your hardware is not an easy task. Below is a methodology that I feel has served me well.
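As a minimal sketch of that second option, assuming load, preprocess and save are the functions from the question: plain file IO isn't natively async, so the IO steps here are offloaded to the default thread pool, while the CPU-bound step goes to a process pool.

import asyncio
from concurrent.futures import ProcessPoolExecutor

async def handle(path, cpu_pool):
    loop = asyncio.get_running_loop()
    # IO bound: offload to the default thread pool so the event loop stays free
    image = await loop.run_in_executor(None, load, path)
    # CPU bound: run in a separate process, sidestepping the GIL
    image = await loop.run_in_executor(cpu_pool, preprocess, image)
    # IO bound again
    await loop.run_in_executor(None, save, image, path)

async def main(path_list):
    # One CPU worker here as a placeholder; size it using the estimate below
    with ProcessPoolExecutor(max_workers=1) as cpu_pool:
        await asyncio.gather(*(handle(path, cpu_pool) for path in path_list))

asyncio.run(main(path_list))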

Aiming towards maximal performance

From now on, I'm assuming your goal is to make maximal use of your hardware in order to achieve maximal throughput of "tasks". In that case, the final solution depends on your hardware, so you'll need to get to know it a little better. To try to reach your performance goals, I recommend the following:

  1. Understand your hardware utilization
  2. Identify the bottleneck and estimate the maximal throughput
  3. Design a solution to achieve that throughput
  4. Implement the design, and optimize until you meet your requirements

In detail:

1. Understand your hardware utilization

In this case, there are a few pieces of hardware involved:

  • The RAM
  • The disk
  • The CPU

Let's look at one "task" and note how it uses the hardware:

  1. Disk (read)
  2. RAM (write)
  3. CPU time
  4. RAM (read)
  5. Disk (write)

2. Identify the bottleneck and estimate the maximal throughput

To identify the bottleneck, let us calculate the maximum throughput of tasks that each hardware component can provide, assuming usage of them can be completely parallelized. I like to do that using Python (note that I'm using made-up constants; you'll have to fill in the real data for your setup in order to use it):

# ----------- General consts
input_image_size = 20 * 2 ** 20  # 20MB
output_image_size = 15 * 2 ** 20  # 15MB

# ----------- Disk
# If you have multiple disks and disk access is the bottleneck, you could split the images between them
amount_of_disks = 2
disk_read_rate = 3.5 * 2 ** 30  # 3.5GBps, maximum read rate for a good SSD
disk_write_rate = 2.5 * 2 ** 30  # 2.5GBps, maximum write rate for a good SSD
disk_read_throughput = amount_of_disks * disk_read_rate / input_image_size 
disk_write_throughput = amount_of_disks * disk_write_rate / output_image_size

# ----------- RAM
ram_bandwidth = 30 * 2 ** 30  # Assuming here similar write and read rates of 30GBps
# assuming you are working in userspace and not using a userspace filesystem,
# data is first read into kernel space, then copied to userspace. So in total,
# two writes and one read.
userspace_ram_bandwidth = ram_bandwidth / 3
ram_read_throughput = userspace_ram_bandwidth / input_image_size 
ram_write_throughput = userspace_ram_bandwidth / output_image_size

# ----------- CPU
# We set aside one core, since at least some scheduling and kernel code is going to run
core_amount = 8 - 1
# The measured amount of times a single core can run the preprocess function in a second.
# Assuming that you are not planning to optimize the preprocess function as well.
preprocess_function_rate = 1000
cpu_throughput = core_amount * preprocess_function_rate

# ----------- Conclusions
min_throughput, bottleneck_name = min([(disk_read_throughput, 'Disk read'),
                                       (disk_write_throughput, 'Disk write'),
                                       (ram_read_throughput, 'RAM read'),
                                       (ram_write_throughput, 'RAM write'),
                                       (cpu_throughput, 'CPU')])
cpu_cores_needed = min_throughput / preprocess_function_rate
print(f'Throughput: {min_throughput:.1f} tasks per second\n'
      f'Bottleneck: {bottleneck_name}\n'
      f'Worker amount: {cpu_cores_needed:.1f}')

This code outputs:

Throughput: 341.3 tasks per second
Bottleneck: Disk write
Worker amount: 0.3

That means:

  • The maximum rate we can achieve is around 341.3 tasks per second
  • The disk is the bottleneck. You might be able to increase your performance by, for example:
    • Buying more disks
    • Using ramfs or a similar solution that avoids using the disk altogether
  • In a system where all the steps in task are executed in parallel, you won't need to dedicate more than one core for running preprocess. (In python that means you'll probably need only one process, and threads or asyncio would suffice to achieve concurrency with other steps)

Note: the numbers are lying

This kind of estimation is very hard to get right. It's hard not to forget things in the calculation itself, and hard to achieve good measurements for the constants. For example, there is a big issue with the current calculation - reads and writes are not orthogonal. We assume in our calculation that everything is happening in parallel, so constants like disk_read_rate have to account for writes occurring simultaneously to the reads. The RAM rates should probably be decreased by at least 50%.
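For example, a crude way to fold that into the script above (the 50% factor is my guess, not a measurement):

# Reads and writes contend for the same memory bus, so derate the RAM bandwidth
userspace_ram_bandwidth = (ram_bandwidth / 3) * 0.5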

3. Design a solution to achieve that throughput

Similar to what you proposed in your question, my initial design would be something like the following (a code sketch follows the list):

  • Have a pool of workers load the images and send them on a queue to the next step (we'll need to be reading using multiple cores to use all available memory bandwidth)
  • Have a pool of workers process the images and send the results on a queue (the number of workers should be chosen according to the output of the script above; for the current result, that's 1)
  • Have a pool of workers save the processed images to the disk.
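Here's a rough sketch of that design, assuming load, preprocess and save are the functions from the question and path_list is as defined there; the worker counts and queue sizes are placeholders you'd tune against your hardware:

import multiprocessing as mp

def loader(path_q, image_q):
    # None is the shutdown sentinel for every stage
    for path in iter(path_q.get, None):
        image_q.put((path, load(path)))

def processor(image_q, result_q):
    for path, image in iter(image_q.get, None):
        result_q.put((path, preprocess(image)))

def saver(result_q):
    for path, image in iter(result_q.get, None):
        save(image, path)

def main(path_list, n_loaders=2, n_processors=1, n_savers=2):
    path_q = mp.Queue()
    # Bounded queues keep memory in check if one stage falls behind
    image_q = mp.Queue(maxsize=32)
    result_q = mp.Queue(maxsize=32)

    stages = [
        [mp.Process(target=loader, args=(path_q, image_q)) for _ in range(n_loaders)],
        [mp.Process(target=processor, args=(image_q, result_q)) for _ in range(n_processors)],
        [mp.Process(target=saver, args=(result_q,)) for _ in range(n_savers)],
    ]
    for stage in stages:
        for p in stage:
            p.start()

    for path in path_list:
        path_q.put(path)

    # Shut down stage by stage: one sentinel per consumer of each queue
    for queue, stage in zip((path_q, image_q, result_q), stages):
        for _ in stage:
            queue.put(None)
        for p in stage:
            p.join()

if __name__ == '__main__':
    main(path_list)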

The actual implementation details will vary according to different technical constraints and overheads you will run into while implementing the solution. Without further details and measurements it is hard to guess what they will be exactly.

4. Implement the design, and optimize until you meet your requirements

Good luck, and be warned that even if you did a good job at estimating the maximal throughput, it might be very hard to get there. Comparing the maximum rate to your speed requirements might give you a good idea of the amount of effort needed. For example, if the rate you need is 10x slower than the maximum rate, you might be done pretty quickly. But if it is only 2x slower, you might want to consider doubling your hardware and start preparing for some hard work :)

Answered Oct 26 '22 by kmaork