I'm using Python 3.4.3 and OpenCV 3.0.0 to process (apply various filters to) a very large image (80,000 x 60,000) in memory, and I'd like to use multiple CPU cores to improve performance. After some reading, I arrived at two possible methods:
1) Use Python's multiprocessing module, let each process deal with a slice of the large image, and join the results after processing is done (this should probably be done on a POSIX system?).
2) Since NumPy supports OpenMP and OpenCV uses NumPy, I can just leave the multiprocessing to NumPy?
So my questions are:
Which one is the better solution? (If neither seems reasonable, what would be a possible approach?)
If Option 2 is good, should I build both NumPy and OpenCV with OpenMP? How would I actually make multiprocessing happen? (I couldn't really find useful instructions.)
On the one hand, you can build OpenCV with TBB or OpenMP support, which parallelizes OpenCV's functions internally. On the other hand, you can create multiple threads yourself and call the functions in parallel to get multithreading at the application level.
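As a rough sketch of the application-level approach, the snippet below cuts an image into horizontal tiles and runs a filter on each tile in a thread pool. The names `tile_bounds`, `filter_tiles_threaded` and `double_rows` are made up for illustration, and `double_rows` is a pure-Python stand-in for a real OpenCV call. It assumes the real filter releases the GIL while it runs (OpenCV's native functions generally do); a pure-Python filter like the stand-in will not actually get faster with threads.

```python
from concurrent.futures import ThreadPoolExecutor


def tile_bounds(n_rows, n_tiles):
    """Return (start, stop) row ranges that cover n_rows in n_tiles pieces."""
    base, extra = divmod(n_rows, n_tiles)
    bounds, start = [], 0
    for i in range(n_tiles):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def double_rows(rows):
    """Hypothetical stand-in for a per-tile filter such as cv2.GaussianBlur."""
    return [[2 * v for v in row] for row in rows]


def filter_tiles_threaded(image, n_tiles, filter_fn, max_workers=4):
    """Apply filter_fn to each horizontal tile in a thread pool, keeping order."""
    tiles = [image[a:b] for a, b in tile_bounds(len(image), n_tiles)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        filtered = list(pool.map(filter_fn, tiles))  # map preserves tile order
    # Stitch the filtered tiles back together row by row.
    return [row for tile in filtered for row in tile]


if __name__ == "__main__":
    image = [[r * 10 + c for c in range(4)] for r in range(6)]
    print(filter_tiles_threaded(image, n_tiles=3, filter_fn=double_rows))
```

With a NumPy image the same shape of code applies: slice the array by rows, pass a function that calls the OpenCV filter, and stack the results. Threads share the image in memory, so nothing is copied between workers.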
The multiprocessing package supports spawning processes: the parent starts a new child process that loads and executes a given function. The parent can either keep computing concurrently or wait for the child to terminate, using an API similar to that of the threading module.
Multiprocessing in Python using the Pool class: this is similar to a map-reduce architecture. In essence, it maps the input to different worker processes and collects the output from all of them as a list. The processes in execution are stored in memory and other non-executing processes are stored out of memory.
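The map-and-collect pattern might look roughly like the following. This is a minimal sketch using only the standard library; `split_rows`, `invert_chunk` and `merge_chunks` are hypothetical helpers, and the image is a small nested list rather than a real OpenCV array.

```python
import multiprocessing as mp


def split_rows(rows, n_chunks):
    """Split a list of rows into n_chunks contiguous, nearly equal chunks."""
    base, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        stop = start + base + (1 if i < extra else 0)
        chunks.append(rows[start:stop])
        start = stop
    return chunks


def invert_chunk(chunk):
    """Hypothetical per-chunk filter: invert 8-bit pixel values."""
    return [[255 - v for v in row] for row in chunk]


def merge_chunks(chunks):
    """Collect step: concatenate the per-chunk results in order."""
    return [row for chunk in chunks for row in chunk]


if __name__ == "__main__":
    image = [[(r * 7 + c) % 256 for c in range(8)] for r in range(10)]
    with mp.Pool(processes=4) as pool:
        # map: each chunk is pickled and sent to a worker; the results
        # come back as a list in the same order as the input chunks.
        results = pool.map(invert_chunk, split_rows(image, 4))
    print(merge_chunks(results) == invert_chunk(image))
```

One thing to keep in mind with this pattern is that every chunk and every result is pickled and copied between processes, which can be a significant cost for an image of the size mentioned in the question.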
Conclusion. If you use the OpenCV library, be aware that it spawns multiple threads for image processing internally. Always set the number of threads OpenCV may spawn explicitly, based on how many cores your app is going to use in production; otherwise the app's performance can take a drastic hit.
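One way to apply this is sketched below. `cv2.setNumThreads` is OpenCV's call for capping its internal thread pool; the helper names and the even split of cores across workers are assumptions for illustration, and the import is guarded only so the sketch still runs on a machine without OpenCV.

```python
import os

try:
    import cv2  # optional here so the sketch still runs without OpenCV
except ImportError:
    cv2 = None


def opencv_threads_per_worker(total_cores, num_workers):
    """Split a core budget evenly across workers, with at least 1 thread each."""
    return max(1, total_cores // max(1, num_workers))


def limit_opencv_threads(num_workers, total_cores=None):
    """Cap OpenCV's internal thread pool for the current process."""
    if total_cores is None:
        total_cores = os.cpu_count() or 1
    n_threads = opencv_threads_per_worker(total_cores, num_workers)
    if cv2 is not None:
        cv2.setNumThreads(n_threads)
    return n_threads


if __name__ == "__main__":
    # Call this at the start of each worker process.
    print(limit_opencv_threads(num_workers=4))
```

If you already parallelize across processes, this keeps the total thread count close to the number of cores instead of multiplying the two.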
After reading some SO posts, I've come up with a way to use OpenCV in Python 3 with multiprocessing. I recommend doing this on Linux because, according to this post, forked child processes share memory with their parent as long as the content is not changed (copy-on-write). Here's a minimal example:
import cv2
import multiprocessing as mp
import numpy as np
import psutil

img = cv2.imread('test.tiff', cv2.IMREAD_ANYDEPTH)  # here I'm using an indexed 16-bit tiff as an example.
num_processes = 4
kernel_size = 11
tile_size = img.shape[0] // num_processes  # assuming img.shape[0] is divisible by num_processes
output = mp.Queue()

def mp_filter(x, output):
    print(psutil.virtual_memory())  # monitor memory usage
    # Queue.put takes a single object, so send an (index, tile) tuple.
    output.put((x, cv2.GaussianBlur(img[tile_size * x:tile_size * (x + 1), :],
                                    (kernel_size, kernel_size), kernel_size / 5)))
    # note that you actually have to process a slightly larger block and leave out the border.

if __name__ == '__main__':
    processes = [mp.Process(target=mp_filter, args=(x, output)) for x in range(num_processes)]
    for p in processes:
        p.start()
    result = []
    for ii in range(num_processes):
        result.append(output.get(True))
    for p in processes:
        p.join()
    # Tiles arrive in completion order, so sort by index before stacking.
    result = np.vstack([tile for _, tile in sorted(result, key=lambda r: r[0])])
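The note in the example about processing a slightly larger block can be sketched as follows. This is one possible way to compute the padded row range for each tile, assuming a single filter pass whose kernel reaches `kernel_size // 2` rows past the tile edge; `padded_tile` is a hypothetical helper, not part of OpenCV.

```python
def padded_tile(n_rows, num_tiles, index, kernel_size):
    """Return (read_start, read_stop, crop_start, crop_stop) for one tile.

    read_* are the rows to pass to the filter (tile plus halo);
    crop_* index into the filtered block to recover the tile itself.
    """
    halo = kernel_size // 2  # rows the kernel reaches past the tile edge
    tile = n_rows // num_tiles
    start = tile * index
    # Let the last tile absorb any remainder rows.
    stop = n_rows if index == num_tiles - 1 else start + tile
    read_start = max(0, start - halo)
    read_stop = min(n_rows, stop + halo)
    crop_start = start - read_start
    crop_stop = crop_start + (stop - start)
    return read_start, read_stop, crop_start, crop_stop


if __name__ == "__main__":
    for i in range(4):
        print(i, padded_tile(n_rows=100, num_tiles=4, index=i, kernel_size=11))
```

In the worker you would then filter `img[read_start:read_stop, :]` and keep only rows `crop_start:crop_stop` of the result, so that the seams between tiles match what filtering the whole image would give.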
Instead of using Queue, another way to collect the results from the processes is to create a shared array through the multiprocessing module (this requires importing ctypes):
result = mp.Array(ctypes.c_uint16, img.shape[0]*img.shape[1], lock = False)
Then each process can write to a different portion of the array, assuming there is no overlap. Creating a large mp.Array is surprisingly slow, however, which defeats the purpose of speeding up the operation, so use it only when the added time is small compared with the total computation time. This array can be turned into a NumPy array with:
result_np = np.frombuffer(result, dtype=ctypes.c_uint16)
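To illustrate the "different portions" idea, here is a small standard-library sketch in which each process fills its own rows of a flat shared buffer. `row_range` and `fill_rows` are hypothetical names, and the constant fill value stands in for real filter output.

```python
import ctypes
import multiprocessing as mp


def row_range(n_rows, num_workers, index):
    """Rows [start, stop) owned by worker `index`; the last takes the remainder."""
    tile = n_rows // num_workers
    start = tile * index
    stop = n_rows if index == num_workers - 1 else start + tile
    return start, stop


def fill_rows(shared, n_cols, start, stop, value):
    """Hypothetical worker: write `value` into its own rows of the flat buffer."""
    for r in range(start, stop):
        for c in range(n_cols):
            shared[r * n_cols + c] = value


if __name__ == "__main__":
    n_rows, n_cols, num_workers = 8, 5, 4
    # lock=False gives a raw shared buffer; safe only because rows are disjoint.
    shared = mp.Array(ctypes.c_uint16, n_rows * n_cols, lock=False)
    workers = []
    for i in range(num_workers):
        start, stop = row_range(n_rows, num_workers, i)
        p = mp.Process(target=fill_rows, args=(shared, n_cols, start, stop, i + 1))
        p.start()
        workers.append(p)
    for p in workers:
        p.join()
    values = list(shared)
    print(values[:n_cols], values[-n_cols:])
```

Because no two workers touch the same index, the lock can be skipped; if the row ranges could overlap, the default `lock=True` would be the safer choice.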
I don't know what types of filters you need, but if they're reasonably simple, you could consider libvips. It's an image processing system for very large images (larger than the amount of memory you have). It came out of a series of EU-funded scientific art imaging projects, so the focus is on the types of operation you need for image capture and comparison: convolution, rank, morphology, arithmetic, colour analysis, resampling, histograms, and so on.
It's fast (faster than OpenCV, on some benchmarks at least), needs little memory, and there's a high-level Python binding. It works on Linux, OS X and Windows. It handles all the multiprocessing for you automatically.
This can be done cleanly with Ray, which is a library for parallel and distributed Python. Ray reasons about "tasks" instead of using a fork-join model, which gives some additional flexibility: for example, you can put values in shared memory even after forking worker processes, the same code runs on multiple machines, and you can compose tasks together.
import cv2
import numpy as np
import ray

num_tasks = 4
kernel_size = 11

@ray.remote
def mp_filter(image, i):
    lower = image.shape[0] // num_tasks * i
    upper = image.shape[0] // num_tasks * (i + 1)
    return cv2.GaussianBlur(image[lower:upper, :],
                            (kernel_size, kernel_size), kernel_size // 5)

if __name__ == '__main__':
    ray.init()
    # Load the image and store it once in shared memory.
    image = np.random.normal(size=(1000, 1000))
    image_id = ray.put(image)
    result_ids = [mp_filter.remote(image_id, i) for i in range(num_tasks)]
    results = ray.get(result_ids)
Note that you can store more than just NumPy arrays in shared memory; you also benefit if you have Python objects that contain NumPy arrays (like dictionaries containing NumPy arrays). Under the hood, this uses the Plasma shared-memory object store and the Apache Arrow data layout.
You can read more in the Ray documentation. Note that I'm one of the Ray developers.