Multithreading degrades GPU performance

In my Python application I am using Detectron2 to run prediction on an image and detect the key-points of all the humans in the image.

I want to run the prediction on frames that are streamed to my app live (using aiortc), but I discovered that the prediction time is much worse, because it now runs on a new thread (the main thread is occupied with the server).

Running predictions on a thread takes anywhere between 1.5 and 4 seconds, which is a lot.

When running the predictions on the main thread (without the video streaming part), I get prediction times of less than a second.

My question is why this happens and how can I fix it? Why is the GPU performance degraded so drastically when using it from a new thread?

Notes:

  1. The code is tested in Google Colab with a Tesla P100 GPU, and the video streaming is emulated by reading frames from a video file.

  2. I measure the time it takes to run a prediction on a frame using the CodeTimer context manager shown in the example code below.

I tried switching to multiprocessing instead, but couldn't make it work with CUDA. I tried both import multiprocessing and import torch.multiprocessing with set_start_method('spawn'), but it just gets stuck when calling start() on the process.
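
Roughly what the multiprocessing attempt looked like (a simplified sketch: the queue and worker names are just for illustration, and it reuses the cfg defined in the example code below):

import torch.multiprocessing as mp

def predict_worker(frames_queue, predictions_queue):
    # Build the predictor inside the child process so CUDA is initialized there
    predictor = DefaultPredictor(cfg)
    while True:
        frame = frames_queue.get()
        if frame is None:  # Sentinel value -> stop the worker
            break
        predictions_queue.put(predictor(frame))

if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    frames_queue = mp.Queue()
    predictions_queue = mp.Queue()
    worker = mp.Process(target=predict_worker, args=(frames_queue, predictions_queue))
    worker.start()  # This is where it gets stuck for me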

Example code:

from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg

import threading
import time
import random
from typing import List
import numpy as np
import timeit
import cv2

# Prepare the configuration file
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7  # set threshold for this model
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")

cfg.MODEL.DEVICE = "cuda"
predictor = DefaultPredictor(cfg)


def get_frames(video: cv2.VideoCapture):
    frames = list()
    while True:
        has_frame, frame = video.read()
        if not has_frame:
            break
        frames.append(frame)
    return frames

class CodeTimer:
    # Source: https://stackoverflow.com/a/52749808/9977758
    def __init__(self, name=None):
        self.name = " '" + name + "'" if name else ''

    def __enter__(self):
        self.start = timeit.default_timer()

    def __exit__(self, exc_type, exc_value, traceback):
        self.took = (timeit.default_timer() - self.start) * 1000.0
        print('Code block' + self.name + ' took: ' + str(self.took) + ' ms')

video = cv2.VideoCapture('DemoVideo.mp4')
num_frames = round(video.get(cv2.CAP_PROP_FRAME_COUNT))
frames_buffer = list()
predictions = list()

def send_frames():
    # This function emulates the stream, so here we "get" a frame and add it to our buffer
    for frame in get_frames(video):
        frames_buffer.append(frame)
        # Simulate delays between frames
        time.sleep(random.uniform(0.3, 2.1))

def predict_frames():
    predicted_frames = 0  # The number of frames predicted so far
    while predicted_frames < num_frames:  # Stop after we predicted all frames
        buffer_length = len(frames_buffer)
        if buffer_length <= predicted_frames:
            continue  # Wait until we get a new frame

        # Read all the frames from the point we stopped
        for frame in frames_buffer[predicted_frames:]:
            # Measure the prediction time
            with CodeTimer('In stream prediction'):
                predictions.append(predictor(frame))
            predicted_frames += 1


t1 = threading.Thread(target=send_frames)
t1.start()
t2 = threading.Thread(target=predict_frames)
t2.start()
t1.join()
t2.join()
asked Sep 27 '21 by SagiZiv


2 Answers

The problem lies in your hardware, your libraries, or in the differences between your example code and the real code.

I ran your code on an Nvidia Jetson Xavier. I installed all the needed libraries using the following commands:

# first create your virtual env
virtualenv -p python3 detectron_gpu
source detectron_gpu/bin/activate

#torch for jetson
wget https://nvidia.box.com/shared/static/p57jwntv436lfrd78inwl7iml6p13fzh.whl -O torch-1.8.0-cp36-cp36m-linux_aarch64.whl
sudo apt-get install python3-pip libopenblas-base libopenmpi-dev 
pip3 install Cython
pip3 install numpy torch-1.8.0-cp36-cp36m-linux_aarch64.whl

# torchvision (v0.9.0 assumed here, matching torch 1.8.0)
pip install 'git+https://github.com/pytorch/vision.git@v0.9.0'

# detectron
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'

# ipython bindings (optional)
pip install ipykernel cloudpickle 

# opencv
pip install opencv-python

After that, I ran your example script on an example video and received the following output:

Code block 'In stream prediction' took: 2932.241764000537 ms
Code block 'In stream prediction' took: 409.69691300051636 ms
Code block 'In stream prediction' took: 410.03823099981673 ms
Code block 'In stream prediction' took: 409.4023269999525 ms

After the first pass, the detector consistently takes around 400 ms to run the detection, which seems about right for a Jetson Xavier. I do not experience the slowdown you described.

I have to note that the Jetson is a specific piece of hardware: its RAM is shared between the CPU and the GPU, so no data has to be transferred from CPU memory to GPU memory. If your slowdown is caused by that transfer, I would not see the problem in my setup.
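
If you want to check whether the host-to-device copy is the bottleneck on your P100, here is a rough timing sketch (assuming PyTorch with CUDA is available; the tensor shape is just an example frame size):

import timeit
import torch

frame_tensor = torch.randn(3, 720, 1280)  # dummy tensor roughly the size of one frame

start = timeit.default_timer()
gpu_tensor = frame_tensor.to('cuda')
torch.cuda.synchronize()  # wait for the copy to finish before stopping the clock
print('Host-to-device copy took',
      (timeit.default_timer() - start) * 1000.0, 'ms')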

answered Nov 09 '22 by Thijs Ruigrok

Python threads are subject to the GIL, which must be acquired by any C binding that touches Python objects. GPU computing libraries typically use C bindings, and these can hold the GIL from time to time, pausing Python code execution in other threads.

It is a wild guess, but it is possible that the predictor function, which has to go through C and acquire the GIL, finds itself waiting for the other thread that is writing to the frame buffer. Then, depending on how the computation is broken down and how Python schedules the other thread, the impact on performance may become visible.
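
As a toy illustration of the general effect (plain Python threads contending for the GIL, nothing Detectron2-specific), the same piece of work slows down once another thread keeps the interpreter busy:

import threading
import timeit

def busy():
    # A pure-Python loop that keeps the GIL busy
    x = 0
    for _ in range(10_000_000):
        x += 1

def timed_work():
    start = timeit.default_timer()
    sum(range(1_000_000))
    print('Work took', (timeit.default_timer() - start) * 1000.0, 'ms')

timed_work()                      # baseline, no contention
t = threading.Thread(target=busy)
t.start()
timed_work()                      # same work while busy() competes for the GIL
t.join()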

You may:

  • avoid multi-threading by performing the reading and the prediction in the same thread (see the sketch after this list);
  • use multi-processing so that the GIL does not interfere between the two processes;
  • code this in a native language such as C or C++.
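
For the first option, a minimal single-threaded sketch, reusing the cfg, predictor and CodeTimer from your question:

# Read a frame, then predict on it immediately, so no reader thread
# competes with the predictor for the GIL.
video = cv2.VideoCapture('DemoVideo.mp4')
predictions = []
while True:
    has_frame, frame = video.read()
    if not has_frame:
        break
    with CodeTimer('Single-thread prediction'):
        predictions.append(predictor(frame))
video.release()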
answered Nov 09 '22 by Victor Paléologue