
I/O or CPU bound? How to check if running concurrently?

I’m new to Python and I'm struggling to understand some things in multiprocessing/threading. I want to speed up a function and have been trying different approaches from the multiprocessing module, but I can’t get it to run any faster. It’s possible it won’t run any faster, but I wanted to be sure this is the case before giving up. This isn’t a full description, but the most time-consuming activities are:

- repeatedly generating random data (10,000 rows and 10 columns),

- using a pre-fit model to predict an outcome for each row, and

- comparing each predicted value to an initial value.

It performs this multiple times depending on how many of the predicted values equal the initial value, updating the parameters of the distribution each time. The output of the function is a single numeric value.

I want to loop over several of these initial values and end up with a list of the output values. I was hoping to get multiple iterations to run concurrently (but I’m open to anything that could make it faster). I’ve been ignorantly attempting pool.apply, starmap and Process but haven’t seen a change in time.

My questions are:

  1. Based on the description of what I’m doing, is my program I/O or CPU bound? (Is it possible to tell from that? Is this even the right question to be asking?)

  2. Should I be using multithreading or multiprocessing?

  3. How can I determine if the iterations are running concurrently or not?

vzste asked Oct 27 '25 01:10


1 Answer

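Given you didn't mention anything about drives or files, I'm going to assume it's mostly CPU bound rather than I/O bound (although I/O is still possible): generating random data, predicting, and comparing are all computation. Are your iterations actually being spread across multiple threads/processes yet? If not, that's most likely your issue.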
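One rough way to check the CPU-bound assumption is to compare wall-clock time with CPU time for a single call: if the two are close, the process spent its time computing; if CPU time is far smaller, it was mostly waiting. The sketch below uses only the standard library; `measure`, `busy_sum`, and `wait_a_bit` are made-up names standing in for the real function.

```python
import time


def measure(func, *args):
    """Run func(*args) once; return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


def busy_sum(n):
    """CPU-heavy stand-in (hypothetical): pure arithmetic, no waiting."""
    return sum(i * i for i in range(n))


def wait_a_bit(seconds):
    """I/O-like stand-in (hypothetical): the process sleeps instead of computing."""
    time.sleep(seconds)
    return seconds


if __name__ == "__main__":
    for func, arg in ((busy_sum, 2_000_000), (wait_a_bit, 0.5)):
        _, wall, cpu = measure(func, arg)
        print(f"{func.__name__}: wall={wall:.2f}s cpu={cpu:.2f}s ratio={cpu / wall:.2f}")
```

Swap your own function in for `busy_sum`. Keep in mind that `time.process_time()` only counts CPU time of the current process, and that libraries using several native threads internally can push CPU time above wall time.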

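I'd probably look at Python's multiprocessing module and, because of the loop to create data, its Pool class. For CPU-bound pure-Python code, threads from the threading module won't run in parallel in standard CPython because of the global interpreter lock (GIL), so separate processes are the usual choice. You just need all of your workers running that rand function at the same time.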
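If you are unsure whether threads or processes suit your workload, a quick experiment is to time the same CPU-heavy task under both. This is a minimal sketch using `concurrent.futures` from the standard library; `count_primes` and `timed_map` are hypothetical helpers, and the task size and worker count are arbitrary assumptions.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_primes(limit):
    """CPU-heavy stand-in (hypothetical): count primes below limit."""
    count = 0
    for n in range(2, limit):
        # Trial division up to the square root of n.
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def timed_map(executor_cls, func, inputs, workers=4):
    """Map func over inputs with the given executor; return (results, seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as executor:
        results = list(executor.map(func, inputs))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    inputs = [30_000] * 8
    for cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        _, seconds = timed_map(cls, count_primes, inputs)
        print(f"{cls.__name__}: {seconds:.2f}s")
```

On a multi-core machine with standard CPython, the process pool would be expected to finish noticeably faster for a task like this, while the thread pool takes about as long as a plain loop. If your real work spends most of its time inside a native library that releases the GIL, the picture can be different, so it is worth measuring.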

EDIT: I forgot to mention: if you open Task Manager (Windows) or System Monitor (Linux), you should be able to see the load per CPU core. If only one core is maxed out at any given time, your iterations aren't running in parallel.
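The same check can be done from inside the program by having each task report which process ran it and when. The following is only a sketch: `traced_task` and `ran_in_parallel` are hypothetical names, the workload is a placeholder, and the pool size of 4 is an assumption.

```python
import os
import time
from multiprocessing import Pool


def traced_task(task_id):
    """Do a little CPU work and report which process ran it, and when."""
    start = time.time()  # wall-clock time, comparable across processes
    sum(i * i for i in range(300_000))  # placeholder workload
    return task_id, os.getpid(), start, time.time()


def ran_in_parallel(records):
    """True if two tasks from different processes overlapped in time."""
    for i, (_, pid_a, start_a, end_a) in enumerate(records):
        for _, pid_b, start_b, end_b in records[i + 1:]:
            if pid_a != pid_b and start_a < end_b and start_b < end_a:
                return True
    return False


if __name__ == "__main__":
    with Pool(processes=4) as pool:
        records = pool.map(traced_task, range(16))
    print("distinct worker PIDs:", len({pid for _, pid, _, _ in records}))
    print("overlapping execution:", ran_in_parallel(records))
```

If every record carries the same PID, or no two time intervals overlap, the tasks ran one after another even though a pool was created.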

Example: I wrote a quick example to help with the pool. Your 10,000-row list with 10 columns was not even noticeable on my i7. I increased the columns to 10,000 and it used 4 GB of RAM and probably 30 seconds of 100% CPU @ 3.4 GHz.

from multiprocessing import Pool
import random


def make_row(_row_index):
    """Return one row of 10,000 random integers (the argument is ignored)."""
    return [random.randint(0, 10000) for _ in range(10000)]


if __name__ == '__main__':
    # Each of the 10,000 tasks builds one row in a worker process.
    with Pool() as pool:
        rand_list = pool.map(make_row, range(10000))
    print(len(rand_list))  # number of rows generated
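Applied to the loop over initial values described in the question, the same pattern might look like the sketch below. `run_for_initial_value` is a hypothetical stand-in for the real function: the random draw is only a placeholder for generating rows and calling the pre-fit model, and the value range 0 to 9 is an assumption.

```python
import random
from multiprocessing import Pool


def run_for_initial_value(initial_value):
    """Hypothetical stand-in for the question's function: one number out."""
    rng = random.Random(initial_value)  # per-task RNG, seeded for repeatability
    matches = 0
    for _ in range(10_000):  # one pass over 10,000 generated "rows"
        predicted = rng.randint(0, 9)  # placeholder for the model's prediction
        if predicted == initial_value:
            matches += 1
    return matches / 10_000


if __name__ == "__main__":
    initial_values = list(range(10))
    with Pool() as pool:
        # map() returns results in the same order as initial_values.
        outputs = pool.map(run_for_initial_value, initial_values)
    print(outputs)
```

Each initial value becomes one task, so the speed-up is limited by the number of initial values and the number of cores. If a single call is very short, the cost of starting processes and sending results back can outweigh the gain.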
Eric Fossum answered Oct 28 '25 15:10