I want to use Python multiprocessing to run a grid search for a predictive model. When I look at core usage, it always seems to be using only one core. Any idea what I'm doing wrong?
```python
import itertools
import multiprocessing
from operator import itemgetter

from sklearn import svm

# First read some data:
# X will be my 2D NumPy feature array
# y will be my 1D NumPy array of labels

# Define the grid
C = [0.1, 1]
gamma = [0.0]
params = [C, gamma]
grid = list(itertools.product(*params))
GRID_hx = []

def worker(par, grid_list):
    # Define a sklearn model
    clf = svm.SVC(C=par[0], gamma=par[1], probability=True, random_state=SEED)
    # Run a cross-validation function: returns error
    ll = my_cross_validation_function(X, y, model=clf, n=1, test_size=0.2)
    print(par, ll)
    grid_list.append((par, ll))

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    GRID_hx = manager.list()
    jobs = []
    for g in grid:
        p = multiprocessing.Process(target=worker, args=(g, GRID_hx))
        jobs.append(p)
        p.start()
        p.join()

    print("\n-------------------")
    print("SORTED LIST")
    print("-------------------")
    L = sorted(GRID_hx, key=itemgetter(1))
    for l in L[:5]:
        print(l)
```
Key takeaways: Python is not inherently a single-threaded language, but a Python process typically executes bytecode on only one thread at a time because of the GIL. Despite the GIL, libraries that perform computationally heavy tasks, such as NumPy, SciPy, and PyTorch, use C-based implementations under the hood that can release the GIL, allowing multiple cores to be used.
By default, common research programming languages use only one processor. The "multi" in multiprocessing refers to the multiple cores in a computer's central processing unit (CPU). Computers originally had only one CPU core, the unit that carries out all of our mathematical calculations.
An excellent solution is to use multiprocessing rather than multithreading, so that work is split across separate processes and the operating system manages access to shared resources. This also gets around one of Python's notorious Achilles' heels: the Global Interpreter Lock (aka the GIL).
Your problem is that you join each job immediately after starting it:
```python
for g in grid:
    p = multiprocessing.Process(target=worker, args=(g, GRID_hx))
    jobs.append(p)
    p.start()
    p.join()
```
join blocks until the respective process has finished. This means that your code starts only one process at a time, waits until it is finished, and only then starts the next one.
In order for all processes to run in parallel, you need to first start them all and then join them all:
```python
jobs = []
for g in grid:
    p = multiprocessing.Process(target=worker, args=(g, GRID_hx))
    jobs.append(p)
    p.start()

for j in jobs:
    j.join()
```
Documentation: see the Python multiprocessing docs for Process.start and Process.join.