Python pool not working in windows but works in linux

Tags:

I am trying python multiprocessing using Python 3.7.4 on Windows 10, Intel Core i7-8550U processor.
I am testing the multiprocessing with two functions, one with a basic sleep() and another with using the matthews_corrcoef from sklearn. The multiprocessing works with the sleep function, but not with the sklearn function.

import numpy as np
from sklearn.metrics import matthews_corrcoef
import time
import concurrent.futures
from multiprocessing import Process, Pool
from functools import partial
import warnings
import sys

class Runner():
  def sleeper(self, pred, man, thr = None):
    return time.sleep(2)

  def mcc_score(self, pred, man, thr = None):
    warnings.filterwarnings("ignore")
    return matthews_corrcoef(pred, man)

  def pool(self, func):
    t1 = time.perf_counter()
    p = Pool()
    meth = partial(func, pred, man)
    res = p.map(meth, thres)
    p.close()

    t2 = time.perf_counter()
    print(f'Pool {func.__name__} {round((t2-t1), 3)} seconds')

  def vanilla(self, func):
    t1 = time.perf_counter()
    for t in thres:
      func(pred, man)
    t2 = time.perf_counter()
    print(f'vanilla {func.__name__} {round((t2-t1), 3)} seconds')

if __name__== "__main__":
    print(sys.version)
    r = Runner()
    thres = np.arange(0,1, 0.3)
    print(f"Number of thresholds {len(thres)}")
    pred = [1]*200000
    man = [1]*200000
    results = []

    r.pool(r.mcc_score)
    r.vanilla(r.mcc_score)

    r.pool(r.sleeper)
    r.vanilla(r.sleeper)

In windows, for the mcc_score function, using pool is actually slower than the vanilla version, whereas in Linux it works properly.

Here are the sample outputs

#windows
3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
Number of thresholds 4
Pool mcc_score 3.247 seconds
vanilla mcc_score 1.591 seconds
Pool sleeper 5.828 seconds
vanilla sleeper 8.001 seconds

#linux
3.7.0 (default, Jun 28 2018, 13:15:42) [GCC 7.2.0]
Number of thresholds 34
Pool mcc_score 1.946 seconds
vanilla mcc_score 8.817 seconds

I went through the documentation and other relevant questions in stackoverflow, where it mainly states using if __name__== "__main__":. Some help would be greatly appreciated as I have been stuck on this for quite some time now. If I missed any important information, please mention it, I will provide it.

227

asked Nov 04 '19 20:11

adib.mosharrof

1 Answers

First, I am going to simplify your code. As the methods in your class never make use of class variables, I will skip the class approach and use just methods.

Starting point is the example from the documentation for multiprocessing. To see the benefit from using Pool, I add two seconds of sleep and print a timestamp.

import datetime
from multiprocessing import Pool
import time

def fx(x):
    time.sleep(2)
    print(datetime.datetime.utcnow())
    return x*x

if __name__ == '__main__':
    with Pool() as p:
        print(p.map(fx, range(10)))

The output is as expected

2019-11-10 11:10:05.346985
2019-11-10 11:10:05.363975
2019-11-10 11:10:05.418941
2019-11-10 11:10:05.435931
2019-11-10 11:10:07.347753
2019-11-10 11:10:07.364741
2019-11-10 11:10:07.419707
2019-11-10 11:10:07.436697
2019-11-10 11:10:09.348518
2019-11-10 11:10:09.365508
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

As I did not specify the number of cores, all available cores are used (at my machine 4). This can be seen in the time stamps: 4 timestamps are close to each other. Than the exceution pauses until the cores are freed again.

You want to use a method matthews_corrcoef that takes according the documentation two arguments y_true and y_pred.

Before using that method, let's modify the test method from above to take two arguments:

def fxy(x, y):
    time.sleep(2)
    print(datetime.datetime.utcnow())
    return x*y

From the documentation for multiprocessing.pool.Pool we learn, that map takes only one argument. So I am going to use apply_async instead. As apply_async returns result objects instead the return values of the method, I am using a list to store the results and get the return values in a separate loop like this:

if __name__ == '__main__':
    with Pool() as p:
        res = []
        for i in range(10):
            res.append(p.apply_async(fxy, args = (i, i)))
        for item in res:
            print(item.get())

This gives a similar output to the first approach:

2019-11-10 11:41:24.987093
0
2019-11-10 11:41:24.996087
1
2019-11-10 11:41:25.008079
2019-11-10 11:41:25.002083
4
9
2019-11-10 11:41:26.988859
16
2019-11-10 11:41:27.009847
2019-11-10 11:41:27.009847
25
36
2019-11-10 11:41:27.011845
49
2019-11-10 11:41:28.989623
64
2019-11-10 11:41:29.019606
81

Now for matthews_corrcoef. For better verification of the results (your pred and man throw errors when applied to matthews_corrcoef), I am using the nomenclature and values like in the example in the documentation of matthews_corrcoef.

import datetime
from multiprocessing import Pool
import numpy as np
from sklearn.metrics import matthews_corrcoef

def mcc_score(y_true, y_pred): 
    print(datetime.datetime.utcnow())
    return matthews_corrcoef(y_true, y_pred)

y_true = [+1, +1, +1, -1]
y_pred = [+1, -1, +1, +1]

if __name__ == '__main__':
    with Pool() as p:
        res = []
        for i in range(10):
            res.append(p.apply_async(mcc_score, args = (y_true, y_pred)))
        for item in res:
            print(item.get())

Results like expected:

2019-11-10 11:49:07.309389
2019-11-10 11:49:07.345366
2019-11-10 11:49:07.375348
2019-11-10 11:49:07.393336
2019-11-10 11:49:07.412325
2019-11-10 11:49:07.412325
2019-11-10 11:49:07.412325
-0.3333333333333333
-0.3333333333333333
-0.3333333333333333
-0.3333333333333333
-0.3333333333333333
-0.3333333333333333
-0.3333333333333333
2019-11-10 11:49:07.420319
2019-11-10 11:49:07.420319
2019-11-10 11:49:07.413325
-0.3333333333333333
-0.3333333333333333
-0.3333333333333333

answered Sep 22 '22 06:09

mdk

Related questions
                            
                                Can't install dependencies in docker container
                            
                                How to return image stream and text as JSON response from Python Flask API
                            
                                Why cant I upload my own package to PyPI when my credential are working?
                            
                                How to turn groupby() and value_counts() into multiple pie/bar charts
                            
                                How to perform assert_has_calls for a __getitem__() call?
                            
                                cannot install tensorflow-text using pip despite having tensorflow 2.0.0-beta1 installed
                            
                                How to record mouse and keyboard movement simultaneously with Python?
                            
                                Hypothesis - reuse @given between tests
                            
                                Set background colour for a custom QWidget
                            
                                Airflow can not enter the /admin page after initdb
                            
                                One hot encoding of multi label images in keras
                            
                                How remove numbering from output after extract xls file with pandas [Python]
                            
                                Assign values to array efficiently
                            
                                Remove 2d slice from 3d numpy array
                            
                                Find all months between two date columns and generate row for each month
                            
                                How to set background color on image to white with OpenCV in Python
                            
                                torch.softmax and torch.sigmoid are not equivalent in the binary case
                            
                                How to list objects based on prefixes with wildcard using Python Boto3?
                            
                                Plotly: How to create subplots from each column in a pandas dataframe?
                            
                                Python GraphQL How to declare a self-referencing graphene object type

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Python pool not working in windows but works in linux

Tags:

python

windows

python-multiprocessing

scikit-learn

python-pool

adib.mosharrof

People also ask

1 Answers

mdk

Recent Activity

Donate For Us