I'm generating 100 random int matrices of size 1000x1000. I'm using the multiprocessing module to calculate the eigen values of the 100 matrices.
The code is given below:
import timeit
import numpy as np
import multiprocessing as mp
def calEigen():
S, U = np.linalg.eigh(a)
def multiprocess(processes):
pool = mp.Pool(processes=processes)
#Start timing here as I don't want to include time taken to initialize the processes
start = timeit.default_timer()
results = [pool.apply_async(calEigen, args=())]
stop = timeit.default_timer()
print (processes":", stop - start)
results = [p.get() for p in results]
results.sort() # to sort the results
if __name__ == "__main__":
global a
a=[]
for i in range(0,100):
a.append(np.random.randint(1,100,size=(1000,1000)))
#Print execution time without multiprocessing
start = timeit.default_timer()
calEigen()
stop = timeit.default_timer()
print stop - start
#With 1 process
multiprocess(1)
#With 2 processes
multiprocess(2)
#With 3 processes
multiprocess(3)
#With 4 processes
multiprocess(4)
The output is
0.510247945786
('Process:', 1, 5.1021575927734375e-05)
('Process:', 2, 5.698204040527344e-05)
('Process:', 3, 8.320808410644531e-05)
('Process:', 4, 7.200241088867188e-05)
Another iteration showed this output:
69.7296020985
('Process:', 1, 0.0009050369262695312)
('Process:', 2, 0.023727893829345703)
('Process:', 3, 0.0003509521484375)
('Process:', 4, 0.057518959045410156)
My questions are these:
I have edited the code given in the comments below. I want the serial and multiprocessing functions to find the eigen values for the same list of 100 matrices. The edited code is-
import numpy as np
import time
from multiprocessing import Pool
a=[]
for i in range(0,100):
a.append(np.random.randint(1,100,size=(1000,1000)))
def serial(z):
result = []
start_time = time.time()
for i in range(0,100):
result.append(np.linalg.eigh(z[i])) #calculate eigen values and append to result list
end_time = time.time()
print("Single process took :", end_time - start_time, "seconds")
def caleigen(c):
result = []
result.append(np.linalg.eigh(c)) #calculate eigenvalues and append to result list
return result
def mp(x):
start_time = time.time()
with Pool(processes=x) as pool: # start a pool of 4 workers
result = pool.map_async(caleigen,a) # distribute work to workers
result = result.get() # collect result from MapResult object
end_time = time.time()
print("Mutltiprocessing took:", end_time - start_time, "seconds" )
if __name__ == "__main__":
serial(a)
mp(1,a)
mp(2,a)
mp(3,a)
mp(4,a)
There is no reduction in the time as the number of processes increases. Where am I going wrong? Does multiprocessing divide the list into chunks for the processes or do I have to do the division?
You're not using the multiprocessing module correctly. As @dopstar pointed out, you're not dividing your task. There is only one task for the process pool, so not matter how many workers you assigned, only one will get the job. As for your second question, I didn't use timeit to measure process time precisely. I just use time module to get a crude sense of how fast things are. It serves the purpose most of the time, though. If I understand what you're trying to do correctly, this should be the single process version of your code
import numpy as np
import time
result = []
start_time = time.time()
for i in range(100):
a = np.random.randint(1, 100, size=(1000,1000)) #generate random matrix
result.append(np.linalg.eigh(a)) #calculate eigen values and append to result list
end_time = time.time()
print("Single process took :", end_time - start_time, "seconds")
The single process version took 15.27 seconds on my computer. Below is the multiprocess version, which took only 0.46 seconds on my computer. I also included the single process version for comparison. (The single process version has to be enclosed in the if block as well and placed after the multiprocess version.) Because you would like to repeat your calculation for 100 times, it'd be a lot easier to create a pool of workers and let them take on unfinished task automatically than to manually start each process and specify what each process should do. Here in my codes, the argument for the caleigen call is merely to keep track of how many times the task has been executed. Finally, map_async is generally faster than apply_async, with its downside being consuming slightly more memory and taking only one argument for function call. The reason for using map_async but not map is that in this case, the order in which result is returned does not matter and map_async is much faster than map.
from multiprocessing import Pool
import numpy as np
import time
def caleigen(x): # define work for each worker
a = np.random.randint(1,100,size=(1000,1000))
S, U = np.linalg.eigh(a)
return S, U
if __name__ == "main":
start_time = time.time()
with Pool(processes=4) as pool: # start a pool of 4 workers
result = pool.map_async(caleigen, range(100)) # distribute work to workers
result = result.get() # collect result from MapResult object
end_time = time.time()
print("Mutltiprocessing took:", end_time - start_time, "seconds" )
# Run the single process version for comparison. This has to be within the if block as well.
result = []
start_time = time.time()
for i in range(100):
a = np.random.randint(1, 100, size=(1000,1000)) #generate random matrix
result.append(np.linalg.eigh(a)) #calculate eigen values and append to result list
end_time = time.time()
print("Single process took :", end_time - start_time, "seconds")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With