Here is some code I wrote for parallelizing the creation of a dictionary
parallelize.py
import time
from multiprocessing import Pool


def assign_dict(alist):
    # Map every element of the list to itself.
    return {x: x for x in alist}


if __name__ == "__main__":
    size = 10000000
    threshold = 10000000
    my_list = list(range(size))

    # Sequential version.
    start = time.time()
    my_dict = assign_dict(my_list)
    end = time.time()
    print("check seq", end - start, " sec")

    # Parallel version: build one dict per chunk in a worker pool.
    my_dict = {}
    chunks = [my_list[i * threshold:(i + 1) * threshold] for i in range(size // threshold)]
    process_count = 7
    pool = Pool(processes=process_count)

    start = time.time()
    inter_list = pool.map_async(assign_dict, chunks)
    inter_list.wait()
    inter_list = inter_list.get()

    # Merge the per-chunk dicts into a single dict.
    for some_dict in inter_list:
        print("Combining...", time.time() - start, " sec elapsed")
        my_dict = {**my_dict, **some_dict}

    print("check 152002 as key ", my_dict[152002])
    end = time.time()
    print("check parallel", end - start, " sec")
Here is the output for size 1 mil with threshold 1 mil
check seq 0.6559352874755859 sec
Combining... 4.751460790634155 sec elapsed
check 152002 as key 152002
check parallel 5.064720869064331 sec
Here is the output for size 10 mil with threshold 1 mil
check seq 0.668889045715332 sec
Combining... 1.6871337890625
Combining... 1.7269806861877441
Combining... 1.860083818435669
Combining... 2.0794677734375
Combining... 2.266465663909912
Combining... 2.47836971282959
Combining... 2.8915648460388184
Combining... 3.2443037033081055
Combining... 3.6063129901885986
Combining... 3.9933629035949707
check 115202 as key 1152002
check parallel 4.406447887420654 sec
Here is the output for size 100 mil with threshold 10 mil. The worst part here is that even before the combining, the map_async alone still takes 55 secs, compared to the 19 secs in sequential.
check seq 19.182615041732788 sec
Combining... 55.18172788619995
Combining... 56.38586497306824
Combining... 58.534785747528076
Combining... 61.805513858795166
Combining... 64.75091290473938
Combining... 71.34392070770264
Combining... 76.02847385406494
Combining... 81.2545096874237
Combining... 87.75674867630005
Combining... 109.01232576370239
check 115202 as key 1152002
check parallel 126.1939218044281 sec
Even though I tried various combinations of size and threshold, the code with the pool is always slower, so it is not that the threshold is too big, since the sequential version runs very quickly. Even when the size is the same as the threshold, the code with the pool is many seconds slower.
And even for long-running jobs like size = 1000 million, the parallelized version is way slower than sequential execution, meaning that there is no benefit from parallelization. I have 8 cores and 16 GB RAM, I am running macOS, and I even verified in Activity Monitor that my cores were running in parallel to execute the task, yet it is slower. The combine phase does not take much time, as shown: by the time the inter_list.get() call returns, the parallel part is already done, so it cannot interfere with combining the dictionaries.
Can anyone please parallelize this code to be faster than the sequential version in Python 3 or at least help me understand why this is happening?
Parallel can actually be slower than serial when, for example, there is only one processor (a single-core CPU), so the operations cannot actually run in parallel. In that case the parallel version still pays a small cost for switching context between processes, a cost that is not needed when processing serially.
In multiprocessing, multiple Python processes are created and used to execute a function instead of multiple threads, bypassing the Global Interpreter Lock (GIL) that can significantly slow down threaded Python programs.
Your multiprocessing version is slower than the sequential version because of the inter-process communication required to pass the results of Pool.map from the workers back to the process from which the workers were forked.
Because of the GIL, Python's multiprocessing library is the suggested way to run CPU-intensive tasks in parallel. However, this means that each worker in the Pool has its own virtual memory address space, and therefore the results of Pool.map must be serialized and passed between processes. Because the result of your Pool.map is so large, your program spends a large amount of time serializing/deserializing the answers and passing them between processes. In the sequential version there is only a single process, so the result never needs to be serialized, passed between processes, and then deserialized, which is probably why it runs faster in this case.
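You can get a rough feel for this cost yourself by timing the pickling of one dict the size of what the workers collectively send back. This is only a back-of-the-envelope sketch (the size value is my own choice to match the 10 million entry case, and Pool additionally has to push the bytes through a pipe, so the real overhead is even higher):

import pickle
import time

size = 10000000  # total number of entries that have to cross process boundaries

start = time.time()
d = {x: x for x in range(size)}
print("build dict:  ", time.time() - start, "sec")

start = time.time()
payload = pickle.dumps(d)          # roughly what the worker side has to do
print("serialize:   ", time.time() - start, "sec,", len(payload) / 1e6, "MB")

start = time.time()
pickle.loads(payload)              # roughly what the parent side has to do
print("deserialize: ", time.time() - start, "sec")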
If you want to avoid this bottleneck, try using Python's shared-memory arrays (for example multiprocessing.Array or multiprocessing.shared_memory), which sidestep the inter-process communication since the array is mapped into the virtual address space of all worker processes. If you really need a key-value mapping, then look into Python's multiprocessing.Manager().dict().
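As an illustration of the shared-memory route, here is a minimal sketch. It is not your original code: it assumes the keys are exactly 0..size-1 (as in the question), so a flat shared array indexed by key can stand in for the dict, and the chunking and worker count are arbitrary choices of mine. The workers write into the shared buffer and return nothing, so no large result has to be pickled back to the parent:

import ctypes
from multiprocessing import Pool
from multiprocessing.sharedctypes import RawArray

def init_worker(arr):
    # Give every worker a module-level reference to the shared buffer.
    global shared
    shared = arr

def fill(bounds):
    lo, hi = bounds
    for x in range(lo, hi):
        shared[x] = x      # write straight into shared memory
    # Nothing large is returned, so nothing large gets serialized.

if __name__ == "__main__":
    size = 10000000
    shared = RawArray(ctypes.c_long, size)   # lives in shared memory
    chunks = [(i, min(i + 1000000, size)) for i in range(0, size, 1000000)]
    with Pool(processes=7, initializer=init_worker, initargs=(shared,)) as pool:
        pool.map(fill, chunks)
    print(shared[152002])                    # 152002, like the key check above

Note that Manager().dict() avoids the single big transfer, but every insert then goes through the manager process as a separate message, so for bulk-loading millions of keys it is usually even slower than plain Pool.map.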
In general, Pool.map is good when you can parallelize some computation that does not produce a large quantity of data.
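For contrast, here is a toy example (again my own, not from your post) of the kind of workload where Pool.map typically does pay off: each task does a noticeable amount of CPU work but returns only a single integer, so almost nothing has to be pickled back to the parent:

import time
from multiprocessing import Pool

def sum_of_squares(bounds):
    lo, hi = bounds
    return sum(x * x for x in range(lo, hi))   # result is one int, cheap to send back

if __name__ == "__main__":
    size = 20000000
    chunks = [(i, min(i + 2000000, size)) for i in range(0, size, 2000000)]

    start = time.time()
    sequential = sum(x * x for x in range(size))
    print("sequential:", time.time() - start, "sec")

    start = time.time()
    with Pool(processes=7) as pool:
        parallel = sum(pool.map(sum_of_squares, chunks))
    print("parallel:  ", time.time() - start, "sec")

    assert sequential == parallel

On a multi-core machine the parallel timing should come out ahead here, because the per-task arguments and results are tiny compared to the work each task does.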