I don't understand why my computation takes longer when I use 28-30 cores than when I use 12-16 cores on an AWS EC2 c3.8xlarge. I ran some tests and the results are in the chart below:
https://www.dropbox.com/s/8u32jttxmkvnacd/Slika%20zaslona%202015-01-11%20u%2018.33.20.png?dl=0
The fastest computation is with 13 cores, and using the maximum number of cores takes about as long as using 8 cores of the c3.8xlarge:
https://www.dropbox.com/s/gf3bevbi8dwk5vh/Slika%20zaslona%202015-01-11%20u%2018.32.53.png?dl=0
This is a simplified version of the code I use:
import random
import multiprocessing as mp
import threading as th
import numpy as np

x = mp.Value('f', 0)
y = mp.Value('f', 0)
arr = []
tasks = []
nesto = []  # unused

def calculation2(some_array):
    # runs in a pool worker process
    global x, y, arr
    p = False
    a = np.sum(some_array) * random.random()
    b = a ** (random.random())
    if a > x.value:
        x.value = a
        y.value = b
        arr = some_array
        p = True
    if p:
        return x.value, y.value, arr
    # returns None when p is False; the unpacking error in
    # exec_activator is then swallowed by the bare except

def calculation1(number_of_pool):
    # generates tasks and submits them to the pool
    global tasks
    pool = mp.Pool(number_of_pool)
    for i in range(1, 500):
        some_array = np.random.randint(100, size=(1, 4))
        tasks += [pool.apply_async(calculation2, args=(some_array,))]

def exec_activator():
    # collects results while the generator thread is alive or tasks remain
    global x, y, arr
    while tasks_gen.is_alive() or len(tasks) > 0:
        try:
            task = tasks.pop(0)
            x.value, y.value, arr = task.get()
        except:
            pass

def results(task_act):
    # waits for the collector thread to finish, then prints the best result
    while task_act.is_alive():
        pass
    else:
        print(x.value)
        print(y.value)
        print(arr)

tasks_gen = th.Thread(target=calculation1, args=(4,))
task_act = th.Thread(target=exec_activator)
result_print = th.Thread(target=results, args=(task_act,))

tasks_gen.start()
task_act.start()
result_print.start()
At its core are the two calculations. The goal of the code is to find the array that produces the maximum x, and to return its y. The two calculations run simultaneously (with threading) because otherwise there are sometimes too many arrays, which take up too much RAM.
My goal is the fastest possible computation, and I need advice on how to use all cores if possible.
Sorry in advance for my bad English. If you need more information, please ask.
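As a hedged sketch (not the asker's actual code, and with a made-up scoring function standing in for calculation2): one way to pursue the same goal without threads or shared mp.Value state is to have each worker return its candidate (x, y, arr) tuple and let the parent reduce with max, which sidesteps lock contention on the shared values.

```python
import random
import multiprocessing as mp
import numpy as np

def score(some_array):
    # stand-in for calculation2: compute (x, y) for one candidate array
    a = np.sum(some_array) * random.random()
    b = a ** random.random()
    return a, b, some_array

def find_best(n_workers, n_tasks=500):
    # generate the candidate arrays up front, then map over the pool;
    # the reduction happens in the parent, so no shared state is needed
    arrays = [np.random.randint(100, size=(1, 4)) for _ in range(n_tasks)]
    with mp.Pool(n_workers) as pool:
        results = pool.map(score, arrays)
    return max(results, key=lambda r: r[0])  # tuple with the largest x

if __name__ == '__main__':
    x, y, arr = find_best(4)
    print(x, y, arr)
```

If RAM is the concern, pool.imap_unordered with a generator of arrays keeps only a bounded number of candidates in flight instead of materializing all of them.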
The c3.8xlarge is an Ivy Bridge-based system that uses Hyper-Threading: its 32 vCPUs map to 16 physical cores, so it doesn't really have 32 independent processing units.
There's often no point in trying to parallelize a CPU-bound task across more OS processes than there are processors in the hardware. In fact, it's quite often detrimental due to the resource overhead and context switching (which is what you're seeing).
The sweet spot depends on your specific application, and experimentation will help you find it (which it sounds like you've done).