Ok, after some searching I can't seem to find a SO question that directly tackles this. I've looked into masked arrays and although they seem cool, I'm not sure if they are what I need.
Consider 2 numpy arrays:
zone_data is a 2-d numpy array with clumps of elements that share the same value. These are my 'zones'.
value_data is a 2-d numpy array (exactly the same shape as zone_data) with arbitrary values.
I seek a numpy array of same shape as zone_data/value_data that has the average values of each zone in place of the zone numbers.
Example, in ASCII art form.
zone_data (4 distinct zones):
1, 1, 2, 2
1, 1, 2, 2
3, 3, 4, 4
3, 3, 4, 4
value_data:
1, 2, 3, 6
3, 0, 2, 5
1, 1, 1, 0
2, 4, 2, 1
My result, call it result_data:
1.5, 1.5, 4.0, 4.0
1.5, 1.5, 4.0, 4.0
2.0, 2.0, 1.0, 1.0
2.0, 2.0, 1.0, 1.0
Here's the code I have. It works fine as far as giving me a correct result.
result_data = np.zeros(zone_data.shape)
for i in np.unique(zone_data):
    result_data[zone_data == i] = np.mean(value_data[zone_data == i])
My arrays are big and my code snippet takes several seconds. I think I have a knowledge gap and haven't hit on anything helpful. The loop aspect needs to be delegated to a library or something...aarg!
I seek help to make this FASTER! Python gods, I seek your wisdom!
EDIT -- adding benchmark script
import numpy as np
import time
zones = np.random.randint(1000, size=(2000,1000))
values = np.random.rand(2000,1000)
print('start method 1:')
start_time = time.time()
result_data = np.zeros(zones.shape)
for i in np.unique(zones):
    result_data[zones == i] = np.mean(values[zones == i])
print('done method 1 in %.2f seconds' % (time.time() - start_time))
print()
print('start method 2:')
start_time = time.time()
#your method here!
print('done method 2 in %.2f seconds' % (time.time() - start_time))
my output:
start method 1:
done method 1 in 4.34 seconds
start method 2:
done method 2 in 0.00 seconds
You could use np.bincount:
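# number of elements in each zone, indexed by zone id
count = np.bincount(zones.flat)
# sum of the values in each zone
tot = np.bincount(zones.flat, weights=values.flat)
# per-zone mean
avg = tot/count
# look up the mean of each element's zone
result_data2 = avg[zones]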
which gives me
start method 1:
done method 1 in 3.13 seconds
start method 2:
done method 2 in 0.01 seconds
>>> np.allclose(result_data, result_data2)
True
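One caveat: np.bincount only accepts non-negative integers, and a zone id that never occurs gets a count of 0 (so its average is nan, although it is never looked up). If your zone labels might be negative or widely spaced, a possible variation is to remap them to 0..k-1 with np.unique(..., return_inverse=True) first. This is only a sketch under that assumption; the function name zone_means is made up for illustration, and it is checked here on a small 4x4 example with four 2x2 zones.

```python
import numpy as np

def zone_means(zone_data, value_data):
    """Replace each zone label with the mean of value_data over that zone.

    zone_means is a hypothetical helper name, not part of numpy.
    """
    # Remap arbitrary integer labels to compact indices 0..k-1.
    # Raveling first keeps `inverse` 1-d regardless of numpy version.
    labels, inverse = np.unique(zone_data.ravel(), return_inverse=True)
    # Number of elements per zone (every label occurs at least once).
    counts = np.bincount(inverse)
    # Sum of values per zone.
    totals = np.bincount(inverse, weights=value_data.ravel())
    means = totals / counts
    # Broadcast each zone's mean back to the original shape.
    return means[inverse].reshape(zone_data.shape)

zone_data = np.array([[1, 1, 2, 2],
                      [1, 1, 2, 2],
                      [3, 3, 4, 4],
                      [3, 3, 4, 4]])
value_data = np.array([[1, 2, 3, 6],
                       [3, 0, 2, 5],
                       [1, 1, 1, 0],
                       [2, 4, 2, 1]])
result_data = zone_means(zone_data, value_data)
print(result_data)
```

The extra np.unique call sorts the labels, so it costs more than plain bincount; if your zones are already small non-negative integers, the version above without the remapping should be enough.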