I have a NumPy array that is lexicographically sorted on its first two columns, like this:
c1 c2 c3
2 0.9 3223
2 0.8 7899
2 0.7 23211
2 0.6 3232
2 0.5 4478
1 0.9 342
1 0.8 3434
1 0.7 24232
1 0.6 332
1 0.5 478
I want the c3 values for the top two rows of each c1 value, so the output should be: 3223, 7899, 342, 3434
What is the easiest way to do this in Python?
Assuming you have it in a NumPy array like this (ignore the scientific notation):
In [86]: arr
Out[86]:
array([[ 1.00000000e+00, 9.00000000e-01, 3.22300000e+03],
[ 1.00000000e+00, 8.00000000e-01, 7.89900000e+03],
[ 1.00000000e+00, 7.00000000e-01, 2.32110000e+04],
[ 1.00000000e+00, 6.00000000e-01, 3.23200000e+03],
[ 1.00000000e+00, 5.00000000e-01, 4.47800000e+03],
[ 2.00000000e+00, 9.00000000e-01, 3.42000000e+02],
[ 2.00000000e+00, 8.00000000e-01, 3.43400000e+03],
[ 2.00000000e+00, 7.00000000e-01, 2.42320000e+04],
[ 2.00000000e+00, 6.00000000e-01, 3.32000000e+02],
[ 2.00000000e+00, 5.00000000e-01, 4.78000000e+02]])
You can do the following, where k is the number of rows to take for each c1 value:
arr[np.roll(arr[:,0], k) != arr[:,0],2]
Example:
In [87]: arr[np.roll(arr[:,0], 2) != arr[:,0],2]
Out[87]: array([ 3223., 7899., 342., 3434.])
Explanation:
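We shift (roll) c1 by k positions to get c1'. Because rows with the same c1 are contiguous, the rows where c1 != c1' are the first k rows for each distinct value of c1 (or all of its rows, if that value of c1 has fewer than k rows). We use this boolean mask to index the original array and get the c3 values we want. One caveat: np.roll wraps around, so the first k rows are compared against the last k rows, and any of them whose c1 equals the wrapped-around value is dropped. In particular, the mask comes out empty if the array holds only a single c1 value.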
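As a minimal sketch of the same shifted-comparison idea without the wrap-around, the version below compares each row only with the row k positions earlier and always keeps the first k rows. The function name first_k_per_group and its key_col/value_col parameters are made up for illustration (they are not part of NumPy), and it assumes rows with equal c1 are contiguous, as in the sorted array above:

```python
import numpy as np


def first_k_per_group(arr, k, key_col=0, value_col=2):
    """Return value_col for the first k rows of each run of equal key_col values.

    Hypothetical helper: assumes rows sharing a key are contiguous
    (for example because the array is sorted on key_col).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    keys = arr[:, key_col]
    n = len(keys)
    # Rows 0..k-1 always lie within the first k rows of their group.
    mask = np.ones(n, dtype=bool)
    if k < n:
        # Row i (i >= k) is kept only if the key k rows earlier differs,
        # i.e. its group started fewer than k rows ago. No wrap-around here.
        mask[k:] = keys[k:] != keys[:-k]
    return arr[mask, value_col]


arr = np.array([
    [1, 0.9, 3223], [1, 0.8, 7899], [1, 0.7, 23211], [1, 0.6, 3232], [1, 0.5, 4478],
    [2, 0.9, 342], [2, 0.8, 3434], [2, 0.7, 24232], [2, 0.6, 332], [2, 0.5, 478],
])

print(first_k_per_group(arr, 2))      # c3 values 3223, 7899, 342, 3434
print(first_k_per_group(arr[:5], 2))  # single c1 value: still 3223, 7899
```

On the example array this selects the same rows as the np.roll one-liner. The two differ only when the wrap-around comparison would drop some of the first k rows, such as the single-group case in the last line.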
The np.roll approach is completely vectorized and therefore quite efficient. Finding the first 5 values for each c1 in an array with 100,000 rows and 1,000 different c1 values (c1 from 1 to 1000, c2 from 100 to 1 for each c1, c3 random) takes only ~2.4 ms on my computer:
In [132]: c1 = np.repeat(np.linspace(1,1000, 1000), 100)
In [133]: c2 = np.tile(np.linspace(100, 1, 100), 1000)
In [134]: c3 = np.random.randint(1, 10001, size=100000)  # randint's upper bound is exclusive; random_integers is deprecated
In [135]: arr = np.column_stack((c1, c2, c3))
In [136]: arr
Out[136]:
array([[ 1.00000000e+00, 1.00000000e+02, 2.21700000e+03],
[ 1.00000000e+00, 9.90000000e+01, 9.23000000e+03],
[ 1.00000000e+00, 9.80000000e+01, 1.47900000e+03],
...,
[ 1.00000000e+03, 3.00000000e+00, 7.41600000e+03],
[ 1.00000000e+03, 2.00000000e+00, 2.08000000e+03],
[ 1.00000000e+03, 1.00000000e+00, 3.41300000e+03]])
In [137]: %timeit arr[ np.roll(arr[:,0], 5) != arr[:,0], 2]
100 loops, best of 3: 2.36 ms per loop