I have an array like this:
array([[2, 1],
[3, 5],
[2, 1],
[4, 2],
[2, 3],
[5, 3]])
What I want to do is a 'group-by' sum on the first column, then sort by the second column (the sums) in descending order:
array([[2, 5],
[3, 5],
[5, 3],
[4, 2]])
And here comes the twist: for every row of the result array, I also want to get back the indexes of the matching rows in the original array, in the same sorted order:
2 3 5 4
[[0,2,4], [1], [5], [3] ]
Or, if it's easier, I only need the indexes for the top N groups. For example, the top 2:
2 3
[0,2,4, 1]
No pandas, only pure numpy.
BTW, I only need the top N items and their indexes, which may simplify or speed up the process.
I'm trying to apply one of the approaches from here:
https://izziswift.com/is-there-any-numpy-group-by-function
There is sadly no group-by in NumPy, but you can use np.unique to find the unique elements and their indices, which is enough to implement what you need. Once the keys have been identified, you can perform a key-based reduction using np.add.at. To sort by value, you can use np.argsort. See this post and this one for more information.
import numpy as np
# df is the 2D input array from the question
keys, index = np.unique(df[:,0], return_inverse=True) # Find the unique key to group
values = np.zeros(len(keys), dtype=np.int64) # Sum-based accumulator
np.add.at(values, index, df[:,1]) # Key-based accumulation
tmp = np.hstack([keys[:,None], values[:,None]]) # Build the key-sum 2D array
res = tmp[tmp[:, 1].argsort()[::-1]] # Sort by value
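As a side note, the same key-based sum can also be sketched with np.bincount, which accepts the inverse index from np.unique directly. This is only an alternative under the assumption that the values are numeric; the variable names below (a, inverse, sums) are chosen for illustration and are not from the answer above. np.bincount returns floats when weights are given, so the result is cast back to integers here.

```python
import numpy as np

# Input array from the question
a = np.array([[2, 1], [3, 5], [2, 1], [4, 2], [2, 3], [5, 3]])

# Unique keys and, for each row, the position of its key in `keys`
keys, inverse = np.unique(a[:, 0], return_inverse=True)

# Weighted bincount sums column 1 per key; the result is float64, so cast back
sums = np.bincount(inverse, weights=a[:, 1]).astype(np.int64)

# Pair each key with its sum
key_sums = np.column_stack([keys, sums])
```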
Note that the original row indices of each group can be easily obtained from the index variable (the inverse index returned by np.unique). NumPy has no single function that builds these lists directly, but it is possible with a simple Python loop that appends each row index i to a list stored in a dictionary under its key keys[index[i]]. Here is an example:
from collections import defaultdict
d = defaultdict(list)
for i in range(len(df)): d[keys[index[i]]].append(i)
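Putting the pieces together for the "top N" case from the question, a minimal sketch might look like the following. The helper name top_n_groups and its parameters are hypothetical, not part of NumPy. A stable sort on the negated sums is assumed here so that groups with equal sums keep their key order.

```python
from collections import defaultdict

import numpy as np


def top_n_groups(arr, n):
    """Hypothetical helper: sum column 1 per key in column 0, keep the n largest sums."""
    keys, inverse = np.unique(arr[:, 0], return_inverse=True)
    sums = np.zeros(len(keys), dtype=np.int64)
    np.add.at(sums, inverse, arr[:, 1])  # key-based accumulation

    # Stable sort on the negated sums: descending order, ties keep key order
    order = np.argsort(-sums, kind="stable")[:n].tolist()

    # Collect the original row indices of every group
    rows = defaultdict(list)
    for i, g in enumerate(inverse.tolist()):
        rows[g].append(i)

    return keys[order], sums[order], [rows[g] for g in order]


a = np.array([[2, 1], [3, 5], [2, 1], [4, 2], [2, 3], [5, 3]])
top_keys, top_sums, top_rows = top_n_groups(a, 2)
# top_keys -> [2, 3], top_sums -> [5, 5], top_rows -> [[0, 2, 4], [1]]
```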
I'm not happy with this solution and can't verify that it won't break with other data. It uses the referenced idea to group, but computes the sums with np.add.reduceat.
import numpy as np
a = np.array(
[[2, 1],
[3, 5],
[2, 1],
[4, 2],
[2, 3],
[5, 3]])
s = a[:,0].argsort(kind='stable') # stable sort keeps the original row order within each group
b = a[s]
groups, index = np.unique(b[:,0], return_index=True)
# splits = np.split(b[:,1], index[1:]) # if you need the groups
groupsum = np.stack([groups, np.add.reduceat(b[:,1], index)]).T
groupsum[(groupsum[:,1]*(-1)).argsort()]
Output
array([[2, 5],
[3, 5],
[5, 3],
[4, 2]])
To get the indices for each group
np.stack([groups.astype(object),np.split(np.arange(len(a))[s], index[1:])]).T
Output
array([[2, array([0, 2, 4])],
[3, array([1])],
[4, array([3])],
[5, array([5])]], dtype=object)
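For the "top N" variant asked about in the question, the same sort-and-split idea can be sketched without building an object array (stacking ragged pieces into one array may be rejected by newer NumPy releases). The names n, start, row_groups and flat_rows are chosen for illustration only, and a stable sort is assumed so that ties and in-group row order are deterministic.

```python
import numpy as np

a = np.array([[2, 1], [3, 5], [2, 1], [4, 2], [2, 3], [5, 3]])
n = 2  # hypothetical: number of top groups to keep

# Sort rows by key; a stable sort keeps original row order inside each group
s = a[:, 0].argsort(kind="stable")
b = a[s]

# Unique keys and the start offset of each group in the sorted array
groups, start = np.unique(b[:, 0], return_index=True)
sums = np.add.reduceat(b[:, 1], start)   # per-group sums
row_groups = np.split(s, start[1:])      # original row indices per group

# Descending by sum; ties keep key order thanks to the stable sort
order = np.argsort(-sums, kind="stable")[:n]
top = np.column_stack([groups[order], sums[order]])
top_rows = [row_groups[g] for g in order]
flat_rows = np.concatenate(top_rows)
# top -> [[2, 5], [3, 5]], flat_rows -> [0, 2, 4, 1]
```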