
numpy group by, returning original indexes sorted by the result

I have an array like this:

array([[2, 1],
       [3, 5],
       [2, 1],
       [4, 2],
       [2, 3],
       [5, 3]])

What I want to do is a 'group-by' sum on the first column, and then sort by the summed second column in descending order:

array([[2, 5],
       [3, 5],
       [5, 3],
       [4, 2]])

And here comes the twist: for every row in the result array, I also want to get back the indexes of the rows in the original array that belong to it, in the same sorted order:

     2       3     5    4
 [[0,2,4],  [1],  [5], [3] ]

Or, if it is easier, I only need the indexes for the top N groups, let's say the top 2:

     2       3    
  [0,2,4,    1]

No pandas, only pure numpy.

BTW, I need only the top N items and their indexes, which may simplify or speed up the process.


I am trying to apply any of the approaches from this page:

https://izziswift.com/is-there-any-numpy-group-by-function

sten asked Oct 19 '25 12:10


2 Answers

There is sadly no group-by in Numpy, but you can use np.unique to find the unique elements and their indices, which is enough to implement what you need. Once the keys have been identified, you can perform a key-based reduction using np.add.at. To sort by value, you can use np.argsort. See this post and this one for more information.

import numpy as np

# df is the 2D input array from the question
keys, index = np.unique(df[:,0], return_inverse=True) # Find the unique keys to group by
values = np.zeros(len(keys), dtype=np.int64)          # Sum-based accumulator
np.add.at(values, index, df[:,1])                     # Key-based accumulation
tmp = np.hstack([keys[:,None], values[:,None]])       # Build the key-sum 2D array
res = tmp[(-tmp[:, 1]).argsort(kind="stable")]        # Sort by sum, descending (ties keep key order)

Note that the original indexes can be easily obtained from the index variable (which is an inverse index: index[i] is the position in keys of the key of row i). There is no built-in Numpy function that builds the per-key lists directly, but this is possible using a simple Python loop that accumulates each row index i in a list stored in a dictionary under its key keys[index[i]]. Here is an example:

from collections import defaultdict
d = defaultdict(list)
for i in range(len(df)):
    d[keys[index[i]]].append(i)  # Row i belongs to the group of key keys[index[i]]
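Putting these pieces together for the top-N case from the question, a minimal sketch could look like the following. It assumes the question's sample data; `top_n`, `order`, `top_keys`, `top_indexes` and `flat` are names made up for this illustration, and the stable descending sort is a choice made here so that ties keep ascending key order, as in the expected output.

```python
import numpy as np
from collections import defaultdict

# Sample data from the question; `df` is the name used in this answer.
df = np.array([[2, 1], [3, 5], [2, 1], [4, 2], [2, 3], [5, 3]])

# Group keys and, for each row, the position of its key in `keys`.
keys, index = np.unique(df[:, 0], return_inverse=True)

# Sum the second column per key.
values = np.zeros(len(keys), dtype=np.int64)
np.add.at(values, index, df[:, 1])

# Rank the groups by sum, largest first. Negating and using a stable
# sort keeps tied groups in ascending key order.
order = np.argsort(-values, kind="stable")

# Collect the original row indexes of every key.
d = defaultdict(list)
for i, k in enumerate(index):
    d[keys[k]].append(i)

top_n = 2  # hypothetical N, chosen for this example
top_keys = keys[order[:top_n]]
top_indexes = [d[k] for k in top_keys]  # one list of row indexes per top key
flat = np.concatenate(top_indexes)      # the same indexes as one flat array

print(top_keys)     # keys of the top groups
print(top_indexes)  # per-group original row indexes
print(flat)
```

Only the Python loop touches every row individually; the grouping and the sums stay vectorized, so the loop is the part to watch if the array is very large.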
Jérôme Richard answered Oct 21 '25 01:10


I'm not happy with this solution and can't verify that it will not break with other data. It uses the grouping idea from the referenced link, but computes the sums with np.add.reduceat.

import numpy as np

a = np.array(
      [[2, 1],
       [3, 5],
       [2, 1],
       [4, 2],
       [2, 3],
       [5, 3]])

s = a[:,0].argsort(kind="stable")  # stable sort keeps the original row order inside each group
b = a[s]
groups, index = np.unique(b[:,0], return_index=True)
# splits = np.split(b[:,1], index[1:]) # if you need the groups
groupsum = np.stack([groups, np.add.reduceat(b[:,1], index)]).T
groupsum[(-groupsum[:,1]).argsort(kind="stable")]  # descending by sum; ties keep key order

Output

array([[2, 5],
       [3, 5],
       [5, 3],
       [4, 2]])

To get the indices for each group:

np.stack([groups.astype(object),np.split(np.arange(len(a))[s], index[1:])]).T

Output

array([[2, array([0, 2, 4])],
       [3, array([1])],
       [4, array([3])],
       [5, array([5])]], dtype=object)
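The index groups above come out in ascending key order, not in the order of the sorted sums. Also, recent NumPy versions may refuse to build an array from sequences of different lengths, so a plain dictionary can be a sturdier container. A minimal sketch under those assumptions, which also covers the top-N case from the question (`members`, `order`, `by_group`, `top_n` and `top_indexes` are names introduced here for illustration):

```python
import numpy as np

a = np.array([[2, 1], [3, 5], [2, 1], [4, 2], [2, 3], [5, 3]])

# Stable sort by key, so rows of one group keep their original order.
s = a[:, 0].argsort(kind="stable")
b = a[s]
groups, index = np.unique(b[:, 0], return_index=True)
sums = np.add.reduceat(b[:, 1], index)

# `s` already holds the original row indexes in sorted-key order,
# so splitting it at the group starts gives one index array per group.
members = np.split(s, index[1:])

# Rank groups by sum, largest first; stable so ties keep key order.
order = np.argsort(-sums, kind="stable")

# Dictionaries preserve insertion order, so this one is sorted by sum.
by_group = {int(groups[g]): members[g].tolist() for g in order}

top_n = 2  # hypothetical N, chosen for this example
top_indexes = np.concatenate([members[g] for g in order[:top_n]])

print(by_group)
print(top_indexes)
```

This stays in pure NumPy apart from the final dictionary, and only the top N groups are concatenated, so no work is spent collecting indexes for groups that are not needed.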
Michael Szczesny answered Oct 21 '25 01:10