
How to group rows in a NumPy 2D matrix based on column values?

Tags: python, numpy

What would be an efficient way (both fast to run and easy to write) of grouping the rows of a 2D NumPy matrix by conditions on column values (e.g. group by the values in column 2) and then running f1() and f2() on each of those groups?

Thanks

d1337 asked Dec 03 '22 23:12


2 Answers

If you have an array arr of shape (rows, cols), you can get the vector of all values in column 2 as

col = arr[:, 2]

You can then construct a boolean array from your grouping condition. Say group 1 is made up of those rows that have a value larger than 5 in column 2:

idx = col > 5

You can apply this boolean array directly to your original array to select rows:

group_1 = arr[idx]
group_2 = arr[~idx]

For example:

>>> import numpy as np
>>> arr = np.random.randint(10, size=(6,4))
>>> arr
array([[0, 8, 7, 4],
       [5, 2, 6, 9],
       [9, 5, 7, 5],
       [6, 9, 1, 5],
       [8, 0, 5, 8],
       [8, 2, 0, 6]])
>>> idx = arr[:, 2] > 5
>>> arr[idx]
array([[0, 8, 7, 4],
       [5, 2, 6, 9],
       [9, 5, 7, 5]])
>>> arr[~idx]
array([[6, 9, 1, 5],
       [8, 0, 5, 8],
       [8, 2, 0, 6]])
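
The boolean mask above splits the rows into two groups. If instead every distinct value in column 2 should form its own group, one possible approach is to sort the rows by that column and split at the group boundaries. The following is only a sketch: group_rows_by_column is a hypothetical helper name, and f1 and f2 are placeholder implementations standing in for the functions mentioned in the question.

```python
import numpy as np

def group_rows_by_column(arr, col):
    """Split the rows of a 2D array into groups sharing the same value in `col`."""
    # A stable sort keeps rows within a group in their original order.
    order = np.argsort(arr[:, col], kind="stable")
    sorted_arr = arr[order]
    # `starts` holds the index in sorted_arr where each distinct key first appears.
    keys, starts = np.unique(sorted_arr[:, col], return_index=True)
    # Split at every group boundary except the leading 0.
    groups = np.split(sorted_arr, starts[1:])
    return keys, groups

# Hypothetical stand-ins for the f1() and f2() from the question.
def f1(group):
    return group.sum(axis=0)   # column-wise sum of the group's rows

def f2(group):
    return group.shape[0]      # number of rows in the group

arr = np.array([[0, 8, 7, 4],
                [5, 2, 6, 9],
                [9, 5, 7, 5],
                [6, 9, 1, 5],
                [8, 0, 5, 8],
                [8, 2, 0, 6]])

keys, groups = group_rows_by_column(arr, 2)
# Map each key in column 2 to the results of both functions on its group.
results = {int(k): (f1(g), f2(g)) for k, g in zip(keys, groups)}
```

Sorting once and splitting avoids building a separate mask for each key, which may matter when there are many distinct values.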
Jaime answered Dec 09 '22 16:12


A compact solution is to use numpy_indexed (disclaimer: I am its author), which implements a fully vectorized solution to this type of problem.

The simplest way to use it is:

import numpy_indexed as npi
npi.group_by(arr[:, col1]).mean(arr)

But this also works:

# Run function f1 on each group; the keys are the rows of arr[:, [col1, col2]]
npi.group_by(arr[:, [col1, col2]], arr, f1)
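
For readers who do not have numpy_indexed installed, grouping on a combination of columns can be approximated in plain NumPy. The sketch below is not how numpy_indexed works internally. It is an illustration built on np.unique with axis=0, where group_rows_by_columns is a hypothetical helper name, f1 is a placeholder function, and the sample array is made up for the example.

```python
import numpy as np

def group_rows_by_columns(arr, cols):
    """Group rows by the combination of values found in several columns."""
    # Each distinct row of arr[:, cols] is one key; `inverse` maps every
    # original row to the index of its key.
    keys, inverse = np.unique(arr[:, cols], axis=0, return_inverse=True)
    # Flatten defensively: the shape of `inverse` has varied between NumPy versions.
    inverse = np.ravel(inverse)
    groups = [arr[inverse == i] for i in range(len(keys))]
    return keys, groups

# Hypothetical stand-in for f1: mean of the last column within a group.
def f1(group):
    return group[:, 2].mean()

# Made-up sample data: columns 0 and 1 form the key, column 2 holds values.
arr = np.array([[1, 0, 10],
                [1, 1, 20],
                [1, 0, 30],
                [2, 1, 40]])

keys, groups = group_rows_by_columns(arr, [0, 1])
results = [(tuple(k.tolist()), f1(g)) for k, g in zip(keys, groups)]
```

This version builds one boolean mask per key, so it is simple rather than fully vectorized. With many distinct keys, a sort-based split or the library call above is likely to scale better.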
Eelco Hoogendoorn answered Dec 09 '22 15:12