What would be an efficient (time, easy) way of grouping a 2D NumPy matrix rows by different column conditions (e.g. group by column 2 values) and running f1() and f2() on each of those groups?
Thanks
If you have an array arr of shape (rows, cols), you can get the vector of all values in column 2 as
col = arr[:, 2]
You can then construct a boolean array with your grouping condition. Say group 1 is made up of those rows that have a value larger than 5 in column 2:
idx = col > 5
You can apply this boolean array directly to your original array to select rows:
group_1 = arr[idx]
group_2 = arr[~idx]
For example:
>>> arr = np.random.randint(10, size=(6,4))
>>> arr
array([[0, 8, 7, 4],
       [5, 2, 6, 9],
       [9, 5, 7, 5],
       [6, 9, 1, 5],
       [8, 0, 5, 8],
       [8, 2, 0, 6]])
>>> idx = arr[:, 2] > 5
>>> arr[idx]
array([[0, 8, 7, 4],
       [5, 2, 6, 9],
       [9, 5, 7, 5]])
>>> arr[~idx]
array([[6, 9, 1, 5],
       [8, 0, 5, 8],
       [8, 2, 0, 6]])
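The question also asks about grouping by each distinct value in column 2 and running f1() and f2() on every group. A minimal sketch of that, extending the boolean-mask idea above, could look like the following. The bodies of f1 and f2 are placeholders (the question does not say what they do), and apply_by_column_value is a hypothetical helper name, not a NumPy function.

```python
import numpy as np

def f1(group):
    # Placeholder for the asker's f1: per-column mean of the group's rows.
    return group.mean(axis=0)

def f2(group):
    # Placeholder for the asker's f2: number of rows in the group.
    return group.shape[0]

def apply_by_column_value(arr, col, funcs):
    """Group rows of arr by the distinct values in column col and
    apply every function in funcs to each group (hypothetical helper)."""
    results = {}
    for key in np.unique(arr[:, col]):
        # One boolean mask per distinct key selects that group's rows.
        group = arr[arr[:, col] == key]
        results[key.item()] = [f(group) for f in funcs]
    return results

# Same values as the example array above, hard-coded for reproducibility.
arr = np.array([[0, 8, 7, 4],
                [5, 2, 6, 9],
                [9, 5, 7, 5],
                [6, 9, 1, 5],
                [8, 0, 5, 8],
                [8, 2, 0, 6]])

results = apply_by_column_value(arr, 2, [f1, f2])
```

This builds one mask per distinct key, so it is simple but scans the column once for every group; it is fine for a handful of groups and gets slow when there are many.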
A compact solution is to use numpy_indexed (disclaimer: I am its author), which implements a fully vectorized solution to this type of problem.
The simplest way to use it, with col1 being the index of the column to group by, is:
import numpy_indexed as npi
npi.group_by(arr[:, col1]).mean(arr)
But this also works:
# run function f1 on each group, formed by keys which are the rows of arr[:, [col1, col2]]
npi.group_by(arr[:, [col1, col2]], arr, f1)
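If adding a dependency is not an option, a similar grouping can be sketched in plain NumPy by sorting on the key column and splitting at the group boundaries. This is only an illustration of the general sort-and-split idea, not a description of how numpy_indexed works internally; split_by_column is a hypothetical helper name.

```python
import numpy as np

def split_by_column(arr, col):
    """Return (keys, groups): the distinct values in column col and the
    list of row blocks sharing each value (hypothetical helper)."""
    # Stable sort keeps the original row order within each group.
    order = np.argsort(arr[:, col], kind="stable")
    sorted_arr = arr[order]
    # In the sorted array, return_index gives the first row of each group.
    keys, starts = np.unique(sorted_arr[:, col], return_index=True)
    # Split at every group start except the first (which is index 0).
    groups = np.split(sorted_arr, starts[1:])
    return keys, groups

arr = np.array([[0, 8, 7, 4],
                [5, 2, 6, 9],
                [9, 5, 7, 5],
                [6, 9, 1, 5],
                [8, 0, 5, 8],
                [8, 2, 0, 6]])

keys, groups = split_by_column(arr, 2)
# Any per-group function can now be applied, here a per-column mean.
means = np.array([g.mean(axis=0) for g in groups])
```

The sort and split touch the data once regardless of how many groups there are; only the final per-group function call is a Python-level loop.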