I have a matrix that looks like this:
M = [[1, 200],
[1.8, 100],
[2, 500],
[2.5, 300],
[3, 400],
[3.5, 200],
[5, 200],
[8, 100]]
I want to group the rows by a bin size applied to the left column, e.g. for a bin size of 2 (the first bin covers values from 0-2, the second from 2-4, the third from 4-6, etc.):
[[1, 200],
[1.8, 100],
----
[2, 500],
[2.5, 300],
[3, 400],
[3.5, 200],
----
[5, 200],
----
[8, 100]]
Then output a new matrix with the sum of the right columns for each group:
[200+100, 500+300+400+200, 200, 100]
What is an efficient way to sum each value based on the bin_size boundaries?
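For reference, the grouping can also be sketched in plain Python with a dict keyed on integer division by the bin size (a baseline sketch, not one of the library answers; the `sums` and `bin_size` names are illustrative):

```python
from collections import defaultdict

M = [[1, 200], [1.8, 100], [2, 500], [2.5, 300],
     [3, 400], [3.5, 200], [5, 200], [8, 100]]

bin_size = 2
sums = defaultdict(int)
for key, val in M:
    # floor-divide the left column by the bin size to get the bin index
    sums[int(key // bin_size)] += val

result = [sums[k] for k in sorted(sums)]
# [300, 1400, 200, 100]  (empty bins are skipped entirely)
```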
pandas
Make a DataFrame and then use integer division to define your bins:
import pandas as pd
df = pd.DataFrame(M)
df.groupby(df[0]//2)[1].sum()
#0
#0.0 300
#1.0 1400
#2.0 200
#4.0 100
#Name: 1, dtype: int64
Use .tolist() to get your desired output:
df.groupby(df[0]//2)[1].sum().tolist()
#[300, 1400, 200, 100]
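Note that this groupby drops empty bins entirely (there is no 0 for the 6-8 bin). If you want zeros for empty bins, one option (a sketch, assuming pandas matches the integer labels against the float group index) is to reindex the result:

```python
import pandas as pd

M = [[1, 200], [1.8, 100], [2, 500], [2.5, 300],
     [3, 400], [3.5, 200], [5, 200], [8, 100]]
df = pd.DataFrame(M)

s = df.groupby(df[0] // 2)[1].sum()
# fill in the missing bin 3 (values 6-8) with 0
full = s.reindex(range(int(s.index.max()) + 1), fill_value=0)
full.tolist()
# [300, 1400, 200, 0, 100]
```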
numpy.bincount
import numpy as np
gp, vals = np.transpose(M)
gp = (gp//2).astype(int)
np.bincount(gp, vals)
#array([ 300., 1400., 200., 0., 100.])
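The same idea wrapped in a small helper for an arbitrary bin size (the `bin_sums` name is just for illustration):

```python
import numpy as np

def bin_sums(M, bin_size):
    # split the two columns, bin the left one, weight-sum the right one
    gp, vals = np.transpose(M)
    idx = (gp // bin_size).astype(int)
    return np.bincount(idx, weights=vals)

M = [[1, 200], [1.8, 100], [2, 500], [2.5, 300],
     [3, 400], [3.5, 200], [5, 200], [8, 100]]
bin_sums(M, 2)
# array([ 300., 1400.,  200.,    0.,  100.])
```

Unlike the pandas version, `np.bincount` automatically fills empty bins with 0.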
You can make use of np.digitize and a scipy.sparse.csr_matrix here (this assumes M has been converted to a NumPy array):
import numpy as np
M = np.array(M)
bins = [2, 4, 6, 8, 10]
b = np.digitize(M[:, 0], bins)
v = M[:, 1]
Now perform a vectorized groupby-sum using a csr_matrix:
from scipy import sparse
sparse.csr_matrix(
(v, b, np.arange(v.shape[0]+1)), (v.shape[0], b.max()+1)
).sum(0)
#matrix([[ 300., 1400., 200., 0., 100.]])
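The result is a np.matrix; if you want a flat list like the other answers, convert it (a self-contained sketch of the same approach):

```python
import numpy as np
from scipy import sparse

M = np.array([[1, 200], [1.8, 100], [2, 500], [2.5, 300],
              [3, 400], [3.5, 200], [5, 200], [8, 100]])
bins = [2, 4, 6, 8, 10]
b = np.digitize(M[:, 0], bins)   # bin index per row: [0, 0, 1, 1, 1, 1, 2, 4]
v = M[:, 1]

# CSR with one entry per row: row i holds value v[i] in column b[i];
# summing over axis 0 then gives the per-bin totals
out = sparse.csr_matrix(
    (v, b, np.arange(v.shape[0] + 1)), (v.shape[0], b.max() + 1)
).sum(0)
np.asarray(out).ravel().tolist()
# [300.0, 1400.0, 200.0, 0.0, 100.0]
```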