
pandas DataFrame: aggregate values within blocks of repeating IDs

Given a DataFrame with an ID column and corresponding values column, how can I aggregate (let's say sum) the values within blocks of repeating IDs?

Example DF:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'id': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b', 'a', 'b', 'b', 'b'],
     'v': np.ones(15)}
    )

Note that there are only two unique IDs, so a simple groupby('id') won't work. Also, the IDs don't alternate/repeat in a regular manner. What I came up with was to recreate the index so that it represents the blocks of repeating IDs:

# where id changes:
m = [True] + list(df['id'].values[:-1] != df['id'].values[1:])

# generate a new index from m:
idx, i = [], -1
for b in m:
    if b:
        i += 1
    idx.append(i)

# set as index:
df = df.set_index(np.array(idx))

# now I can use groupby:
df.groupby(df.index)['v'].sum()
# 0    5.0
# 1    3.0
# 2    2.0
# 3    1.0
# 4    1.0
# 5    3.0

This re-creation of the index doesn't feel like the idiomatic pandas way to do this. What did I miss? Is there a better way to do this?

asked Nov 22 '25 by FObersteiner


1 Answer

The idea is to create a helper Series: compare the id column with itself shifted by one row using ne (not equal) and take the cumulative sum, so that each run of equal IDs gets its own label. Pass this Series to groupby together with the id column in a list, then remove the first level of the resulting MultiIndex with reset_index(level=0, drop=True) and finally convert the remaining index level back into the id column with reset_index():

print (df['id'].ne(df['id'].shift()).cumsum())
0     1
1     1
2     1
3     1
4     1
5     2
6     2
7     2
8     3
9     3
10    4
11    5
12    6
13    6
14    6
Name: id, dtype: int32

df1 = (df.groupby([df['id'].ne(df['id'].shift()).cumsum(), 'id'])['v'].sum()
          .reset_index(level=0, drop=True)
          .reset_index())
print (df1)
  id    v
0  a  5.0
1  b  3.0
2  a  2.0
3  b  1.0
4  a  1.0
5  b  3.0
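
If only the per-block sums are needed, as in the question's output, the helper Series can also be used as the sole grouping key. A minimal sketch based on the same data and the same helper Series (the variable name blocks is just for illustration); it should print something like:

# label each run of equal IDs (same ne/shift/cumsum trick as above)
blocks = df['id'].ne(df['id'].shift()).cumsum()

# sum the values per block; the index holds the block labels 1..6
print(df.groupby(blocks)['v'].sum())
# id
# 1    5.0
# 2    3.0
# 3    2.0
# 4    1.0
# 5    1.0
# 6    3.0
# Name: v, dtype: float64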

Another idea is to use GroupBy.agg with a dictionary and aggregate the id column with GroupBy.first:

df1 = (df.groupby(df['id'].ne(df['id'].shift()).cumsum(), as_index=False)
         .agg({'id':'first', 'v':'sum'}))
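
This should give the same six-row result as df1 above. On pandas 0.25+, the same thing can also be written with named aggregation instead of a dictionary; this is a sketch of that alternative, not part of the original answer:

blocks = df['id'].ne(df['id'].shift()).cumsum()

df1 = (df.groupby(blocks)
         .agg(id=('id', 'first'), v=('v', 'sum'))  # named aggregation
         .reset_index(drop=True))                  # drop the block labels
print(df1)
#   id    v
# 0  a  5.0
# 1  b  3.0
# 2  a  2.0
# 3  b  1.0
# 4  a  1.0
# 5  b  3.0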
answered Nov 23 '25 by jezrael