
pandas aggregate count higher than threshold

I have a DataFrame that I want to group by activity. I want to use df.agg to count, per group, how many values exceed 180.

Is there a way to write a small function for this?

I tried len(nice_numbers[nice_numbers > 180]) but it did not work.

df = pd.DataFrame(data = {
    'nice_numbers': [60, 64, 67, 70, 73, 75, 130, 180, 184, 186, 187, 187, 188, 194, 199, 195, 200, 210, 220, 222, 224, 250, 70, 40, 30, 300],
    'activity': ['sleeping', 'sleeping', 'sleeping', 'walking', 'walking', 'walking', 'working', 'working', 'working', 'working', 'working', 'restaurant', 'restaurant', 'restaurant', 'restaurant', 'walking', 'walking', 'walking', 'working', 'working', 'driving', 'driving', 'driving', 'home', 'home', 'home']})
df_gb = df.groupby('activity')
df_gb.agg({'count_frequency_over_180'})

thank you

DonChilliConCarne asked Feb 18 '26 12:02


1 Answer

Create a boolean mask by comparing the column with gt, then aggregate with sum to count the True values per group:

df1 = (df['nice_numbers'].gt(180)
                         .groupby(df['activity'], sort=False)
                         .sum()
                         .astype(int)
                         .reset_index())

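A similar solution groups on the index created by set_index:

# sum(level=0) was removed in pandas 2.0; group on the index level instead
df1 = df.set_index('activity')['nice_numbers'].gt(180).groupby(level=0, sort=False).sum().astype(int).reset_index()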
print (df1)
     activity  nice_numbers
0    sleeping             0
1     walking             3
2     working             5
3  restaurant             4
4     driving             2
5        home             1
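The reason a sum works as a count here can be seen on a smaller frame. The following is a minimal sketch, assuming a made-up five-row `demo` frame (not the question's data): the comparison produces True/False values, and because True is treated as 1 and False as 0, the grouped sum equals the number of rows above the threshold.

```python
import pandas as pd

# Hypothetical mini data set, used only for illustration
demo = pd.DataFrame({
    'activity': ['walking', 'walking', 'working', 'working', 'working'],
    'nice_numbers': [70, 195, 180, 184, 220],
})

# Step 1: element-wise comparison gives a boolean Series (True where value > 180)
mask = demo['nice_numbers'].gt(180)

# Step 2: True counts as 1 and False as 0, so a grouped sum is a grouped count
counts = mask.groupby(demo['activity']).sum()
print(counts)
```

Note that gt is a strict comparison, so the row with exactly 180 is not counted; ge(180) would include it.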

EDIT:

To compute several metrics for the nice_numbers column, use agg with named aggregation:

df1 = (df.groupby('activity')['nice_numbers']
         .agg(above_180_count=lambda x: x.gt(180).sum(), average='mean')
         .reset_index())
print (df1)
     activity  above_180_count     average
0     driving                2  181.333333
1        home                1  123.333333
2  restaurant                4  192.000000
3    sleeping                0   63.666667
4     walking                3  137.166667
5     working                5  187.000000
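Since the question asks for a small function, the counting step could also be wrapped in a reusable helper. This is only a sketch: `count_above` is a hypothetical name introduced here, and `demo` is made-up data rather than the question's frame. The helper returns a function that agg can call once per group.

```python
import pandas as pd

def count_above(threshold):
    """Return an aggregation function counting values strictly above `threshold`."""
    def _count(values):
        # `values` is the Series of one group; the boolean sum is the count
        return int((values > threshold).sum())
    return _count

# Hypothetical mini data set, used only for illustration
demo = pd.DataFrame({
    'activity': ['walking', 'walking', 'working', 'working', 'working'],
    'nice_numbers': [70, 195, 180, 184, 220],
})

# Named aggregation: each keyword becomes an output column
result = demo.groupby('activity')['nice_numbers'].agg(
    above_180_count=count_above(180),
    average='mean',
)
print(result)
```

Passing the threshold as an argument avoids hard-coding 180 inside a lambda, so the same helper can be reused for other cut-offs.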

For multiple thresholds, use:

df1 = pd.DataFrame({'threshold':[180, 270, 60]})
print (df1.head())
   threshold
0        180
1        270
2         60

#compare values by numpy broadcasting
arr = df['nice_numbers'].to_numpy()[:, None] > df1['threshold'].to_numpy()

#create new DataFrame and add column activity
df2 = (pd.DataFrame(arr, index=df.index, columns=df1['threshold'].tolist())
         .assign(activity = df['activity']))
print (df2.head())
     180    270     60  activity
0  False  False  False  sleeping
1  False  False   True  sleeping
2  False  False   True  sleeping
3  False  False   True   walking
4  False  False   True   walking

#aggregate sum
df3 = df2.groupby('activity', as_index=False).sum()
print (df3)
     activity  180  270  60
0     driving    2    0   3
1        home    1    1   1
2  restaurant    4    0   4
3    sleeping    0    0   2
4     walking    3    0   6
5     working    5    0   7
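If the broadcasting step feels opaque, the same per-threshold counts can be built with a plain loop over the thresholds. This is an alternative sketch, not the approach shown above; the `demo` frame and the two thresholds are made up for illustration. It will likely be slower than broadcasting for many thresholds, but each column is computed exactly like the single-threshold case.

```python
import pandas as pd

# Hypothetical mini data set, used only for illustration
demo = pd.DataFrame({
    'activity': ['walking', 'walking', 'working', 'working', 'working'],
    'nice_numbers': [70, 195, 180, 184, 220],
})
thresholds = [180, 60]

# One grouped boolean sum per threshold; the dict keys become column labels
counts = pd.DataFrame({
    t: demo['nice_numbers'].gt(t).groupby(demo['activity']).sum()
    for t in thresholds
})
print(counts)
```

Here the result is indexed by activity; calling reset_index() on it gives a regular activity column.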
jezrael answered Feb 20 '26 02:02


