Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Getting a count for each category in a Pandas DataFrame

Tags:

python

pandas

I have a dataframe similar to the following:

import pandas as pd
data = {"Name":["Andrew","Andrew","Andrew","Andrew","Andrew","Andrew","Andrew", "Sam", "Sam", "Sam", "Sam", "Sam"], "PASS":[0, 1, 1, 0, 1, 1, 1, 0, 1, 1,0,1]}
df = pd.DataFrame(data=data)

Output

    Name    PASS
0   Andrew  0
1   Andrew  1
2   Andrew  1
3   Andrew  0
4   Andrew  1
5   Andrew  1
6   Andrew  1
7   Sam     0
8   Sam     1
9   Sam     1
10  Sam     0
11  Sam     1

I want to generate a dataframe which contains the largest consecutive passes for each student:

    Name    MAX_PASS
0   Andrew  3
1   Sam     2

I need a little help modifying the code that I have so far. The count is outputting 0110110110 and the result = 2. Which is not quite correct. I think I'm close but need some help to get over the finish line. Thanks.

count = ''
for i in range(len(df)-1):
    if df.Name[i] == df.Name[i+1]:
        if df.PASS[i] == 0:
            count += "0"
        else:
            count += "1"  
            result = len(max(count.split('0')))
like image 809
AndronikMk Avatar asked Oct 15 '25 07:10

AndronikMk


2 Answers

You can consider to adapt this answer

def max_strike_group(x, col):
    x = x[col]
    a = x != 0
    out = a.cumsum()-a.cumsum().where(~a).ffill().fillna(0).astype(int)
    return out.max()

df.groupby("Name").apply(lambda x:max_strike_group(x, "PASS"))
Name
Andrew    3
Sam       2
dtype: int64
like image 113
rpanai Avatar answered Oct 20 '25 14:10

rpanai


One option is to call on cumsum twice, the first time is to add up the 0s and 1s, the second time is to get the values after the reset:

TL-DR:

cum1 = df.groupby('Name').PASS.cumsum()
cum1 = np.where(cum1.shift() == cum1, cum1 * -1, df.PASS)
(df.assign(PASS = cum1, 
          max_pass = lambda df: df.groupby('Name').cumsum())
.groupby('Name')
.max_pass
.max()
)

Name
Andrew    3
Sam       2
Name: max_pass, dtype: int64

Explanation:

# first cumulative sum
cum1 = df.groupby('Name').PASS.cumsum()
cum1
0     0
1     1
2     2
3     2
4     3
5     4
6     5
7     0
8     1
9     2
10    2
11    3
Name: PASS, dtype: int64
# look for rows where the reset should occur
cum1 = np.where(cum1.shift() == cum1, cum1 * -1, df.PASS)
cum1
array([ 0,  1,  1, -2,  1,  1,  1,  0,  1,  1, -2,  1])
# build the max_pass column
# with the second cumsum and groupby
# before grouping again to get the max
(df.assign(PASS = cum1, 
          max_pass = lambda df: df.groupby('Name').cumsum())
.groupby('Name')
.max_pass
.max()
)

Name
Andrew    3
Sam       2
Name: max_pass, dtype: int64
like image 28
sammywemmy Avatar answered Oct 20 '25 13:10

sammywemmy