Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Groupby count only when a certain value is present in one of the column in pandas

I have a dataframe similar to the below mentioned database:

+------------+-----+--------+ | time | id | status | +------------+-----+--------+ | 1451606400 | id1 | Yes | | 1451606400 | id1 | Yes | | 1456790400 | id2 | No | | 1456790400 | id2 | Yes | | 1456790400 | id2 | No | +------------+-----+--------+

I'm grouping by all the columns mentioned above and i'm able to get the count in a different column named 'count' successfully using the below command:

df.groupby(['time','id', 'status']).size().reset_index(name='count')

But I want the count in the above dataframe only in those rows with status = 'Yes' and rest should be '0'

Desired Output:

+------------+-----+--------+---------+ | time | id | status | count | +------------+-----+--------+---------+ | 1451606400 | id1 | Yes | 2 | | 1456790400 | id2 | Yes | 1 | | 1456790400 | id2 | No | 0 | +------------+-----+--------+---------+

I tried to count for status = 'Yes' with the below code:

df[df['status']== 'Yes'].groupby(['time','id','status']).size().reset_index(name='count')

which obviously gives me those rows with status = 'Yes' and discarded the rest. I want the discarded ones with count = 0

Is there any way to get the result?

Thanks in advance!

like image 231
maninekkalapudi Avatar asked Dec 23 '22 03:12

maninekkalapudi


2 Answers

Use lambda function with apply and for count sum boolena True values proccesses like 1:

df1 = (df.groupby(['time','id','status'])
         .apply(lambda x: (x['status']== 'Yes').sum())
         .reset_index(name='count'))

Or create new column and aggregate sum:

df1 = (df.assign(A=df['status']=='Yes')
         .groupby(['time','id','status'])['A']
         .sum()
         .astype(int)
         .reset_index(name='count'))

Very similar solution with no new column, but worse readable a bit:

df1 = ((df['status']=='Yes')
        .groupby([df['time'],df['id'],df['status']])
        .sum()
        .astype(int)
        .reset_index(name='count'))

print (df)
         time   id status  count
0  1451606400  id1    Yes      2
1  1456790400  id2     No      0
2  1456790400  id2    Yes      1
like image 174
jezrael Avatar answered Apr 24 '23 20:04

jezrael


If you don't mind a slightly different output format, you can pd.crosstab:

df = pd.DataFrame({'time': [1451606400]*2 + [1456790400]*3,
                   'id': ['id1']*2 + ['id2']*3,
                   'status': ['Yes', 'Yes', 'No', 'Yes', 'No']})

res = pd.crosstab([df['time'], df['id']], df['status'])

print(res)

status          No  Yes
time       id          
1451606400 id1   0    2
1456790400 id2   2    1

The result is a more efficient way to store your data as you don't repeat your index in a separate row for every "Yes" / "No" category.

like image 42
jpp Avatar answered Apr 24 '23 22:04

jpp