I have a dataframe similar to the below mentioned database:
+------------+-----+--------+
| time | id | status |
+------------+-----+--------+
| 1451606400 | id1 | Yes |
| 1451606400 | id1 | Yes |
| 1456790400 | id2 | No |
| 1456790400 | id2 | Yes |
| 1456790400 | id2 | No |
+------------+-----+--------+
I'm grouping by all the columns mentioned above and i'm able to get the count in a different column named 'count'
successfully using the below command:
df.groupby(['time','id', 'status']).size().reset_index(name='count')
But I want the count in the above dataframe only in those rows with status = 'Yes'
and rest should be '0'
Desired Output:
+------------+-----+--------+---------+
| time | id | status | count |
+------------+-----+--------+---------+
| 1451606400 | id1 | Yes | 2 |
| 1456790400 | id2 | Yes | 1 |
| 1456790400 | id2 | No | 0 |
+------------+-----+--------+---------+
I tried to count for status = 'Yes'
with the below code:
df[df['status']== 'Yes'].groupby(['time','id','status']).size().reset_index(name='count')
which obviously gives me those rows with status = 'Yes'
and discarded the rest. I want the discarded ones with count = 0
Is there any way to get the result?
Thanks in advance!
Use lambda function with apply
and for count sum
boolena True
values proccesses like 1
:
df1 = (df.groupby(['time','id','status'])
.apply(lambda x: (x['status']== 'Yes').sum())
.reset_index(name='count'))
Or create new column and aggregate sum
:
df1 = (df.assign(A=df['status']=='Yes')
.groupby(['time','id','status'])['A']
.sum()
.astype(int)
.reset_index(name='count'))
Very similar solution with no new column, but worse readable a bit:
df1 = ((df['status']=='Yes')
.groupby([df['time'],df['id'],df['status']])
.sum()
.astype(int)
.reset_index(name='count'))
print (df)
time id status count
0 1451606400 id1 Yes 2
1 1456790400 id2 No 0
2 1456790400 id2 Yes 1
If you don't mind a slightly different output format, you can pd.crosstab
:
df = pd.DataFrame({'time': [1451606400]*2 + [1456790400]*3,
'id': ['id1']*2 + ['id2']*3,
'status': ['Yes', 'Yes', 'No', 'Yes', 'No']})
res = pd.crosstab([df['time'], df['id']], df['status'])
print(res)
status No Yes
time id
1451606400 id1 0 2
1456790400 id2 2 1
The result is a more efficient way to store your data as you don't repeat your index in a separate row for every "Yes" / "No" category.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With