Let's say I have this data:
import pandas as pd

data = {
    'batch_no': [42, 42, 52, 52, 52, 73],
    'quality': ['OK', 'NOT OK', 'OK', 'NOT OK', 'NOT OK', 'OK'],
}
df = pd.DataFrame(data, columns=['batch_no', 'quality'])
This gives me the following dataframe:
batch_no quality
42 OK
42 NOT OK
52 OK
52 NOT OK
52 NOT OK
73 OK
Now I need to find the count of NOT OK for each batch_no.
I can achieve this using groupby and apply with a lambda function as follows:
df.groupby('batch_no')['quality'].apply(lambda x: x[x.eq('NOT OK')].count())
This gives me the following desired output:
batch_no
42 1
52 2
73 0
However, this is extremely slow even on my moderately sized data of around 3 million rows, and is not feasible for my needs.
Is there a fast alternative?
You can compare the quality column to 'NOT OK', then group by batch_no and aggregate with sum. True values are treated as 1, so summing them counts the matches:
df = (df['quality'].eq('NOT OK')
        .groupby(df['batch_no'])
        .sum()
        .astype(int)
        .reset_index(name='count'))
print(df)
batch_no count
0 42 1
1 52 2
2 73 0
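An equivalent way to write this (a sketch, not part of the original answer; the temporary column name not_ok is my own) is to make the boolean mask an explicit column with assign and group on it directly:

```python
import pandas as pd

df = pd.DataFrame({
    'batch_no': [42, 42, 52, 52, 52, 73],
    'quality': ['OK', 'NOT OK', 'OK', 'NOT OK', 'NOT OK', 'OK'],
})

# add a temporary boolean column, then sum it per batch;
# True counts as 1, so the sum is the number of 'NOT OK' rows
out = (df.assign(not_ok=df['quality'].eq('NOT OK'))
         .groupby('batch_no', as_index=False)['not_ok']
         .sum()
         .rename(columns={'not_ok': 'count'}))
print(out)
```

This avoids the index-alignment trick of grouping one Series by another, which some readers find easier to follow.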
Detail:
print (df['quality'].eq('NOT OK'))
0 False
1 True
2 False
3 True
4 True
5 False
Name: quality, dtype: bool