I have a DataFrame similar to this:
   Errorid Matricule Priority
0        1        01       P1
1        2        01       P2
2        3        01       NC
3        4        02       P1
4        5        02       P4
5        6        02      EDC
6        7        02       P2
This lists all the errors for each Matricule along with their priority.
What I want to do is count the errors for each Matricule, excluding "NC" and "EDC", and put the result in the same DataFrame.
Expected result:
   Errorid Matricule Priority  NberrorsMatricule
0        1        01       P1                  2
1        2        01       P2                  2
2        3        01       NC                  2
3        4        02       P1                  3
4        5        02       P4                  3
5        6        02      EDC                  3
6        7        02       P2                  3
I tried multiple things like below:
DF['NberrorsMatricule'] = DF.groupby('Matricule')['Priority'].transform(lambda x : x.count() if x in ['P1','P2','P3','P4'])
DF['NberrorsMatricule'] = DF.groupby('Matricule')[DF['Priority'] in ['P1','P2','P3','P4']].transform("count")
Each time I get an ambiguous-value error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Note that this one works:
DF['NberrorsMatricule'] = DF.groupby('Matricule')['Priority'].transform("count")
But it obviously doesn't filter on the priority.
These DataFrames are examples; in reality I work with a large amount of data (more than 400k occurrences in this one). So if someone can help me understand the behavior of transform() and how to filter the data efficiently, it would be very nice.
Thanks in advance for your help
You can replace the non-matching values with missing values using Series.where and Series.isin. GroupBy.transform with GroupBy.count then excludes the missing values:
L = ['P1','P2','P3','P4']
df['NberrorsMatricule'] = (df['Priority'].where(df['Priority'].isin(L))
                                         .groupby(df['Matricule'])
                                         .transform('count'))
print (df)
   Errorid  Matricule Priority  NberrorsMatricule
0        1          1       P1                  2
1        2          1       P2                  2
2        3          1       NC                  2
3        4          2       P1                  3
4        5          2       P4                  3
5        6          2      EDC                  3
6        7          2       P2                  3
Details:
print (df['Priority'].where(df['Priority'].isin(L)))
0 P1
1 P2
2 NaN
3 P1
4 P4
5 NaN
6 P2
Name: Priority, dtype: object
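As a self-contained sketch of the approach above, the following rebuilds the question's sample data (storing Matricule as strings is an assumption) and also shows where the ValueError comes from: `in` asks for a single True/False, while comparing a Series produces one boolean per row. Series.isin is the element-wise membership test.

```python
import pandas as pd

# Sample data rebuilt from the question; Matricule kept as strings (assumption).
df = pd.DataFrame({
    "Errorid": [1, 2, 3, 4, 5, 6, 7],
    "Matricule": ["01", "01", "01", "02", "02", "02", "02"],
    "Priority": ["P1", "P2", "NC", "P1", "P4", "EDC", "P2"],
})
L = ["P1", "P2", "P3", "P4"]

# `Series in list` needs one boolean, but each comparison returns a whole
# Series of booleans, so pandas raises the "ambiguous truth value" error.
try:
    df["Priority"] in L
    raised = False
except ValueError:
    raised = True

# isin() tests membership row by row; where() replaces non-matches with NaN,
# and the group-wise count ignores NaN.
kept = df["Priority"].where(df["Priority"].isin(L))
df["NberrorsMatricule"] = kept.groupby(df["Matricule"]).transform("count")
```

transform() returns a result aligned with the original index, which is why the per-group count can be assigned straight back as a new column.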
Another solution is to count the matched values with sum. To convert True and False to 1 and 0 you can use Series.astype (Series.view also works on older pandas, but it is deprecated in recent versions):
df['NberrorsMatricule'] = (df['Priority'].isin(L)
                                         .astype(int)
                                         .groupby(df['Matricule'])
                                         .transform('sum'))
print (df)
   Errorid  Matricule Priority  NberrorsMatricule
0        1          1       P1                  2
1        2          1       P2                  2
2        3          1       NC                  2
3        4          2       P1                  3
4        5          2       P4                  3
5        6          2      EDC                  3
6        7          2       P2                  3
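A runnable sketch of the sum-based variant, using the same sample data as the question (Matricule stored as strings is an assumption). The boolean mask is cast to integers so that the group-wise sum is the number of matching rows.

```python
import pandas as pd

# Sample data rebuilt from the question; Matricule kept as strings (assumption).
df = pd.DataFrame({
    "Errorid": [1, 2, 3, 4, 5, 6, 7],
    "Matricule": ["01", "01", "01", "02", "02", "02", "02"],
    "Priority": ["P1", "P2", "NC", "P1", "P4", "EDC", "P2"],
})
L = ["P1", "P2", "P3", "P4"]

# True where the priority should be counted, False otherwise.
mask = df["Priority"].isin(L)

# Summing 1/0 per Matricule counts the matching rows; transform() broadcasts
# each group's total back onto every row of that group.
df["NberrorsMatricule"] = mask.astype(int).groupby(df["Matricule"]).transform("sum")
```

Because everything here is vectorized, with no Python-level lambda per group, this style should scale reasonably to a few hundred thousand rows.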
Like this:
In [567]: df['NberrorsMatricule'] = df[~df.Priority.isin(['NC', 'EDC'])].\
...: groupby('Matricule')['Errorid']\
...: .transform('count')
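To fill the NaN values left on the excluded rows, use ffill(). This assumes the rows of each Matricule are contiguous and that the first row of each group is not an excluded one: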
In [595]: df['NberrorsMatricule'] = df['NberrorsMatricule'].ffill()
In [596]: df
Out[596]:
   Errorid  Matricule Priority  NberrorsMatricule
0        1          1       P1                2.0
1        2          1       P2                2.0
2        3          1       NC                2.0
3        4          2       P1                3.0
4        5          2       P4                3.0
5        6          2      EDC                3.0
6        7          2       P2                3.0
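One possible variant that does not depend on row order: count the kept rows per Matricule with value_counts and map the counts back onto every row. The data below is hypothetical and chosen so that group 02 starts with an excluded row; with the filter plus ffill() approach that row would inherit the count of group 01 (2 instead of 1).

```python
import pandas as pd

# Hypothetical data: the first row of Matricule "02" is an excluded priority.
df = pd.DataFrame({
    "Errorid": [1, 2, 3, 4],
    "Matricule": ["01", "01", "02", "02"],
    "Priority": ["P1", "P2", "NC", "P1"],
})

# Keep every row whose priority is not in the excluded list.
mask = ~df["Priority"].isin(["NC", "EDC"])

# Number of kept rows per Matricule, indexed by Matricule value.
counts = df.loc[mask, "Matricule"].value_counts()

# Look up each row's Matricule in the counts; a Matricule with no kept rows
# is missing from `counts`, so it gets 0 instead of NaN.
df["NberrorsMatricule"] = df["Matricule"].map(counts).fillna(0).astype(int)
```

This also keeps the column as integers, whereas filling NaN after the fact leaves it as floats.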