Count specific values and aggregate the result in a dataframe using transform

I have a dataframe similar to this:

    Errorid  Matricule Priority
0      1        01       P1
1      2        01       P2
2      3        01       NC
3      4        02       P1
4      5        02       P4
5      6        02       EDC
6      7        02       P2

This lists all the errors for each Matricule along with their priority.

What I want to do is count all the errors for each Matricule, excluding "NC" and "EDC", and put the result in the same dataframe.

Expected result:

    Errorid  Matricule Priority  NberrorsMatricule
0      1        01       P1           2
1      2        01       P2           2
2      3        01       NC           2
3      4        02       P1           3
4      5        02       P4           3
5      6        02       EDC          3
6      7        02       P2           3

I tried several things, such as:

DF['NberrorsMatricule'] = DF.groupby('Matricule')['Priority'].transform(lambda x : x.count() if x in ['P1','P2','P3','P4'])

DF['NberrorsMatricule'] = DF.groupby('Matricule')[DF['Priority'] in ['P1','P2','P3','P4']].transform("count")

Each time I get an ambiguous value error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Note that this one works:

DF['NberrorsMatricule'] = DF.groupby('Matricule')['Priority'].transform("count")

But it obviously doesn't filter by priority.

These dataframes are examples; in reality I work with a large amount of data (more than 400k occurrences in this one). So if someone can help me understand the behavior of transform() and how to filter the data efficiently, it would be much appreciated.

Thanks in advance for your help

zonas asked Feb 11 '26 05:02


2 Answers

You can replace the non-matching values with missing values using Series.where and Series.isin. GroupBy.transform with 'count' then excludes them, because count ignores missing values:

L = ['P1','P2','P3','P4']
df['NberrorsMatricule'] = (df['Priority'].where(df['Priority'].isin(L))
                                         .groupby(df['Matricule'])
                                         .transform('count'))
print (df)
   Errorid  Matricule Priority  NberrorsMatricule
0        1          1       P1                  2
1        2          1       P2                  2
2        3          1       NC                  2
3        4          2       P1                  3
4        5          2       P4                  3
5        6          2      EDC                  3
6        7          2       P2                  3

Details:

print (df['Priority'].where(df['Priority'].isin(L)))
0     P1
1     P2
2    NaN
3     P1
4     P4
5    NaN
6     P2
Name: Priority, dtype: object
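
As background for the ValueError in the question, here is a minimal sketch of why `Series in list` fails while isin works, and of what transform hands to a function. The frame is rebuilt from the question's sample; the names wanted, raised, mask and group_sizes are illustrative only.

```python
import pandas as pd

# Sample data mirroring the question's frame.
df = pd.DataFrame({
    "Errorid": [1, 2, 3, 4, 5, 6, 7],
    "Matricule": ["01", "01", "01", "02", "02", "02", "02"],
    "Priority": ["P1", "P2", "NC", "P1", "P4", "EDC", "P2"],
})
wanted = ["P1", "P2", "P3", "P4"]

# `Series in list` makes Python compare the whole Series with a list item
# and then ask for a single True/False, which pandas refuses to provide.
try:
    df["Priority"] in wanted
    raised = False
except ValueError:
    raised = True

# isin() answers the same question element by element instead.
mask = df["Priority"].isin(wanted)

# transform() calls the function once per group, passing that group's values
# as a Series, and broadcasts a scalar result back to every row of the group.
group_sizes = df.groupby("Matricule")["Priority"].transform(lambda s: len(s))
```

So inside the lambda, x is a whole group (a Series), not a single value, which is why a test like `x in [...]` cannot work there.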

Another solution is to count the matching values with sum. To convert True and False to 1 and 0, use Series.astype (Series.view also worked in older pandas versions, but it has since been deprecated):

df['NberrorsMatricule'] = (df['Priority'].isin(L)
                                         .astype(int)
                                         .groupby(df['Matricule'])
                                         .transform('sum'))
print (df)

   Errorid  Matricule Priority  NberrorsMatricule
0        1          1       P1                  2
1        2          1       P2                  2
2        3          1       NC                  2
3        4          2       P1                  3
4        5          2       P4                  3
5        6          2      EDC                  3
6        7          2       P2                  3
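
For anyone who wants to run both variants side by side, here is a self-contained sketch. It assumes the question's sample data plus one extra, hypothetical Matricule "03" that has only an excluded priority, to show that such a group gets 0 from either approach.

```python
import pandas as pd

# Question's sample plus a hypothetical Matricule "03" with only "NC".
df = pd.DataFrame({
    "Errorid": [1, 2, 3, 4, 5, 6, 7, 8],
    "Matricule": ["01", "01", "01", "02", "02", "02", "02", "03"],
    "Priority": ["P1", "P2", "NC", "P1", "P4", "EDC", "P2", "NC"],
})
L = ["P1", "P2", "P3", "P4"]
mask = df["Priority"].isin(L)

# Approach 1: blank out unwanted values, then count non-missing per group.
by_count = df["Priority"].where(mask).groupby(df["Matricule"]).transform("count")

# Approach 2: sum the boolean mask (cast to integers) per group.
by_sum = mask.astype(int).groupby(df["Matricule"]).transform("sum")

df["NberrorsMatricule"] = by_sum
```

Both use the built-in 'count' and 'sum' aggregations rather than a Python lambda, which is usually the faster route on large frames.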
jezrael answered Feb 13 '26 18:02


Filter out the excluded priorities first, then count per Matricule:

In [567]: df['NberrorsMatricule'] = (df[~df.Priority.isin(['NC', 'EDC'])]
     ...:                            .groupby('Matricule')['Errorid']
     ...:                            .transform('count'))

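The excluded rows are left as NaN because they were dropped before grouping. To fill them, use ffill(). Note that this relies on each excluded row following a counted row of the same Matricule, as is the case in this sample:

In [595]: df['NberrorsMatricule'] = df['NberrorsMatricule'].ffill()

In [596]: df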
Out[596]: 
   Errorid  Matricule Priority  NberrorsMatricule
0        1          1       P1                2.0
1        2          1       P2                2.0
2        3          1       NC                2.0
3        4          2       P1                3.0
4        5          2       P4                3.0
5        6          2      EDC                3.0
6        7          2       P2                3.0
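
If the row order cannot be relied on, one possible alternative (a sketch, not part of the answer above) is to count the kept rows once per Matricule and look the count up for every row with Series.map. The small frame below is hypothetical and puts an excluded row first in its group to show the difference from a plain ffill().

```python
import pandas as pd

# Hypothetical frame: the excluded "NC" row comes first in Matricule "02".
df = pd.DataFrame({
    "Errorid": [1, 2, 3, 4],
    "Matricule": ["01", "01", "02", "02"],
    "Priority": ["P1", "P2", "NC", "P1"],
})
keep = ~df["Priority"].isin(["NC", "EDC"])

# Count the kept rows once per Matricule (a small Series indexed by Matricule).
counts = df.loc[keep].groupby("Matricule").size()

# Look the count up for every row; a Matricule with no kept rows gets 0.
df["NberrorsMatricule"] = df["Matricule"].map(counts).fillna(0).astype(int)

# For comparison: the filtered transform followed by a plain ffill().
filtered = df.loc[keep].groupby("Matricule")["Errorid"].transform("count")
ffilled = filtered.reindex(df.index).ffill()
```

Here the map-based column gives 1 for both rows of Matricule "02", while the forward-filled version carries the count of Matricule "01" into the "NC" row.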
Mayank Porwal answered Feb 13 '26 17:02


