Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Occurrence counts of values for each group as separate columns

I have a table which has close to 31million records. There are around 10 columns , two of which are card number and transaction_status. Each card can have multiple rows. so there could be 2000 rows of same card , with each row as a transaction with a corresponding status

transaction_Status has value of 'Y'/'N'.

i want to add another two columns to this table, 'count_of_approved', 'count_of_rejected' using pandas dataframe.

How do i do that? trn so far, i have been using get_dummies() and merge(), however it takes a lot of time and worse, causes out of memory error.

so say my input is as below:

trn_id | card_id | status
1      | c1      | Y
2      | c2      | Y
3      | c2      | N
4      | c3      | Y 
5      | c3      | Y 

i would like my output to be

trn_id | card_id | status | num_approved | num_of_denied
1      | c1      | Y      | 1            | 0
2      | c2      | Y      | 1            | 1
3      | c2      | N      | 1            | 1
4      | c3      | Y      | 2            | 0
5      | c3      | Y      | 2            | 0

my code so far as given below:

import pandas as panda
a = panda.DataFrame({'id':[1,2,3],'c_id':[22,22,33], 'status':['Y','Y','N']})
temp = a.status.str.get_dummies()
a[['N','Y']]= temp
tt = a.groupby(['c_id'])['Y'].count()
tt=tt.reset_index()
yes_count_added = a.merge(tt,how='right',on='c_id')
yes_count_added.rename(columns = {'Y_y':'num_of_approved'})
like image 751
user1906450 Avatar asked Jan 02 '23 11:01

user1906450


2 Answers

You could GroupBy card_id and use transform with a lambda expression to sum the amount of times status equals either Y for the num_approved or N for the num_of_denied using eq:

df['num_approved'] = df.groupby('card_id').status.transform(
                                lambda x: x.eq('Y').sum())
df['num_of_denied'] = df.groupby('card_id').status.transform(
                                 lambda x: x.eq('N').sum())

     trn_id card_id  status    num_approved    num_of_denied
0       1      c1      Y             1              0
1       2      c2      Y             1              1
2       3      c2      N             1              1
3       4      c3      Y             2              0
4       5      c3      Y             2              0
like image 115
yatu Avatar answered Jan 14 '23 00:01

yatu


Use str.get_dummies + a single groupby call for performance:

df.status.str.get_dummies().groupby(df.card_id).transform('sum')

   N  Y
0  0  1
1  1  1
2  1  1
3  0  2
4  0  2

v = (df.status
       .str.get_dummies()
       .groupby(df.card_id)
       .transform('sum')
       .rename({'Y': 'num_approved', 'N': 'num_denied'}, axis=1))

pd.concat([df, v], axis=1)

   trn_id card_id status  num_denied  num_approved
0       1      c1      Y           0             1
1       2      c2      Y           1             1
2       3      c2      N           1             1
3       4      c3      Y           0             2
4       5      c3      Y           0             2
like image 39
cs95 Avatar answered Jan 13 '23 23:01

cs95