I have a huge dataframe with many columns and rows. I sort it by 3 columns (['color', 'category','level_1']) and i have one column for check ('check_control') My problem is: how can add a increasing number for any subdataframe for (['color', 'category','level_1']) ?
import pandas as pd
data = {'Name_1': ['dshgfj', 'krihgfjsh', 'dfgfd', 'dff','gcgt','hjgjh','et','asd','hhj','hg','jh','jhk'],
'Name_2': ['dsf', 'dsfs', 'bilsdfl', 'sfsd', 'tcbcvbo','hghjg','bvnbhj','hga','kjk','kjh','hjjh','jkh'],
'color': ['black', 'red', 'white', 'yellow', 'blue','green','orange','black','green','black','green','black'],
'category': ['a', 'b', 'a', 'b', 'b','c','c','a','c','a','c','a'],
'level_1': [1, 1, 1, 1, 1,1,1,2,2,3,3,4],
'check_control':['','check','','check','','check','check','','','check','check','check']
}
df = pd.DataFrame(data)
print(df)
df_fix = df.sort_values(['color', 'category','level_1'], ascending=[True, True,True])
print(df_fix)
Name_1 Name_2 color category level_1 check_control
0 dshgfj dsf black a 1
1 krihgfjsh dsfs red b 1 check
2 dfgfd bilsdfl white a 1
3 dff sfsd yellow b 1 check
4 gcgt tcbcvbo blue b 1
5 hjgjh hghjg green c 1 check
6 et bvnbhj orange c 1 check
7 asd hga black a 2
8 hhj kjk green c 2
9 hg kjh black a 3 check
10 jh hjjh green c 3 check
11 jhk jkh black a 4 check
Name_1 Name_2 color category level_1 check_control
0 dshgfj dsf black a 1
7 asd hga black a 2
9 hg kjh black a 3 check
11 jhk jkh black a 4 check
4 gcgt tcbcvbo blue b 1
5 hjgjh hghjg green c 1 check
8 hhj kjk green c 2
10 jh hjjh green c 3 check
6 et bvnbhj orange c 1 check
1 krihgfjsh dsfs red b 1 check
2 dfgfd bilsdfl white a 1
3 dff sfsd yellow b 1 check
my solution must be like
data_solution = {'Name_1': ['dshgfj', 'krihgfjsh', 'dfgfd', 'dff','gcgt','hjgjh','et','asd','hhj','hg','jh','jhk'],
'Name_2': ['dsf', 'dsfs', 'bilsdfl', 'sfsd', 'tcbcvbo','hghjg','bvnbhj','hga','kjk','kjh','hjjh','jkh'],
'color': ['black', 'red', 'white', 'yellow', 'blue','green','orange','black','green','black','green','black'],
'category': ['a', 'b', 'a', 'b', 'b','c','c','a','c','a','c','a'],
'level_1': [1, 1, 1, 1, 1,1,1,2,2,3,3,4],
'check_control':['','1_check','','1_check','','1_check','1_check','','','1_check','2_check','2_check']
}
df_solution = pd.DataFrame(data_solution)
df_fix_solution = df_solution.sort_values(['color', 'category','level_1'], ascending=[True, True,True])
print(df_fix_solution)
Name_1 Name_2 color category level_1 check_control
0 dshgfj dsf black a 1
7 asd hga black a 2
9 hg kjh black a 3 1_check
11 jhk jkh black a 4 2_check
4 gcgt tcbcvbo blue b 1
5 hjgjh hghjg green c 1 1_check
8 hhj kjk green c 2
10 jh hjjh green c 3 2_check
6 et bvnbhj orange c 1 1_check
1 krihgfjsh dsfs red b 1 1_check
2 dfgfd bilsdfl white a 1
3 dff sfsd yellow b 1 1_check
IIUC, keep only rows with check_control=='check' then use groupby_cumcount to add the expected identifier:
df_fix['check_control'] = (df_fix[df_fix['check_control'] == 'check']
.groupby(['color', 'category']).cumcount().add(1)
.astype(str) + '_' + df_fix['check_control']).fillna('')
Output:
>>> df_fix
Name_1 Name_2 color category level_1 check_control
0 dshgfj dsf black a 1
7 asd hga black a 2
9 hg kjh black a 3 1_check
11 jhk jkh black a 4 2_check
4 gcgt tcbcvbo blue b 1
5 hjgjh hghjg green c 1 1_check
8 hhj kjk green c 2
10 jh hjjh green c 3 2_check
6 et bvnbhj orange c 1 1_check
1 krihgfjsh dsfs red b 1 1_check
2 dfgfd bilsdfl white a 1
3 dff sfsd yellow b 1 1_check
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With