I want to deal with duplicates in a pandas df:
import pandas as pd
df = pd.DataFrame({'A':[1,1,1,2,1],'B':[2,2,1,2,1],'C':[2,2,1,1,1],'D':['a','c','a','c','c']})
df
I want to keep only the rows with unique combinations of A, B and C, and create binary columns D_a and D_c, so the result looks something like this, without doing super slow loops over each row:
result = pd.DataFrame({'A':[1,1,2],'B':[2,1,2],'C':[2,1,1],'D_a':[1,1,0],'D_c':[1,1,1]})
Thanks a lot
You can use:
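df1 = (df.groupby(['A','B','C'])['D']
         .value_counts()              # count each D value per (A, B, C) group
         .unstack(fill_value=0)       # one column per D value, 0 where absent
         .add_prefix('D_')
         .clip(upper=1)               # counts -> 0/1 flags (clip_upper was removed in pandas 1.0)
         .reset_index()
         .rename_axis(None, axis=1))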
print(df1)
   A  B  C  D_a  D_c
0  1  1  1    1    1
1  1  2  2    1    1
2  2  2  1    0    1
Using get_dummies + sum:
df = df.set_index(['A', 'B', 'C'])\
    .D.str.get_dummies()\
    .groupby(level=[0, 1, 2]).sum()\
    .clip(upper=1)\
    .add_prefix('D_')\
    .reset_index()
df
   A  B  C  D_a  D_c
0  1  1  1    1    1
1  1  2  2    1    1
2  2  2  1    0    1
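Both answers return the groups sorted by key, while the expected result in the question lists them in order of first appearance. If that order matters, a variant along these lines might work. It is only a sketch, not taken from either answer: it assumes pd.get_dummies with dtype=int and a groupby with sort=False, and it takes max instead of sum so the flags stay 0/1 even when the same (A, B, C, D) row occurs more than once.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 1],
                   'B': [2, 2, 1, 2, 1],
                   'C': [2, 2, 1, 1, 1],
                   'D': ['a', 'c', 'a', 'c', 'c']})

# Turn D into indicator columns D_a / D_c (the prefix comes from the column name),
# then collapse rows sharing (A, B, C).
# max() keeps each flag at 0 or 1; sort=False keeps first-appearance order.
result = (pd.get_dummies(df, columns=['D'], dtype=int)
            .groupby(['A', 'B', 'C'], as_index=False, sort=False)
            .max())

print(result)
```

On the sample frame this should give the same rows as the answers above, only ordered as in the question's expected result.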