
Python Pandas - Deal with duplicates

Tags: python, pandas

I want to deal with duplicates in a pandas df:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 1], 'B': [2, 2, 1, 2, 1], 'C': [2, 2, 1, 1, 1], 'D': ['a', 'c', 'a', 'c', 'c']})
df

I want to keep only one row per unique combination of A, B and C, and create binary columns D_a and D_c that flag which values of D occurred for that combination. The result should look something like this, without running slow loops over each row:

result= pd.DataFrame({'A':[1,1,2],'B':[2,1,2],'C':[2,1,1],'D_a':[1,1,0],'D_c':[1,1,1]})

Thanks a lot

toumz asked Mar 08 '23 08:03


2 Answers

You can use:

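# Count each D value per (A, B, C) group, pivot D into columns, cap counts at 1.
# clip(upper=1) replaces clip_upper(1), which was removed in pandas 1.0.
df1 = (df.groupby(['A', 'B', 'C'])['D']
         .value_counts()
         .unstack(fill_value=0)
         .add_prefix('D_')
         .clip(upper=1)
         .reset_index()
         .rename_axis(None, axis=1))

print(df1)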
   A  B  C  D_a  D_c
0  1  1  1    1    1
1  1  2  2    1    1
2  2  2  1    0    1
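
As a quick check, the output can be compared with the `result` frame from the question. This is only a sketch: it assumes a pandas version where `clip(upper=1)` is available, and the name `expected` is introduced here purely for illustration. Because `groupby` sorts by the group keys, the expected frame has to be sorted the same way before comparing.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 1],
                   'B': [2, 2, 1, 2, 1],
                   'C': [2, 2, 1, 1, 1],
                   'D': ['a', 'c', 'a', 'c', 'c']})

# Desired output, copied from the question.
result = pd.DataFrame({'A': [1, 1, 2], 'B': [2, 1, 2], 'C': [2, 1, 1],
                       'D_a': [1, 1, 0], 'D_c': [1, 1, 1]})

df1 = (df.groupby(['A', 'B', 'C'])['D']
         .value_counts()            # count of each D value per (A, B, C)
         .unstack(fill_value=0)     # one column per D value
         .add_prefix('D_')
         .clip(upper=1)             # counts -> 0/1 flags
         .reset_index()
         .rename_axis(None, axis=1))

# groupby returns rows sorted by A, B, C, so align the row order first.
expected = result.sort_values(['A', 'B', 'C']).reset_index(drop=True)

# Raises AssertionError if the two frames differ in values or labels.
pd.testing.assert_frame_equal(df1, expected, check_dtype=False)
```

The row order differs from the one written in the question, but the set of rows is the same.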
jezrael answered Mar 21 '23 09:03


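Using str.get_dummies + a grouped sum:

# sum(level=...) was removed in pandas 2.0; group on the index levels instead.
df = df.set_index(['A', 'B', 'C'])\
       .D.str.get_dummies()\
       .groupby(level=[0, 1, 2]).sum()\
       .add_prefix('D_')\
       .reset_index()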

df

   A  B  C  D_a  D_c
0  1  1  1    1    1
1  1  2  2    1    1
2  2  2  1    0    1
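
One caveat: a sum counts occurrences, so if the same (A, B, C, D) row appears more than once, the summed columns can exceed 1. A possible variant, swapping the sum for a group-wise max so the columns stay binary, is sketched below. The sixth row in this sample is a hypothetical addition (not part of the question's data) to show the repeated-row case, and `out` is just an illustrative name.

```python
import pandas as pd

# The question's data plus one hypothetical repeated row: (1, 2, 2, 'a').
df = pd.DataFrame({'A': [1, 1, 1, 2, 1, 1],
                   'B': [2, 2, 1, 2, 1, 2],
                   'C': [2, 2, 1, 1, 1, 2],
                   'D': ['a', 'c', 'a', 'c', 'c', 'a']})

# One-hot encode D into integer columns D_a / D_c, then take the max per
# (A, B, C) group so a value seen several times still yields 1, not a count.
out = (pd.get_dummies(df, columns=['D'], prefix='D', dtype=int)
         .groupby(['A', 'B', 'C'], as_index=False)
         .max())

print(out)
```

With a sum in place of the max, D_a for the group (1, 2, 2) would come out as 2 on this sample.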
cs95 answered Mar 21 '23 10:03