Pandas: concatenating conditioned on unique values

I am concatenating two pandas DataFrames as shown below.

import numpy as np
import pandas as pd

part1 = pd.DataFrame({'id': [100, 200, 300, 400, 500],
                      'amount': np.random.randn(5)})

part2 = pd.DataFrame({'id': [700, 100, 800, 500, 300],
                      'amount': np.random.randn(5)})

concatenated = pd.concat([part1, part2], axis=0)
     amount   id
0 -0.458653  100
1  2.172348  200
2  0.072494  300
3 -0.253939  400
4 -0.061866  500
0 -1.187505  700
1 -0.810784  100
2  0.321881  800
3 -1.935284  500
4 -1.351507  300

How can I limit the operation so that a row in part2 is only included in concatenated if the row id does not already appear in part1? In a way, I want to treat the id column like a set.

Is it possible to do this during concat(), or is this more of a post-processing step?

Desired output for this example would be:

concatenated_desired
     amount   id
0 -0.458653  100
1  2.172348  200
2  0.072494  300
3 -0.253939  400
4 -0.061866  500
0 -1.187505  700
2  0.321881  800
asked Sep 13 '25 by Zhubarb


1 Answer

Call drop_duplicates() after concat(). By default it keeps the first occurrence of each duplicated id, so the part1 rows win:

import numpy as np
import pandas as pd

part1 = pd.DataFrame({'id': [100, 200, 300, 400, 500],
                      'amount': np.arange(5)})

part2 = pd.DataFrame({'id': [700, 100, 800, 500, 300],
                      'amount': np.random.randn(5)})

concatenated = pd.concat([part1, part2], axis=0)

# keep='first' (the default) retains the part1 row when an id appears in both frames
print(concatenated.drop_duplicates(subset="id"))
answered Sep 16 '25 by HYRY
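
If you would rather not append the duplicate ids in the first place, you can filter part2 before calling concat(). This is a minimal sketch assuming the same part1 and part2 frames as above; isin() builds a boolean mask of the ids already present in part1:

# Keep only the part2 rows whose id is not already in part1, then concatenate.
new_rows = part2[~part2['id'].isin(part1['id'])]
concatenated = pd.concat([part1, new_rows], axis=0)
print(concatenated)  # only the rows with id 700 and 800 are taken from part2

Passing ignore_index=True to pd.concat() would additionally give the result a fresh 0..n-1 index instead of repeating the row labels from the two source frames.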