Drop rows with multiple keys in pandas

Tags: pandas

I have two dataframes, df1 and df2.

df1:

contig  position   tumor_f  t_ref_count  t_alt_count
     1     14599  0.000000            1            0
     1     14653  0.400000            3            2
     1     14907  0.333333            6            3
     1     14930  0.363636            7            4

df2:

contig  position
     1     14599
     1     14653

I would like to remove the rows from df1 whose (contig, position) values match a row in df2. Something akin to:

df1[df1[['contig','position']].isin(df2[['contig','position']])]

except that this doesn't work.

user1867185 asked Mar 22 '23 23:03


2 Answers

Version 0.13 adds an isin method to DataFrame that will accomplish this. If you're on the current master, you can try:

In [46]: df1[['contig', 'position']].isin(df2.to_dict(outtype='list'))
Out[46]: 
  contig position
0   True     True
1   True     True
2   True    False
3   True    False

To get the rows that are not contained in df2, negate the mask with ~ and use it to index df1:

In [45]: df1.ix[~df1[['contig', 'position']].isin(df2.to_dict(outtype='list')).all(axis=1)]
Out[45]: 
   contig  position   tumor_f  t_ref_count  t_alt_count
2       1     14907  0.333333            6            3
3       1     14930  0.363636            7            4
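
The session above was written against a pre-release build. In recent pandas releases the `outtype` keyword of `to_dict` is spelled `orient`, and the `.ix` indexer has been removed in favour of `.loc`. A minimal sketch of the same idea under those assumptions, with the frames rebuilt from the question so that it runs on its own:

```python
import pandas as pd

# Frames rebuilt from the question's sample data.
df1 = pd.DataFrame({
    "contig": [1, 1, 1, 1],
    "position": [14599, 14653, 14907, 14930],
    "tumor_f": [0.0, 0.4, 0.333333, 0.363636],
    "t_ref_count": [1, 3, 6, 7],
    "t_alt_count": [0, 2, 3, 4],
})
df2 = pd.DataFrame({"contig": [1, 1], "position": [14599, 14653]})

keys = ["contig", "position"]

# Passing a dict makes DataFrame.isin test each column against its own list.
mask = df1[keys].isin(df2[keys].to_dict(orient="list")).all(axis=1)

# Keep only the rows where at least one key column did not match.
result = df1.loc[~mask]
print(result)
```

As in the original session, each column is tested against its own list of values, so this is a column-by-column check rather than a comparison of whole (contig, position) pairs.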

TomAugspurger answered Mar 25 '23 11:03


You can do this with the Series isin twice (works in 0.12):

In [21]: df1['contig'].isin(df2['contig']) & df1['position'].isin(df2['position'])
Out[21]:
0     True
1     True
2    False
3    False
dtype: bool

In [22]: ~(df1['contig'].isin(df2['contig']) & df1['position'].isin(df2['position']))
Out[22]:
0    False
1    False
2     True
3     True
dtype: bool

In [23]: df1[~(df1['contig'].isin(df2['contig']) & df1['position'].isin(df2['position']))]
Out[23]:
   contig  position   tumor_f  t_ref_count  t_alt_count
2       1     14907  0.333333            6            3
3       1     14930  0.363636            7            4
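
One caveat with chaining two Series isin calls: each column is tested on its own, so a row can be flagged when its contig appears in one df2 row and its position in a different one. The sketch below uses small hypothetical frames (not the data from the question) to show the difference, and compares whole (contig, position) pairs through a MultiIndex as one possible alternative. `MultiIndex.from_frame` needs a reasonably recent pandas.

```python
import pandas as pd

# Hypothetical data chosen to expose the over-match; not the question's frames.
df1 = pd.DataFrame({"contig": [1, 2, 1], "position": [100, 200, 200]})
df2 = pd.DataFrame({"contig": [1, 2], "position": [100, 200]})
keys = ["contig", "position"]

# Column-by-column test: (1, 200) is flagged although df2 has no such row.
colwise = df1["contig"].isin(df2["contig"]) & df1["position"].isin(df2["position"])

# Pair-wise test: compare whole (contig, position) tuples via a MultiIndex.
pairwise = pd.MultiIndex.from_frame(df1[keys]).isin(pd.MultiIndex.from_frame(df2[keys]))

kept_colwise = df1[~colwise]    # empty, because every row was flagged
kept_pairwise = df1[~pairwise]  # keeps the (1, 200) row
print(kept_colwise)
print(kept_pairwise)
```

On the question's sample data both tests give the same answer, because every row shares contig 1 and the positions are unique.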

Perhaps we can get a neat solution in 0.13 (using DataFrame's isin like in Tom's answer).

It feels like there ought to be a neat way to do this using an inner merge...

In [31]: pd.merge(df1, df2, how="inner")
Out[31]:
   contig  position  tumor_f  t_ref_count  t_alt_count
0       1     14599      0.0            1            0
1       1     14653      0.4            3            2
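
The inner merge returns the rows to drop rather than the rows to keep. One way to turn it around, sketched here on the assumption that your pandas version supports the `indicator` argument of `merge`, is a left merge that labels where each row came from, followed by a filter on that label (an "anti-join"). The variable names are illustrative only.

```python
import pandas as pd

# Frames rebuilt from the question's sample data.
df1 = pd.DataFrame({
    "contig": [1, 1, 1, 1],
    "position": [14599, 14653, 14907, 14930],
    "tumor_f": [0.0, 0.4, 0.333333, 0.363636],
    "t_ref_count": [1, 3, 6, 7],
    "t_alt_count": [0, 2, 3, 4],
})
df2 = pd.DataFrame({"contig": [1, 1], "position": [14599, 14653]})

keys = ["contig", "position"]

# indicator=True adds a "_merge" column: "both" for rows of df1 that found a
# partner in df2, "left_only" for rows that did not.
merged = df1.merge(df2[keys].drop_duplicates(), on=keys, how="left", indicator=True)

# Keep the unmatched rows and drop the helper column again.
result = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(result)
```

Unlike the column-by-column isin checks, the merge compares (contig, position) pairs as a unit. Note that `merge` returns a fresh integer index, so the original row labels of df1 are not preserved in general.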

Andy Hayden answered Mar 25 '23 13:03