Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Selecting Unique Rows between Two DataFrames in Pandas

Tags:

python

pandas

I have two data frames A and B of unequal dimensions. I would like to create a data frame C such that it ONLY contains rows that are unique between A and B. I tried to follow this solution (excluding rows from a pandas dataframe based on column value and not index value) but could not get it to work.

Here's example:

Assume this to be DF_A:

    Star_ID         Loc_ID      pmRA        pmDE    Field     Jmag    Hmag  
 2M00000032+5737103  4264    0.000000    0.000000    N7789   10.905  10.635
 2M00000068+5710233  4264    8.000000    -18.000000  N7789   10.664  10.132
 2M00000222+5625359  4264    0.000000    0.000000    N7789   11.982  11.433
 2M00000818+5634264  4264    0.000000    0.000000    N7789   12.501  11.892
 2M00001242+5524391  4264    0.000000    -4.000000   N7789   12.091  11.482

And this to be DF_B:

2M00000032+5737103  
2M00000068+5710233
2M00001242+5524391

So, the first two and last Star_ID are common between DF_A and DF_B. I would like to create DF_C such that:

DF_C:

        Star_ID         Loc_ID      pmRA        pmDE    Field     Jmag    Hmag
     2M00000222+5625359  4264    0.000000    0.000000    N7789   11.982  11.433
     2M00000818+5634264  4264    0.000000    0.000000    N7789   12.501  11.892
like image 418
Rohit Avatar asked May 04 '14 19:05

Rohit


1 Answers

This worked for me:

In [7]:

df1[~df1.Star_ID.isin(df2.Star_ID)]

Out[7]:

              Star_ID  Loc_ID  pmRA  pmDE  Field    Jmag    Hmag
2  2M00000222+5625359    4264     0     0  N7789  11.982  11.433
3  2M00000818+5634264    4264     0     0  N7789  12.501  11.892

[2 rows x 7 columns]

So what we do here is we create a boolean mask, we ask for where Star_ID values is in both dataframes, however by using the ~ we NOT the condition which in effect negates it. The one you linked to is pretty much the same thing but I think you maybe didn't understand the syntax?

EDIT

In order to get both values that are only in df1 and values that are only in df2 you could do this

unique_vals = df1[~df1.Star_ID.isin(df2.Star_ID)].append(df2[~df2.Star_ID.isin(df1.Star_ID)], ignore_index=True)

Further edit

So the problem was that the csv had leading spaces, this caused all values to be unique in both datasets, to correct this you need to do this:

df1.Apogee_ID = df1.Apogee_ID.str.lstrip()
like image 143
EdChum Avatar answered Sep 20 '22 12:09

EdChum