Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to identify whats NOT in the inner join during dataframe merge

I have one frame that looks like this (df):

                                 2000q1         2000q2         2000q3  
State        RegionName                                                  
New York     New York                NaN            NaN            NaN   
California   Los Angeles   207066.666667  214466.666667  220966.666667   
Illinois     Chicago       138400.000000  143633.333333  147866.666667

(notice that State,RegionName here is a MultiIndex)

and one frame that looks like this (ut):

     State    RegionName
0  Alabama        Auburn
1  Alabama      Florence
2  Alabama  Jacksonville
3  Alabama    Livingston
4  Alabama    Montevallo

So to get all rows where State,RegionName are in both dataframes, I do this:

dfut = pd.merge(df, ut, how='inner', left_index=True, right_on=['State', 'RegionName'])

That works. I now want a list of rows where rows from df frame are NOT in ut frame -- like a "NOT inner join". I am pretty sure that I need to do a LEFT join which will give me the entire df, but I am not sure how to subtract ut intersecting rows out of it. Hope its clear. Thank you

like image 820
alernerdev Avatar asked Nov 10 '16 01:11

alernerdev


2 Answers

include the parameter indicator=True in your merge and query('_merge != "both"')

dfut = pd.merge(df, ut, how='outer',
                left_index=True, right_on=['State', 'RegionName'],
                indicator=True)

dfut.query('_merge != "both"')

enter image description here

like image 34
piRSquared Avatar answered Oct 21 '22 17:10

piRSquared


There is a much simpler way without query

You want to do the inner merge of df and ut to get all the 'matching' indexes of df (i.e., indexes of df that are in ut)

inner_merge_index = pd.merge(df.reset_index(), ut, how='inner').set_index(['State', 'RegionName']).index

Then, you want to remove all these matched indexes from df

dfut = df.drop(list(inner_merge_index))

Note this also works with MultiIndex. To test it, you may want to expand your test df (because it has 0 intersection with ut) by adding a row that is actually preserved when doing the inner merge (added 'Alabama'/'Jacksonville' below). If you test it, you will see this is the only row that gets dropped.

df = pd.DataFrame([['New York', 'New York', 'NaN', 'NaN', 'NaN'],
                   ['California', 'Los Angeles', '207066.666667', '214466.666667', '220966.666667'],
                   ['Illinois', 'Chicago', '138400.000000', '143633.333333', '147866.666667'],
                   ['Alabama', 'Jacksonville', '98765.000000', '54321.333333', '12345.0']],
                  columns=['State', 'RegionName', '2000q1', '2000q2', '2000q3']).set_index(['State', 'RegionName'])
like image 87
KingOtto Avatar answered Oct 21 '22 16:10

KingOtto