Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas join without replacement

This is a bit hard to explain but I'm going to try my best. What I've got right now is two tables I need to join together, but we don't really have a unique join id. I have a couple columns to join on that is the best I can do, and I just want to know when we don't have equal numbers on both sides of the joins. Right now, if the right table has 1 match to the 2 entries on the left table, that 1 match joins to both entries. That leaves me not knowing the right table only has 1 entry vs the 2 for the left.

What I want is to join a right table to a left (outer), but I don't want to join the right table more than once per entry. So if the right table index 3 could be joined on index 1 and 2 on the left, I only want it to be joined on index 1. Also, if index 3 and index 4 could be joined on index 1 and 2, I want index 1 to be matched with index 3, and index 2 to be matched with index 4. If there is only 1 match (Index 1 --> 3), but Index 2 on the left table could be matched to index 3, I want Index 2 to not be joined.

Examples may best describe this:

a_df = pd.DataFrame.from_dict({1: {'match_id': 2, 'uniq_id': 1}, 2: {'match_id': 2, 'uniq_id': 2}}, orient='index')

In [99]: a_df
Out[99]:
   match_id  uniq_id
1         2        1
2         2        2


In [100]: b_df = pd.DataFrame.from_dict({3: {'match_id': 2, 'uniq_id': 3}, 4: {'match_id': 2, 'uniq_id': 4}}, orient='index')

In [101]: b_df
Out[101]:
   match_id  uniq_id
3         2        3
4         2        4

In this example, I want a_df to join onto b_df. I want b_df uniq_id 3 to be matched with a_df uniq_id 1, and b_df 4 to a_df 2.

Output would look like this:

Out[106]:
   match_id_right  match_id  uniq_id  uniq_id_right
1               2         2        1              3
2               2         2        2              4

Now let's say we want to join a_df to c_df:

In [104]: c_df = pd.DataFrame.from_dict({3: {'match_id': 2, 'uniq_id': 3}, 4: {'match_id': 3, 'uniq_id': 4}}, orient='index')

In [105]: c_df
Out[105]:
   match_id  uniq_id
3         2        3
4         3        4

In this case, we have match_ids of 2 on a_df, and only 1 match_id of 2 on c_df.

In this case I just want uniq_id 1 to be matched with uniq_id 3, leaving both uniq_id 2 and uniq_id 4 to be unmatched

   match_id_right  match_id  uniq_id  uniq_id_right
1               2         2        1              3
2             NaN         2        2            NaN
4               3       NaN      NaN              4
like image 938
user1610719 Avatar asked Mar 01 '16 22:03

user1610719


People also ask

What are the four types of join in Pandas?

Inner Join. Left Outer Join. Right Outer Join. Full Outer Join or simply Outer Join.

What is difference between merge and join in Pandas?

Both join and merge can be used to combines two dataframes but the join method combines two dataframes on the basis of their indexes whereas the merge method is more versatile and allows us to specify columns beside the index to join on for both dataframes.


1 Answers

Alright guys, so the answer is actually pretty simple.

What you need to do is group each dataframe (left, right) by the matching column(s), and then add a new counter column for each group.

Now you do the outer join and include the counter column, so you'll match on 0,1, but if the right has a 2 then it doesn't match. If the left only has 0, it will match the right but if the right has 0,1, then the right '1' entry doesn't match!

Edit: Code request.

I don't have anything handy, but it is very simple. If you have, say, 2 columns you're matching on ['amount','date'], then you simply do a

left_df['Helper'] = left_df.groupby(['amount','date']).cumcount()
right_df['RHelper'] = right_df.groupby(['amount','date']).cumcount()

Then use the Helper column in the join.

like image 164
user1610719 Avatar answered Sep 29 '22 11:09

user1610719