Let's say I have two dataframes, and the column names for both are:
table 1 columns:
[ShipNumber, TrackNumber, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]
I want to merge the two tables based on both ShipNumber and TrackNumber. However, if i simply use merge in the following way (pseudo code, not real code):
tab1.merge(tab2, "left", on=['ShipNumber','TrackNumber'])
then, that means the values in both ShipNumber and TrackNumber columns from both tables MUST MATCH.
However, in my case, sometimes the ShipNumber column values will match, sometimes the TrackNumber column values will match; as long as one of the two values match for a row, I want the merge to happen.
In other words, if row 1 ShipNumber in tab 1 matches row 3 ShipNumber in tab 2, but the TrackNumber in two tables for the two records do not match, I still want to match the two rows from the two tables.
So basically this is a either/or match condition (pesudo code):
if tab1.ShipNumber == tab2.ShipNumber OR tab1.TrackNumber == tab2.TrackNumber:
then merge
I hope my question makes sense... Any help is really really appreciated!
As suggested, I looked into this post: Python pandas merge with OR logic But it is not completely the same issue I think, as the OP from that post has a mapping file, and so they can simply do 2 merges to solve this. But I dont have a mapping file, rather, I have two df's with same key columns (ShipNumber, TrackNumber)
Use merge()
and concat()
. Then drop any duplicate cases where both A
and B
match (thanks @Scott Boston for that final step).
df1 = pd.DataFrame({'A':[3,2,1,4], 'B':[7,8,9,5]})
df2 = pd.DataFrame({'A':[1,5,6,4], 'B':[4,1,8,5]})
df1 df2
A B A B
0 3 7 0 1 4
1 2 8 1 5 1
2 1 9 2 6 8
3 4 5 3 4 5
With these data frames we should see:
df1.loc[0]
matches A
on df2.loc[0]
df1.loc[1]
matches B
on df2.loc[2]
df1.loc[3]
matches both A
and B
on df2.loc[3]
We'll use suffixes to keep track of what matched where:
suff_A = ['_on_A_match_1', '_on_A_match_2']
suff_B = ['_on_B_match_1', '_on_B_match_2']
df = pd.concat([df1.merge(df2, on='A', suffixes=suff_A),
df1.merge(df2, on='B', suffixes=suff_B)])
A A_on_B_match_1 A_on_B_match_2 B B_on_A_match_1 B_on_A_match_2
0 1.0 NaN NaN NaN 9.0 4.0
1 4.0 NaN NaN NaN 5.0 5.0
0 NaN 2.0 6.0 8.0 NaN NaN
1 NaN 4.0 4.0 5.0 NaN NaN
Note that the second and fourth rows are duplicate matches (for both data frames, A = 4
and B = 5
). We need to remove one of those sets.
dups = (df.B_on_A_match_1 == df.B_on_A_match_2) # also could remove A_on_B_match
df.loc[~dups]
A A_on_B_match_1 A_on_B_match_2 B B_on_A_match_1 B_on_A_match_2
0 1.0 NaN NaN NaN 9.0 4.0
0 NaN 2.0 6.0 8.0 NaN NaN
1 NaN 4.0 4.0 5.0 NaN NaN
I would suggest this alternate way for doing merge like this. This seems easier for me.
table1["id_to_be_merged"] = table1.apply(
lambda row: row["ShipNumber"] if pd.notnull(row["ShipNumber"]) else row["TrackNumber"], axis=1)
You can add the same column in table2
as well if needed and then use in left_in
or right_on
based on your requirement.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With