Python pandas merge with OR logic

I've searched but haven't found an answer to this question: can you merge pandas DataFrames using OR logic? Basically, I want the equivalent of a SQL join using "where t1.A = t2.A OR t1.A = t2.B".

I have a situation where I am pulling information from one database into a dataframe (df1) and I need to merge it with information from another database, which I pulled into another dataframe (df2), merging based on a single column (col1). If these always used the same value when they matched, it would be very straightforward. The situation I have is that sometimes they match and sometimes they use a synonym. There is a third database that has a table that provides a lookup between synonyms for this data entity (col1 and col1_alias), which could be pulled into a third dataframe (df3). What I am looking to do is merge the columns I need from df1 and the columns I need from df2.

As stated above, in cases where df1.col1 and df2.col1 match, this would work...

df = df1.merge(df2, on='col1', how='left')

However, they don't always have the same value; sometimes they use synonyms. I thought about filtering df3 to the rows where df3.col1 or df3.col1_alias is in df1.col1, then building a single list of values from df3.col1 and df3.col1_alias (list1) and selecting the rows of df2 where df2.col1 is in list1. This would give me the rows from df2 that I need, but it still wouldn't let me merge df1 and df2 on the appropriate rows. I think if there were an OR merge option, I could step through this and make it work, but all of the following threw a syntax error:

df = df1.merge((df3, left_on='col1', right_on='col1', how='left')|(df3, left_on='col1', right_on='col1_alias', how='left'))

and

df = df1.merge(df3, (left_on='col1', right_on='col1')|(left_on='col1', right_on='col1_alias'), how='left')

and

df = df1.merge(df3, left_on='col1', right_on='col1'|right_on='col1_alias', how='left')

and several other variations. Any guidance on how to perform an OR merge or suggestions on a completely different approach to merging df1 and df2 using the synonyms in two columns in df3?

stlouismv asked May 11 '17 21:05


1 Answer

I think I would do this as two merges:

In [10]: import pandas as pd

In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["A", "B"])

In [12]: df2 = pd.DataFrame([[1, 7], [2, 8], [4, 9]], columns=["C", "D"])

In [13]: res = df.merge(df2, left_on="B", right_on="C", how="left")

In [14]: res.update(df.merge(df2, left_on="A", right_on="C", how="left"))

In [15]: res
Out[15]:
   A  B    C    D
0  1  2  1.0  7.0
1  3  4  4.0  9.0
2  5  6  NaN  NaN

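As you can see, for the first row this picks the match on A (A = 1 -> D = 7) rather than the match on B (B = 2 -> D = 8), because update overwrites the B-based values wherever the A-based merge found a match.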
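If you want every match kept, as a SQL OR join would return, rather than one match per row, one possible sketch is to take the union of two inner merges. This is an illustration on the same toy frames, not part of the answer's method. The names on_a, on_b and either are made up here. It assumes the inputs contain no genuinely duplicated rows, since drop_duplicates would collapse those too, and it drops left rows that have no match at all.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["A", "B"])
df2 = pd.DataFrame([[1, 7], [2, 8], [4, 9]], columns=["C", "D"])

# One inner join per branch of the OR condition.
on_a = df.merge(df2, left_on="A", right_on="C")  # rows where A == C
on_b = df.merge(df2, left_on="B", right_on="C")  # rows where B == C

# Union of both joins; drop_duplicates removes pairs that satisfied
# both conditions and therefore appear in both results.
either = pd.concat([on_a, on_b], ignore_index=True).drop_duplicates()

print(either)
```

Here the row A = 1, B = 2 appears twice in the result, once paired with D = 7 and once with D = 8, while the unmatched row A = 5, B = 6 is absent.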

Note: To make this more extensible (e.g. when matching on different columns), it might make sense to pull out a single column, although both merges use the same column in this example:

In [21]: res = df.merge(df2, left_on="B", right_on="C", how="left")["C"]

In [22]: res.update(df.merge(df2, left_on="A", right_on="C", how="left")["C"])

In [23]: res
Out[23]:
0    1.0
1    4.0
2    NaN
Name: C, dtype: float64
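For the setup described in the question (df1, df2, and a synonym lookup df3 with col1 and col1_alias), a different approach is to normalize both key columns to one canonical name and then do a single ordinary merge. The following is only a minimal sketch under stated assumptions. The sample values, the helper name canonical, and the temporary key column are all made up for illustration. It assumes each alias maps to exactly one canonical value and that df3 lists each pair with the canonical name in col1.

```python
import pandas as pd

# Hypothetical sample data: column names follow the question, values are invented.
df1 = pd.DataFrame({"col1": ["apple", "pear", "kiwi"], "x": [1, 2, 3]})
df2 = pd.DataFrame({"col1": ["apple", "pyrus", "fig"], "y": [10, 20, 30]})
df3 = pd.DataFrame({"col1": ["pear"], "col1_alias": ["pyrus"]})

# Lookup from alias to canonical name (assumes one canonical name per alias).
alias_to_canonical = dict(zip(df3["col1_alias"], df3["col1"]))


def canonical(series):
    """Replace known aliases; values without a lookup entry are kept as-is."""
    return series.map(alias_to_canonical).fillna(series)


merged = (
    df1.assign(key=canonical(df1["col1"]))
    .merge(
        # Drop df2's own col1 so only df1's spelling survives in the result.
        df2.assign(key=canonical(df2["col1"])).drop(columns="col1"),
        on="key",
        how="left",
    )
    .drop(columns="key")
)

print(merged)
```

With this sample data, "pear" in df1 lines up with "pyrus" in df2 through the lookup, "apple" matches directly, and "kiwi" has no partner, so its y value is NaN.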
Andy Hayden answered Nov 06 '22 17:11