Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Resolve Pandas data frame merge conflicts with a function?

Tags:

python

pandas

Let's say I have two dataframes, which I would like to merge, but there is a conflict because rows and columns overlap. Instead of duplicating the rows, I would like to pass a function to resolve the conflict. Can this be done?

import numpy as np
import pandas as pd
dates1 = pd.date_range("2014-01-01", periods = 4)
dates2 = pd.date_range("2014-01-03", periods = 4)
cols1 = list("ABCD")
cols2 = list("CDEF")
df1 = pd.DataFrame(np.ones([4, 4], dtype = "bool"), index = dates1, columns = cols1)
df2 = pd.DataFrame(np.zeros([4, 4], dtype = "bool"), index = dates2, columns = cols2)

In [317]: df1
Out[317]: 
               A     B     C     D
2014-01-01  True  True  True  True
2014-01-02  True  True  True  True
2014-01-03  True  True  True  True
2014-01-04  True  True  True  True

In [318]: df2
Out[318]: 
                C      D      E      F
2014-01-03  False  False  False  False
2014-01-04  False  False  False  False
2014-01-05  False  False  False  False
2014-01-06  False  False  False  False

So as you can see, the two data frames overlap in columns C and D, and in rows 2014-01-03 and 2014-01-04. So now when I merge them I get repeated rows because of this conflict:

In [321]: pd.concat([df1, df2])
Out[321]: 
               A     B      C      D      E      F
2014-01-01  True  True   True   True    NaN    NaN
2014-01-02  True  True   True   True    NaN    NaN
2014-01-03  True  True   True   True    NaN    NaN
2014-01-04  True  True   True   True    NaN    NaN
2014-01-03   NaN   NaN  False  False  False  False
2014-01-04   NaN   NaN  False  False  False  False
2014-01-05   NaN   NaN  False  False  False  False
2014-01-06   NaN   NaN  False  False  False  False

When what I actually want is True values to override Falses (or NaN), which I could do, for example, with an "or" function passed to resolve such duplication conflicts. Can this be done in Pandas?

The result should look like this:

               A     B      C      D      E      F
2014-01-01  True  True   True   True    NaN    NaN
2014-01-02  True  True   True   True    NaN    NaN
2014-01-03  True  True   True   True  False  False
2014-01-04  True  True   True   True  False  False
2014-01-05   NaN   NaN  False  False  False  False
2014-01-06   NaN   NaN  False  False  False  False

That is, where there is no duplication, the value in the two data frames comes through, where there is no data in either frame, a NaN is returned, but where there is data in both frames, True overrides False (that is, "or").

I am looking for a general solution for arbtraging between conflicts when merging Pandas DataFrames, preferably via passed function.

like image 647
Thomas Browne Avatar asked Jul 31 '14 23:07

Thomas Browne


3 Answers

Instead of using concat use merge:

>> pd.merge(df1, df2, on=(df1.columns & df2.columns).tolist(), how='outer', left_index=True, right_index=True)
               A     B      C      D      E      F
2014-01-01  True  True   True   True    NaN    NaN
2014-01-02  True  True   True   True    NaN    NaN
2014-01-03  True  True   True   True  False  False
2014-01-04  True  True   True   True  False  False
2014-01-05   NaN   NaN  False  False  False  False
2014-01-06   NaN   NaN  False  False  False  False

The on=(df1.columns & df2.columns).tolist() argument gives you a list of overlapping columns (in this case ['C','D'])

The how='outer' does a union of keys from both frames (SQL: full outer join)

The left_index=True and the right_index=True keep the row indexes intact

like image 194
dloman Avatar answered Oct 18 '22 18:10

dloman


This should work for what you want to do:

def conflict_resolver(x):

    # If there is only one row, just return it as is
    if x.shape[0] == 1:
        return x
    # If all values are nan, just return the first row
    elif x.isna().all():
        return x[:1]
    else:
        # Remove na values and drop duplicates
        x = x.dropna().drop_duplicates()

        # If only 1 row of non-na data exists, just return it
        if x.shape[0] == 1:
            return x
        else:

            # Handle conflicts here:
            
            if isinstance(x, bool):
                x.iloc[0] = x.any()
                return x[:1]

concat_df = pd.concat([df1, df2]).reset_index(drop=False).groupby(by='index').agg(conflict_resolver)
like image 28
Garrett Argenna Avatar answered Oct 18 '22 19:10

Garrett Argenna


This question was found when having a similar need to combine columns with a simple conflict resolution: Values in one column override those of another. Versus create and pass in a resolving function, pandas provides a helper Series.combine_first(other) that picks the value of the caller over that of other.

like image 1
Rich Andrews Avatar answered Oct 18 '22 17:10

Rich Andrews