Merge and update dataframes based on a subset of their columns

Question

Is there a faster way to replace the two for loops below, assuming the dataframes are very large? In my real case, each dataframe has 200 rows and 25 columns.

import numpy as np
import pandas as pd

data_df1 = np.array([['Name','Unit','Attribute','Date'],['a','A',1,2014],['b','B',2,2015],['c','C',3,2016],
                 ['d','D',4,2017],['e','E',5,2018]])
data_df2 = np.array([['Name','Unit','Date'],['a','F',2019],['b','G',2020],['e','H',2021],
                 ['f','I',2022]])
df1 = pd.DataFrame(data=data_df1)
print('df1:')
print(df1)
df2 = pd.DataFrame(data=data_df2)
print('df2:')
print(df2)
# Row/column labels to overwrite in df1, and the matching source labels in df2
row_df1 = [1,2,5]
col_df1 = [1,3]
row_df2 = [1,2,3]
col_df2 = [1,2]
# Copy cell by cell (DataFrame.set_value was removed in pandas 1.0; .at replaces it)
for i in range(len(row_df1)):
    for j in range(len(col_df1)):
        df1.at[row_df1[i], col_df1[j]] = df2.at[row_df2[i], col_df2[j]]
print('df1 after operation:')
print(df1)

Expected output:

df1:
      0     1          2     3
0  Name  Unit  Attribute  Date
1     a     A          1  2014
2     b     B          2  2015
3     c     C          3  2016
4     d     D          4  2017
5     e     E          5  2018
df2:
      0     1     2
0  Name  Unit  Date
1     a     F  2019
2     b     G  2020
3     e     H  2021
4     f     I  2022
df1 after operation:
      0     1          2     3
0  Name  Unit  Attribute  Date
1     a     F          1  2019
2     b     G          2  2020
3     c     C          3  2016
4     d     D          4  2017
5     e     H          5  2021

I have tried:

df1.loc[[1,2,5],[1,3]] = df2.loc[[1,2,3],[1,2]]
print('df1:')
print(df1)
print('df2:')
print(df2)

but the outcome is the following, with unexpected NaN values:

df1:
      0     1          2     3
0  Name  Unit  Attribute  Date
1     a     F          1   NaN
2     b     G          2   NaN
3     c     C          3  2016
4     d     D          4  2017
5     e   NaN          5   NaN
df2:
      0     1     2
0  Name  Unit  Date
1     a     F  2019
2     b     G  2020
3     e     H  2021
4     f     I  2022
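
The NaN values are consistent with how pandas aligns labels: when the right-hand side of a .loc assignment is a DataFrame, values are matched by index and column label, not by position. Rows 1 and 2 and column 1 exist on both sides, but row 5 and column 3 of the target have no counterpart in the df2 selection (rows 1 to 3, columns 1 and 2), so those cells receive NaN. A minimal sketch of a positional assignment, assuming both selections have the same shape, strips the labels with to_numpy() first:

```python
import numpy as np
import pandas as pd

# Same data as in the question (np.array turns every value into a string).
df1 = pd.DataFrame(np.array([['Name', 'Unit', 'Attribute', 'Date'],
                             ['a', 'A', 1, 2014], ['b', 'B', 2, 2015],
                             ['c', 'C', 3, 2016], ['d', 'D', 4, 2017],
                             ['e', 'E', 5, 2018]]))
df2 = pd.DataFrame(np.array([['Name', 'Unit', 'Date'],
                             ['a', 'F', 2019], ['b', 'G', 2020],
                             ['e', 'H', 2021], ['f', 'I', 2022]]))

# to_numpy() drops the index and column labels, so the 3x2 block from df2
# is written into the 3x2 target block of df1 purely by position.
df1.loc[[1, 2, 5], [1, 3]] = df2.loc[[1, 2, 3], [1, 2]].to_numpy()
print(df1)
```

This replaces both loops with a single assignment, but it still relies on the hand-maintained row and column lists lining up correctly.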

Thanks in advance to whoever helps.

cs95 · Accepted Answer

Some cleaning first, to promote the first row to column headers:

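def clean_df(df):
    df = df.copy()                  # avoid mutating the caller's frame
    df.columns = df.iloc[0]         # first row holds the column names
    df.columns.name = None
    df = df.iloc[1:].reset_index()  # old row labels end up in an 'index' column
    return df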

df1 = clean_df(df1)
df1
   index Name Unit Attribute  Date
0      1    a    A         1  2014
1      2    b    B         2  2015
2      3    c    C         3  2016
3      4    d    D         4  2017
4      5    e    E         5  2018

df2 = clean_df(df2)
df2
   index Name Unit  Date
0      1    a    F  2019
1      2    b    G  2020
2      3    e    H  2021
3      4    f    I  2022

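Use merge with how='left' and on='Name', so rows are matched on Name only and every row of df1 is kept. Columns that exist in both frames come back with _x (df1) and _y (df2) suffixes; keep the _y versions, strip the suffix, and use fillna(df1) to restore the original values for names that have no match in df2.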

cols = ['Name', 'Unit_y', 'Attribute', 'Date_y']
df1 = df1.merge(df2, how='left', on='Name')[cols]\
              .rename(columns=lambda x: x.split('_')[0]).fillna(df1)

df1
  Name Unit Attribute  Date
0    a    F         1  2019
1    b    G         2  2020
2    c    C         3  2016
3    d    D         4  2017
4    e    H         5  2021
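
As an alternative sketch (not part of the original answer), DataFrame.update can overwrite the shared columns in place once both frames are indexed by Name. This assumes the Name values are unique in each frame; the data below is rebuilt from the cleaned tables above, and the variable name result is only illustrative.

```python
import pandas as pd

# Cleaned frames from above, rebuilt directly with proper column names.
df1 = pd.DataFrame({'Name': list('abcde'),
                    'Unit': list('ABCDE'),
                    'Attribute': [1, 2, 3, 4, 5],
                    'Date': [2014, 2015, 2016, 2017, 2018]})
df2 = pd.DataFrame({'Name': list('abef'),
                    'Unit': list('FGHI'),
                    'Date': [2019, 2020, 2021, 2022]})

# Index both frames by the key so update() aligns rows on Name.
result = df1.set_index('Name')

# update() modifies result in place: only labels and columns already present
# in result are touched, so 'f' is ignored and 'Attribute' is left alone.
result.update(df2.set_index('Name'))

result = result.reset_index()
print(result)
```

Unlike the merge, this needs no suffix handling or fillna step, at the cost of an in-place mutation that returns None.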

Tags: python, pandas, dataframe

John
