Merge and update dataframes based on a subset of their columns

Is there a faster way to replace the two for loops below, assuming the DataFrames could be very large? In my real case, each DataFrame has 200 rows and 25 columns.

import numpy as np
import pandas as pd

# np.array stores every cell as a string, and the header is just row 0 of the data
data_df1 = np.array([['Name','Unit','Attribute','Date'],['a','A',1,2014],['b','B',2,2015],['c','C',3,2016],
                 ['d','D',4,2017],['e','E',5,2018]])
data_df2 = np.array([['Name','Unit','Date'],['a','F',2019],['b','G',2020],['e','H',2021],
                 ['f','I',2022]])
df1 = pd.DataFrame(data=data_df1)
print('df1:')
print(df1)
df2 = pd.DataFrame(data=data_df2)
print('df2:')
print(df2)
row_df1 = [1,2,5]
col_df1 = [1,3]
row_df2 = [1,2,3]
col_df2 = [1,2]
for i in range(len(row_df1)):
    for j in range(len(col_df1)):
        # copy one cell at a time; .at replaces the removed DataFrame.set_value
        df1.at[row_df1[i], col_df1[j]] = df2.loc[row_df2[i], col_df2[j]]
print('df1 after operation:')
print(df1)

Expected output:

df1:
      0     1          2     3
0  Name  Unit  Attribute  Date
1     a     A          1  2014
2     b     B          2  2015
3     c     C          3  2016
4     d     D          4  2017
5     e     E          5  2018
df2:
      0     1     2
0  Name  Unit  Date
1     a     F  2019
2     b     G  2020
3     e     H  2021
4     f     I  2022
df1 after operation:
      0     1          2     3
0  Name  Unit  Attribute  Date
1     a     F          1  2019
2     b     G          2  2020
3     c     C          3  2016
4     d     D          4  2017
5     e     H          5  2021

I have tried:

df1.loc[[1,2,5],[1,3]] = df2.loc[[1,2,3],[1,2]]
print('df1:')
print(df1)
print('df2:')
print(df2)

but the result is the following, with unexpected NaN values:

df1:
      0     1          2     3
0  Name  Unit  Attribute  Date
1     a     F          1   NaN
2     b     G          2   NaN
3     c     C          3  2016
4     d     D          4  2017
5     e   NaN          5   NaN
df2:
      0     1     2
0  Name  Unit  Date
1     a     F  2019
2     b     G  2020
3     e     H  2021
4     f     I  2022

Thanks in advance to whoever helps.

Asked by John on Sep 19 '25 at 17:09


1 Answer

First, some cleaning: promote the first row to column headers and drop it from the data.

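def clean_df(df):
    # use the first row as the column labels
    df.columns = df.iloc[0]
    df.columns.name = None
    # drop the header row; reset_index() keeps the old row labels in an 'index' column
    df = df.iloc[1:].reset_index()

    return df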

df1 = clean_df(df1)
df1
   index Name Unit Attribute  Date
0      1    a    A         1  2014
1      2    b    B         2  2015
2      3    c    C         3  2016
3      4    d    D         4  2017
4      5    e    E         5  2018

df2 = clean_df(df2)
df2    
   index Name Unit  Date
0      1    a    F  2019
1      2    b    G  2020
2      3    e    H  2021
3      4    f    I  2022

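Use merge with how='left' and on='Name', so rows are matched on Name only and every row of df1 is kept. The overlapping Unit and Date columns come back with _x (from df1) and _y (from df2) suffixes. Keep the _y versions, strip the suffix, and use fillna(df1) to restore the original df1 values for names that have no match in df2.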

cols = ['Name', 'Unit_y', 'Attribute', 'Date_y']
df1 = df1.merge(df2, how='left', on='Name')[cols]\
              .rename(columns=lambda x: x.split('_')[0]).fillna(df1)

df1
  Name Unit Attribute  Date
0    a    F         1  2019
1    b    G         2  2020
2    c    C         3  2016
3    d    D         4  2017
4    e    H         5  2021
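
As for the NaN values in the attempted one-liner: when the right-hand side of a .loc assignment is a DataFrame, pandas aligns it to the target by row and column labels rather than by position. Here the source labels (rows 1, 2, 3; columns 1, 2) only partly overlap the target labels (rows 1, 2, 5; columns 1, 3), which is consistent with the NaN pattern shown in the question. A minimal sketch, assuming the row and column lists are meant to correspond by position as in the original loops, is to strip the labels with to_numpy() before assigning. It works on the raw frames from the question, before clean_df is applied:

```python
import numpy as np
import pandas as pd

# Same data as in the question; np.array stores every cell as a string.
data_df1 = np.array([['Name', 'Unit', 'Attribute', 'Date'],
                     ['a', 'A', 1, 2014], ['b', 'B', 2, 2015], ['c', 'C', 3, 2016],
                     ['d', 'D', 4, 2017], ['e', 'E', 5, 2018]])
data_df2 = np.array([['Name', 'Unit', 'Date'],
                     ['a', 'F', 2019], ['b', 'G', 2020], ['e', 'H', 2021],
                     ['f', 'I', 2022]])
df1 = pd.DataFrame(data=data_df1)
df2 = pd.DataFrame(data=data_df2)

# Converting the right-hand side to a plain NumPy array drops its labels,
# so the 3x2 block is written by position instead of being aligned by label.
df1.loc[[1, 2, 5], [1, 3]] = df2.loc[[1, 2, 3], [1, 2]].to_numpy()
print(df1)
```

This writes the whole block in one step, but it relies on knowing the matching row positions up front. The merge above matches rows by Name instead, so it does not need the position lists.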
Answered by cs95 on Sep 21 '25 at 07:09