Pandas: merging on multiple columns not working, and alternatives to a regular merge

I have an old dataframe (dfo) to which someone added extra note columns, and the data set has no key column. I also have a new dataframe (dfn) that is supposed to represent the same data but does not have the note columns. I was asked to transfer the old notes to the new dataframe. I have been able to get matches for some rows but not all. What I want to know is whether there are additional tricks for merging on multiple columns, or whether there are alternatives that might fit better.

Below is example data from the original csv that did not merge; oddly, when I place the same rows in the dictionaries below, the merge works just fine.

import numpy as np
import pandas as pd

example_new = {'S1': ['W', 'CD', 'W', 'W', 'CD', 'W', 'CD'], 
'DateTime': ['6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26'], 
'N1': ['N', 'Y', 'N', 'Y', 'N', 'N', 'N'], 
'AC': ['C253', '100', '1064', '1920', '1996', '100', 'C253'], 
'PS': ['C_041', 'C_041', 'C_041', 'C_041', 'C_041', 'C_041', 'C_041'], 
'TI': ['14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP'], 
'U': [' ', 'N', 'U/C', 'T', 'C', 'N', 'P'], 
'LN': ['Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie'], 
'R2': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}

example_old = {'S1': ['W', 'W', 'W', 'W', 'CD', 'CD'], 
'DateTime': ['6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26'],
'N1': ['N', 'Y', 'N', 'N', 'N', 'Y'], 
'AC': ['1064', '1920', 'C253', '100', 'C253', '100'], 
'PS': ['C_041', 'C_041', 'C_041', 'C_041', 'C_041', 'C_041'], 
'TI': ['14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP'], 
'U': ['U/C', 'T', ' ', 'N', 'P', 'N'], 
'LN': ['Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie'], 
'R2': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan], 
'Note1': ['Y', 'Y', 'Y', 'Y', 'N', 'N']}

dfo = pd.DataFrame.from_dict(example_old)
dfn = pd.DataFrame.from_dict(example_new)
dfn['DateTime'] = pd.to_datetime(dfn['DateTime'])
dfo['DateTime'] = pd.to_datetime(dfo['DateTime'])

The code:

dfo = dfo # shape (10250, 10) the additional columns are notes. 
# columns: ['S1', 'DateTime', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2', 'Note1']

dfn = dfn # shape (13790, 9) there are a lot of corrections to the prior data
# and additional new data.
# columns: ['S1', 'DateTime', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']

# To make sure that the dtypes are the same.
# I read that making sure the object columns are all strings works better. Really good tip!
str_col = ['S1', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']
dfo[str_col] = dfo[str_col].astype(str)
dfn[str_col] = dfn[str_col].astype(str)
dfo = dfo.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
dfn = dfn.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
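To illustrate why the stripping step matters, here is a minimal sketch (with made-up two-row frames, not the real data) of how a single trailing space in a key column silently breaks a multi-column merge, and how stripping repairs it:

```python
import pandas as pd

# hypothetical frames; 'AC' in `left` differs from `right` only by a trailing space
left = pd.DataFrame({'AC': ['100 ', 'C253'], 'U': ['N', 'P']})
right = pd.DataFrame({'AC': ['100', 'C253'], 'U': ['N', 'P'], 'Note1': ['Y', 'N']})

# without stripping, only the exact-match row picks up a note
raw = left.merge(right, how='left', on=['AC', 'U'])
print(raw['Note1'].tolist())  # [nan, 'N']

# after stripping the object columns, both rows merge
left_clean = left.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
clean = left_clean.merge(right, how='left', on=['AC', 'U'])
print(clean['Note1'].tolist())  # ['Y', 'N']
```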

# I read that encoding the columns might show characters that are hidden.
# I did not find this helpful for my data. 
# u = dfo.select_dtypes(include=[object])
# dfo[u.columns] = u.apply(lambda x: x.str.encode('utf-8'))
# u = dfn.select_dtypes(include=[object])
# dfn[u.columns] = u.apply(lambda x: x.str.encode('utf-8'))

# test / check the dtypes
otypes = dfo.columns.to_series().groupby(dfo.dtypes).groups
ntypes = dfn.columns.to_series().groupby(dfn.dtypes).groups

# display results... dtypes

In [74]: print(otypes)
Out[74]: {datetime64[ns]: ['DateTime'], 
       object: ['S1', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2', 'Note1']}

In [82]: print(ntypes)
Out[82]: {datetime64[ns]: ['DateTime'], 
       object: ['S1', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']}

# Time to merge 
subset = ['S1', 'DateTime', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']
dfm = pd.merge(dfn,dfo, how="left", on=subset)
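Since Note1 exists only in dfo, the share of merged rows carrying a non-null Note1 gives a quick match-rate figure after the left merge. A diagnostic sketch with toy stand-ins for dfn/dfo (hypothetical values, not the real csv data):

```python
import pandas as pd

# toy stand-ins for dfn/dfo, keyed on a reduced subset for illustration
dfn_toy = pd.DataFrame({'AC': ['100', '1920', '1996'], 'U': ['N', 'T', 'C']})
dfo_toy = pd.DataFrame({'AC': ['100', '1920'], 'U': ['N', 'T'], 'Note1': ['Y', 'Y']})

dfm_toy = dfn_toy.merge(dfo_toy, how='left', on=['AC', 'U'])

# rows of the new frame that found an old counterpart now carry a non-null Note1
match_rate = dfm_toy['Note1'].notna().mean()
print(f"{match_rate:.1%} of new rows matched")  # 66.7% of new rows matched
```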

About 75% of the data is merging. I have done spot checks and there is a lot more data that should merge but is not. What else should I do to get the remaining ~25% to merge? If you want to see the data in the csv files, I have included a link: Github to csv files

Shane S asked Sep 03 '25 17:09

2 Answers

I suggest doing an outer merge of both DataFrames with the indicator option set, to record each row's origin. Afterwards you can visualize the data interactively with plotly, or group on a subset of columns and explore the differences.

Get the DataFrames: Github to csv files

import pandas as pd
import os

path = "~/Downloads/merging_dataframes-main"
dfo = pd.read_csv(os.path.expanduser(os.path.join(path, "df_old.csv")))
dfn = pd.read_csv(os.path.expanduser(os.path.join(path, "df_new.csv")))

dfn['DateTime'] = pd.to_datetime(dfn['DateTime'])
dfo['DateTime'] = pd.to_datetime(dfo['DateTime'])

Do an outer merge of both DataFrames

df_merge = pd.merge(dfo, dfn, how='outer', indicator=True)

columns_of_interest = df_merge.columns.difference(['DateTime', '_merge'])
df_merge.sort_values(['DateTime', *columns_of_interest], ignore_index=True, inplace=True)

mask_matches = df_merge._merge == 'both'

# # in case you want to rename the indicator
# df_merge._merge = df_merge._merge.cat.rename_categories({'left_only': 'old', 'right_only': 'new'})
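The indicator column also gives a quick tally of the overlap. A minimal sketch with hypothetical values (the real code would call `df_merge['_merge'].value_counts()`):

```python
import pandas as pd

# hypothetical frames sharing only the 'AC' key column
old = pd.DataFrame({'AC': ['100', 'C253'], 'Note1': ['Y', 'N']})
new = pd.DataFrame({'AC': ['100', '1920']})

merged = pd.merge(old, new, how='outer', indicator=True)
counts = merged['_merge'].value_counts()
print(counts)
# both       -> present in old and new ('100')
# left_only  -> only in the old frame  ('C253')
# right_only -> only in the new frame  ('1920')
```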

Option 1: Visualize the merge

Run the code in a Jupyter notebook and zoom in to points of interest. The y-axis shows the indexes. Hover with your mouse over a point to see all of its parameters.

import plotly.express as px

px.scatter(df_merge[~mask_matches], 
           x='DateTime', 
           color='_merge',
           hover_data=df_merge[~mask_matches].columns.difference(['DateTime', '_merge']))

interactive plotly window

Option 2: Use groupby to get the difference

groupby searches for unique pairs in the defined columns. Here I use only 'DateTime', as it makes the most sense to me here, but you can also group on multiple columns. Afterwards, I create one DataFrame out of all groups that have both types listed in _merge, i.e. left_only and right_only.

gb = df_merge[~mask_matches].groupby('DateTime')
groups = []
i = 0
for group_i in gb.groups:
    df_i = gb.get_group(group_i).copy()
    if len(gb.get_group(group_i)._merge.unique()) != 1:
        df_i['group'] = i
        i += 1
        groups.append(df_i)
        
df_same_time = pd.concat(groups)

    S1  DateTime             N1  AC    PS     TI        U    LN     R2   Note1  _merge      group
40  CD  2018-02-01 11:37:00  N   1005  C_031  15-85-SR  U/L  Eaton  nan  Y      left_only   0
41  W   2018-02-01 11:37:00  N   1930  C_031  15-85-SR  M/L  Eaton  nan  nan    right_only  0
46  CD  2018-02-01 15:15:00  N   1005  C_007  15-82-SR  U/L  Eaton  nan  Y      left_only   1
47  W   2018-02-01 15:15:00  N   1930  C_007  15-82-SR  M/L  Eaton  nan  nan    right_only  1
...
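The loop above can also be written more compactly with `groupby(...).filter`, which keeps only the groups whose `_merge` column contains more than one category. A sketch with a toy frame standing in for `df_merge[~mask_matches]`:

```python
import pandas as pd

# toy stand-in: two timestamps, only 't1' has rows from both frames
toy = pd.DataFrame({
    'DateTime': ['t1', 't1', 't2'],
    '_merge': ['left_only', 'right_only', 'left_only'],
})

# keep only timestamps where old-only and new-only rows coexist
df_same_time = toy.groupby('DateTime').filter(lambda g: g['_merge'].nunique() > 1)
print(df_same_time)  # only the two 't1' rows survive
```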

or explore the differences

for group_i in gb.groups:
    df_i = gb.get_group(group_i)
    if len(df_i._merge.unique()) != 1:
        print('\n', df_i["DateTime"].iloc[0])
        print(f'_merge\t: {df_i["_merge"].values.astype(str)}')
        for i in df_i.columns.difference(['Note1', '_merge']):
            if len(df_i[i].unique()) > 1:
                print(f'{i}\t: {df_i[i].values}')

where you see this entry quite a lot

_merge  : ['left_only' 'right_only']
AC      : ['1005' '1930']
S1      : ['CD' 'W']
U       : ['U/L' 'M/L']

and/or visualize it with plotly again

px.scatter(df_same_time, 
           x='DateTime', 
           color='_merge',
           hover_data=df_same_time.columns.difference(['DateTime', '_merge']))

kho answered Sep 06 '25 12:09

If I understand correctly, the '25% that do not merge' refers to the length of the data frame that results from an inner join:

print('inner', pd.merge(dfn, dfo, how="inner", on=subset).shape)  # (7445, 10)

Which means that roughly 2700 rows of the old data frame are not merged with the new data frame.

It seems to me your code works fine. It is just that these 'missing' rows in the old data frame have no counterpart in the new data frame.

If you create a data frame of the data that did not merge, this can be checked:

dfi = pd.merge(dfn, dfo, how="inner", on=subset)
dfi['inner'] = 1
df_o_nm = pd.merge(dfo, dfi, how="left", on=subset)
df_o_nm = df_o_nm.loc[df_o_nm['inner'] != 1][subset]  # not-merged data from old data frame
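The same 'not merged' frame can also be built in one step with an anti-join using `indicator=True`. A sketch with hypothetical toy frames and a reduced key list (the real code would use the full `subset` and dfo/dfn):

```python
import pandas as pd

subset_toy = ['AC', 'U']  # stand-in for the full key list
dfo_toy = pd.DataFrame({'AC': ['100', '1005'], 'U': ['N', 'U/L'], 'Note1': ['Y', 'Y']})
dfn_toy = pd.DataFrame({'AC': ['100'], 'U': ['N']})

# left merge with indicator; 'left_only' marks old rows with no new counterpart
m = pd.merge(dfo_toy, dfn_toy, how='left', on=subset_toy, indicator=True)
df_o_nm_toy = m.loc[m['_merge'] == 'left_only', subset_toy]
print(df_o_nm_toy)  # only the '1005' row remains
```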

Take for example the first row in df_o_nm:

S1 DateTime            N1    AC     PS        TI    U     LN   R2
CD 2018-02-01 11:37:00  N  1005  C_031  15-85-SR  U/L  Eaton  nan

If only a subset of columns is compared (specifically S1, N1, AC, PS, TI, U), exactly one potential row can be found in dfn, which has a different DateTime and LN value:

S1 DateTime             N1    AC     PS        TI    U   LN   R2
CD 2020-07-01 12:59:00  N    1005  C_031  15-85-SR  U/L  Bob  nan

Differences between the two data frames seem to be mostly caused by the columns DateTime, LN, PS and TI. So a more thorough (but still simple) comparison can be done like this:

for index, row in df_o_nm.iterrows():
    df_sel = dfn.loc[
        (dfn['S1']==row['S1']) & 
        (dfn['N1']==row['N1']) & 
        (dfn['AC']==row['AC']) & 
        (dfn['U']==row['U'])]
    if len(df_sel) == 0:
        print('no matching data subset found.')
    else:
        print(f'{len(df_sel)} rows matching subset of columns found')
        for idx, row_sel in df_sel.iterrows():
            for col in ['DateTime', 'LN', 'PS', 'TI']:
                if row[col] != row_sel[col]:
                    print(f"{idx}:   {col}: {row[col]} --> {row_sel[col]}")
            print('---')

rosa b. answered Sep 06 '25 13:09