I have an old dataframe (dfo) that someone decided to add additional columns (notes) to, and this data set does not have a key column. I also have a new dataframe (dfn) that is supposed to represent the same data but does not have the notes columns. I was asked to transfer the old notes to the new dataframe. I have been able to get matches for some rows but not all. What I want to know is whether there are additional tricks to merging on multiple columns, or whether there are alternatives that might fit better.
Below is example data from the original CSV that did not merge; oddly, when I place it in the dictionaries below, it merges just fine.
import numpy as np
import pandas as pd

example_new = {'S1': ['W', 'CD', 'W', 'W', 'CD', 'W', 'CD'],
'DateTime': ['6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26'],
'N1': ['N', 'Y', 'N', 'Y', 'N', 'N', 'N'],
'AC': ['C253', '100', '1064', '1920', '1996', '100', 'C253'],
'PS': ['C_041', 'C_041', 'C_041', 'C_041', 'C_041', 'C_041', 'C_041'],
'TI': ['14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP'],
'U': [' ', 'N', 'U/C', 'T', 'C', 'N', 'P'],
'LN': ['Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie'],
'R2': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]}
example_old = {'S1': ['W', 'W', 'W', 'W', 'CD', 'CD'],
'DateTime': ['6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26', '6/9/2021 13:26'],
'N1': ['N', 'Y', 'N', 'N', 'N', 'Y'],
'AC': ['1064', '1920', 'C253', '100', 'C253', '100'],
'PS': ['C_041', 'C_041', 'C_041', 'C_041', 'C_041', 'C_041'],
'TI': ['14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP', '14-2-EP'],
'U': ['U/C', 'T', ' ', 'N', 'P', 'N'],
'LN': ['Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie', 'Eddie'],
'R2': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'Note1': ['Y', 'Y', 'Y', 'Y', 'N', 'N']}
dfo = pd.DataFrame.from_dict(example_old)
dfn = pd.DataFrame.from_dict(example_new)
dfn['DateTime'] = pd.to_datetime(dfn['DateTime'])
dfo['DateTime'] = pd.to_datetime(dfo['DateTime'])
The code:
dfo = dfo # shape (10250, 10) the additional columns are notes.
# columns: ['S1', 'DateTime', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2', 'Note1']
dfn = dfn # shape (13790, 9) there are a lot of corrections to the prior data
# and additional new data.
# columns: ['S1', 'DateTime', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']
# to make sure that the dtypes are the same.
# I read that making sure the object columns are all strings works better. Really Good tip!!
str_col = ['S1', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']
dfo[str_col] = dfo[str_col].astype(str)
dfn[str_col] = dfn[str_col].astype(str)
dfo = dfo.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
dfn = dfn.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
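Note that astype(str) turns NaN into the literal string 'nan', so if the two frames encode missing values differently (e.g. empty string vs. NaN), such rows can never match. A minimal sketch of normalizing them before the cast (the replacement mapping is an assumption; adjust it to your data):
# Sketch: unify missing-value encodings before casting to str, so both frames
# end up with the same 'nan' token (alternative to the astype/strip steps above).
for df in (dfo, dfn):
    df[str_col] = (df[str_col]
                   .replace({'': np.nan})
                   .astype(str)
                   .apply(lambda s: s.str.strip()))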
# I read that encoding the columns might show characters that are hidden.
# I did not find this helpful for my data.
# u = dfo.select_dtypes(include=[object])
# dfo[u.columns] = u.apply(lambda x: x.str.encode('utf-8'))
# u = dfn.select_dtypes(include=[object])
# dfn[u.columns] = u.apply(lambda x: x.str.encode('utf-8'))
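A lighter-weight alternative sketch for spotting hidden characters (it assumes the str_col columns were already cast to str as above) is to print the repr of any values containing non-printable ASCII:
# Sketch: flag values containing characters outside printable ASCII.
for col in str_col:
    hidden = dfo.loc[dfo[col].str.contains(r'[^\x20-\x7E]', regex=True, na=False), col]
    if not hidden.empty:
        print(col, hidden.map(repr).unique()[:5])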
# test / check the dtypes
otypes = dfo.columns.to_series().groupby(dfo.dtypes).groups
ntypes = dfn.columns.to_series().groupby(dfn.dtypes).groups
# display results... dtypes
In [95]: print(otypes)
{datetime64[ns]: ['DateTime'],
 object: ['S1', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2', 'Note1']}

In [96]: print(ntypes)
{datetime64[ns]: ['DateTime'],
 object: ['S1', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']}
# Time to merge
subset = ['S1', 'DateTime', 'N1', 'AC', 'PS', 'TI', 'U', 'LN', 'R2']
dfm = pd.merge(dfn, dfo, how="left", on=subset)
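For reference, a quick way to quantify how much matched is a left merge with the indicator flag (a small sketch):
# Count matched vs. unmatched rows of the left merge.
check = pd.merge(dfn, dfo, how="left", on=subset, indicator=True)
print(check['_merge'].value_counts())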
About 75% of the data is merging. I have done spot checks, and there is a lot more data that could merge but is not. What else should I do to get the remaining ~25% to merge? If you want to see the data in the CSV files, I have included a link: Github to csv files
I suggest doing an outer merge of both DataFrames with the indicator option set, to get the origin of each row. Afterwards you can visualize the data interactively with plotly, or you can group on a subset of columns and explore the differences.
import pandas as pd
import os
path = "~/Downloads/merging_dataframes-main"
dfo = pd.read_csv(os.path.expanduser(os.path.join(path, "df_old.csv")))
dfn = pd.read_csv(os.path.expanduser(os.path.join(path, "df_new.csv")))
dfn['DateTime'] = pd.to_datetime(dfn['DateTime'])
dfo['DateTime'] = pd.to_datetime(dfo['DateTime'])
outer merge of both DataFrames

df_merge = pd.merge(dfo, dfn, how='outer', indicator=True)
columns_of_interest = df_merge.columns.difference(['DateTime', '_merge'])
df_merge.sort_values(['DateTime', *columns_of_interest], ignore_index=True, inplace=True)
mask_matches = df_merge._merge == 'both'
# # in case you want to rename the indicator
# df_merge._merge = df_merge._merge.cat.rename_categories({'left_only': 'old', 'right_only': 'new'})
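Before plotting, counting the indicator categories gives a quick overview of the proportions (a small sketch):
# How many rows matched, and how many are unique to each frame?
print(df_merge['_merge'].value_counts())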
Run the code in a Jupyter notebook and zoom in to some points of interest. The y-axis shows the indexes. Hover with your mouse over a point to get all parameters of that point.
import plotly.express as px
px.scatter(df_merge[~mask_matches],
           x='DateTime',
           color='_merge',
           hover_data=df_merge[~mask_matches].columns.difference(['DateTime', '_merge']))
groupby to get the difference

groupby searches for unique pairs in the defined columns. Here I use only 'DateTime', as it makes the most sense for me here. But you can also do it with multiple columns. Afterwards, I create one DataFrame out of all groups which have both types listed in _merge, i.e. left_only and right_only.
gb = df_merge[~mask_matches].groupby('DateTime')
groups = []
i = 0
for group_i in gb.groups:
    df_i = gb.get_group(group_i).copy()
    if len(df_i._merge.unique()) != 1:
        df_i['group'] = i
        i += 1
        groups.append(df_i)
df_same_time = pd.concat(groups)
|    | S1 | DateTime            | N1 | AC   | PS    | TI       | U   | LN    | R2  | Note1 | _merge     | group |
|----|----|---------------------|----|------|-------|----------|-----|-------|-----|-------|------------|-------|
| 40 | CD | 2018-02-01 11:37:00 | N  | 1005 | C_031 | 15-85-SR | U/L | Eaton | nan | Y     | left_only  | 0     |
| 41 | W  | 2018-02-01 11:37:00 | N  | 1930 | C_031 | 15-85-SR | M/L | Eaton | nan | nan   | right_only | 0     |
| 46 | CD | 2018-02-01 15:15:00 | N  | 1005 | C_007 | 15-82-SR | U/L | Eaton | nan | Y     | left_only  | 1     |
| 47 | W  | 2018-02-01 15:15:00 | N  | 1930 | C_007 | 15-82-SR | M/L | Eaton | nan | nan   | right_only | 1     |
| ... |
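The same selection can be written more compactly with GroupBy.filter (a sketch; note it skips the running group counter built above):
# Keep only the timestamps that occur on both sides of the outer merge.
df_same_time_alt = (df_merge[~mask_matches]
                    .groupby('DateTime')
                    .filter(lambda g: g['_merge'].nunique() > 1))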
or explore the differences:
for group_i in gb.groups:
    df_i = gb.get_group(group_i)
    if len(df_i._merge.unique()) != 1:
        print('\n', df_i["DateTime"].iloc[0])
        print(f'_merge\t: {df_i["_merge"].values.astype(str)}')
        for i in df_i.columns.difference(['Note1', '_merge']):
            if len(df_i[i].unique()) > 1:
                print(f'{i}\t: {df_i[i].values}')
where you see this kind of entry quite a lot:
_merge : ['left_only' 'right_only']
AC : ['1005' '1930']
S1 : ['CD' 'W']
U : ['U/L' 'M/L']
px.scatter(df_same_time,
           x='DateTime',
           color='_merge',
           hover_data=df_same_time.columns.difference(['DateTime', '_merge']))
If I understand correctly, the '25% that do not merge' refers to the length of the data frame that results from an inner join:
print('inner', pd.merge(dfn, dfo, how="inner", on=subset).shape) # (7445, 10)
Which means that roughly 2700 rows of the old data frame are not merged with the new data frame.
It seems to me your code works fine. It is just that these ‘missing’ rows in the old data frame do not have a counterpart in the new data frame.
If you create a data frame of the data that did not merge, this can be checked:
dfi = pd.merge(dfn, dfo, how="inner", on=subset)
dfi['inner'] = 1
df_oi = pd.merge(dfo, dfi, how="left", on=subset)
df_o_nm = df_oi.loc[df_oi['inner'] != 1][subset] # not merged data from old data frame
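As an aside, the same set of unmatched old rows can be built in one step with merge's indicator option (a sketch of the standard anti-join pattern):
# Left merge with indicator, then keep rows that exist only in dfo.
anti = pd.merge(dfo[subset], dfn[subset].drop_duplicates(),
                how="left", on=subset, indicator=True)
df_o_nm_alt = anti.loc[anti['_merge'] == 'left_only', subset]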
Take for example the first row in df_o_nm:
S1 DateTime N1 AC PS TI U LN R2
CD 2018-02-01 11:37:00 N 1005 C_031 15-85-SR U/L Eaton nan
If only a subset of columns is compared (specifically S1, N1, AC, PS, TI, U), exactly one potential row can be found in dfn, which has a different DateTime and LN value:
S1 DateTime N1 AC PS TI U LN R2
CD 2020-07-01 12:59:00 N 1005 C_031 15-85-SR U/L Bob nan
Differences in the 2 data frames seem to be mostly caused by the columns DateTime, LN, PS and TI. So a more thorough (still simple) comparison can be done like this:
for index, row in df_o_nm.iterrows():
    df_sel = dfn.loc[
        (dfn['S1'] == row['S1']) &
        (dfn['N1'] == row['N1']) &
        (dfn['AC'] == row['AC']) &
        (dfn['U'] == row['U'])]
    if len(df_sel) == 0:
        print('no matching data subset found.')
    else:
        print(f'{len(df_sel)} rows matching subset of columns found')
        for idx, row_sel in df_sel.iterrows():
            for col in ['DateTime', 'LN', 'PS', 'TI']:
                if row[col] != row_sel[col]:
                    print(f"{idx}: {col}: {row[col]} --> {row_sel[col]}")
        print('---')
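An equivalent, vectorized variant of this row-by-row check (a sketch; it merges on the columns assumed stable above and compares the rest side by side):
# Match the unmatched old rows to dfn on the presumed-stable columns,
# then report which of the remaining columns actually changed.
cand = pd.merge(df_o_nm, dfn, how="inner",
                on=['S1', 'N1', 'AC', 'U'], suffixes=('_old', '_new'))
for col in ['DateTime', 'LN', 'PS', 'TI']:
    changed = cand[cand[f'{col}_old'] != cand[f'{col}_new']]
    print(f'{col}: {len(changed)} candidate rows differ')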