I have two dataframes: df1 has columns A, B, C, D... and df2 has columns A, B, E, F...
The keys I want to merge on are in column A. B is also (most likely) the same in both dataframes, though this is a big data set I am still cleaning, so I do not have a very good overview of everything yet.
I do
merge(df1, df2, on='A')
And the result contains a column called B_x. Since the data set is big and messy, I haven't tried to investigate how B_x differs from B in df1 and B in df2.
So my question is just in general: what does Pandas mean when it appends _x to a column name in the merged dataframe?
Thank you
Method #2 – Change the Suffix and Drop the Duplicates
One of the parameters of merge lets you apply your own set of suffixes to duplicate columns. This means you label the second DataFrame's columns with a keyword that you then use to identify and remove them from the merged DataFrame.
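A minimal sketch of that suffix-and-drop idea, assuming small made-up frames shaped like the ones in the question. The "_dup" label and the sample values are arbitrary choices for illustration, not anything pandas prescribes.

```python
import pandas as pd

# Hypothetical sample data: A is the key, B appears in both frames.
df1 = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"], "C": [10, 20, 30]})
df2 = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"], "E": [0.1, 0.2, 0.3]})

# Leave left-hand names untouched and tag right-hand clashes with "_dup".
merged = pd.merge(df1, df2, on="A", suffixes=("", "_dup"))

# Drop every column that carries the tag.
dup_cols = [c for c in merged.columns if c.endswith("_dup")]
deduped = merged.drop(columns=dup_cols)

print(list(merged.columns))
print(list(deduped.columns))
```

Note that this discards the right-hand copy without checking that it matches the left-hand one, so it is only safe once you know the two B columns agree.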
To concatenate DataFrames, use the concat() function; to remove duplicate rows afterwards, use the drop_duplicates() method.
Use the merge() function to join the left dataframe with the unique-column dataframe using an 'inner' join. This will ensure that no columns are duplicated in the merged dataset.
Merge two Pandas DataFrames on certain columns. We can merge two Pandas DataFrames on certain columns using the merge function by simply specifying the columns to merge on. Syntax: DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, copy=True, indicator=False, ...
There are two columns with the same name. Because duplicate column names would be ambiguous in the result, the merge() function appends suffixes to these columns. By default, Pandas uses ('_x', '_y') to differentiate them: _x for the column from the left DataFrame and _y for the one from the right. You can change these with the suffixes= parameter.
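To see where the suffixes come from, here is a small sketch with made-up data in which the two B columns deliberately differ. The frame contents and the "_df1"/"_df2" labels are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical frames: B exists in both but is not part of the merge key.
df1 = pd.DataFrame({"A": [1, 2], "B": ["left1", "left2"], "C": [10, 20]})
df2 = pd.DataFrame({"A": [1, 2], "B": ["right1", "right2"], "E": [5, 6]})

# Default suffixes: B_x holds df1's B (left), B_y holds df2's B (right).
default = pd.merge(df1, df2, on="A")

# Custom suffixes make the origin of each column explicit.
custom = pd.merge(df1, df2, on="A", suffixes=("_df1", "_df2"))

print(default)
print(custom)
```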
As the official Pandas documentation points out, since concat() and append() return new copies of DataFrames, overusing them can affect the performance of your program. Append is very useful when you want to merge two DataFrames along the row axis only. This means that instead of matching data on their columns, we want a new DataFrame that contains all the rows of the two DataFrames. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0; pd.concat() covers the same use case.
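For the row-wise case, a rough sketch using pd.concat() followed by drop_duplicates() might look like this. The two small frames are invented for the example.

```python
import pandas as pd

# Hypothetical frames with identical columns and one overlapping row.
top = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
bottom = pd.DataFrame({"A": [2, 3], "B": ["y", "z"]})

# Stack the rows; ignore_index=True renumbers the result from 0.
stacked = pd.concat([top, bottom], ignore_index=True)

# Remove rows that are identical across all columns.
unique_rows = stacked.drop_duplicates().reset_index(drop=True)

print(stacked)
print(unique_rows)
```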
The suffixes are added for any clashes in column names that are not involved in the merge operation (see the online docs): _x marks the column that came from the left dataframe and _y the one from the right.
So in your case, if you think that they are the same, you could just do the merge on both columns:
pd.merge(df1, df2, on=['A', 'B'])
What this will do though is return only the rows where the values of A and B exist in both dataframes, as the default merge type is an inner merge.
So what you could do is compare the size of this merged df with your first one and see if they are the same; if so, you could do the merge on both columns or just drop/rename the _x/_y suffixed B columns.
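One way to run that comparison, sketched on invented data where a single B value disagrees. The frames and the "DIFFERENT" marker are assumptions, and the column names follow the question.

```python
import pandas as pd

# Hypothetical frames: B disagrees for A == 3.
df1 = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"], "C": [10, 20, 30]})
df2 = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "DIFFERENT"], "E": [7, 8, 9]})

on_a = pd.merge(df1, df2, on="A")          # keeps B_x and B_y
on_ab = pd.merge(df1, df2, on=["A", "B"])  # keeps only rows where B also matches

# Rows where the two B columns disagree. NaN != NaN is True in pandas,
# so missing values show up here as mismatches too.
mismatches = on_a[on_a["B_x"] != on_a["B_y"]]

print(len(on_a), len(on_ab))
print(mismatches)
```

If the two merges have the same number of rows and mismatches is empty, merging on both columns loses nothing.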
I would spend time, though, determining whether these values are indeed the same and exist in both dataframes, in which case you may wish to perform an outer merge:
pd.merge(df1, df2, on=['A', 'B'], how='outer')
Then what you could do is drop duplicate rows (and possibly any NaN rows), and that should give you a clean merged dataframe.
merged_df.drop_duplicates(subset=['A', 'B'], inplace=True)
See online docs for drop_duplicates
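Putting the outer merge and the de-duplication together, a hedged end-to-end sketch could look like the following. The data is made up, and indicator=True is an optional extra that adds a _merge column showing which side each row came from.

```python
import pandas as pd

# Hypothetical frames: df1 contains a repeated row, df2 has a key df1 lacks.
df1 = pd.DataFrame({"A": [1, 2, 2], "B": ["x", "y", "y"], "C": [10, 20, 20]})
df2 = pd.DataFrame({"A": [2, 3], "B": ["y", "z"], "E": [8, 9]})

# Outer merge keeps rows from both sides; _merge records their origin.
merged_df = pd.merge(df1, df2, on=["A", "B"], how="outer", indicator=True)

# Keep the first row for each (A, B) pair.
clean = merged_df.drop_duplicates(subset=["A", "B"]).reset_index(drop=True)

print(merged_df)
print(clean)
```

Rows marked left_only or right_only have NaN in the columns from the other frame, which is where a follow-up dropna() would come in if you want complete rows only.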