
Pandas' merge returns a column with _x appended to the name

Tags:

python

pandas

I have two dataframes: df1 has columns A, B, C, D... and df2 has columns A, B, E, F...

The keys I want to merge on are in column A. B is also (most likely) the same in both dataframes, though this is a big data set that I am still cleaning, so I do not have a very good overview of everything yet.

I do

merge(df1, df2, on='A') 

And the result contains a column called B_x. Since the data set is big and messy, I haven't tried to investigate how B_x differs from B in df1 and B in df2.

So my question is just in general: what does Pandas mean when it has appended the _x to a column name in the merged dataframe?

Thank you

luffe asked Apr 21 '14 12:04



1 Answer

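The suffixes are added to any clashing column names that are not involved in the merge operation; see the online docs. By default, the clashing column from the left dataframe gets _x and the one from the right dataframe gets _y.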
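To make the suffix behaviour concrete, here is a minimal sketch. The toy dataframes are made up for illustration and only mirror the column layout from the question; the `suffixes` argument of `pd.merge` lets you pick more descriptive names than the defaults.

```python
import pandas as pd

# Hypothetical toy data: both frames share the key A and the clashing column B.
df1 = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"], "C": [10, 20, 30]})
df2 = pd.DataFrame({"A": [1, 2, 4], "B": ["x", "q", "w"], "E": [0.1, 0.2, 0.4]})

# B is not a merge key, so it appears twice: B_x (from df1) and B_y (from df2).
merged = pd.merge(df1, df2, on="A")
default_columns = list(merged.columns)

# The same merge with custom suffixes instead of the default _x/_y.
named = pd.merge(df1, df2, on="A", suffixes=("_df1", "_df2"))
named_columns = list(named.columns)

print(default_columns)
print(named_columns)
```

Descriptive suffixes can make it easier to see which source each B column came from while you are still investigating the data.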

So in your case, if you think that they are the same, you could just do the merge on both columns:

pd.merge(df1, df2, on=['A', 'B']) 

What this will do, though, is return only the rows where the A and B values exist in both dataframes, because the default merge type is an inner merge.
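As a rough illustration of that effect, the sketch below uses made-up toy data in which one row has the same A but a different B in each dataframe. Merging on A alone keeps that row; merging on both A and B drops it.

```python
import pandas as pd

# Hypothetical toy data: A == 2 exists in both frames, but with different B values.
df1 = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"], "C": [10, 20, 30]})
df2 = pd.DataFrame({"A": [1, 2, 4], "B": ["x", "q", "w"], "E": [0.1, 0.2, 0.4]})

on_a = pd.merge(df1, df2, on="A")           # inner merge on A only
on_a_b = pd.merge(df1, df2, on=["A", "B"])  # inner merge on A and B

# The row with A == 2 survives the first merge but not the second,
# because its B values disagree between df1 and df2.
rows_on_a = len(on_a)
rows_on_a_b = len(on_a_b)
print(rows_on_a, rows_on_a_b)
```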

So you could compare the size of this merged df with your first one. If they are the same, you could either merge on both columns, or merge on A alone and then drop or rename the B_x/B_y columns.
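Another way to check whether the two B columns really agree is to compare them directly after merging on A. The helper below is a hypothetical sketch, not part of pandas: it collapses B_x and B_y into a single B column only when they match on every row.

```python
import pandas as pd

def collapse_b(merged: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep one B column only if B_x and B_y agree on every row."""
    # Caveat: NaN != NaN, so rows where both sides are missing count as mismatches.
    mismatches = merged[merged["B_x"] != merged["B_y"]]
    if mismatches.empty:
        return merged.drop(columns="B_y").rename(columns={"B_x": "B"})
    return merged  # leave both columns in place so the differences can be inspected

# Hypothetical toy data: B agrees wherever A matches.
df1 = pd.DataFrame({"A": [1, 2], "B": ["x", "y"], "C": [10, 20]})
df2 = pd.DataFrame({"A": [1, 2], "B": ["x", "y"], "E": [0.1, 0.2]})
clean = collapse_b(pd.merge(df1, df2, on="A"))

# Same key, but one B value differs, so nothing is collapsed.
df3 = pd.DataFrame({"A": [1, 2], "B": ["x", "q"], "E": [0.1, 0.2]})
kept = collapse_b(pd.merge(df1, df3, on="A"))

print(list(clean.columns))
print(list(kept.columns))
```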

I would spend time, though, determining whether these values are indeed the same and exist in both dataframes. If they are not, you may wish to perform an outer merge, which keeps the rows from both sides:

pd.merge(df1, df2, on=['A', 'B'], how='outer') 

You could then drop duplicate rows (and possibly any NaN rows), which should give you a clean merged dataframe.

merged_df.drop_duplicates(subset=['A', 'B'], inplace=True)

See online docs for drop_duplicates
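Putting the outer merge and the de-duplication together, a minimal sketch might look like this. The toy data is made up, and df2 deliberately contains a repeated row so that the merge produces a duplicate (A, B) pair.

```python
import pandas as pd

# Hypothetical toy data; df2 repeats the row (1, "x") on purpose.
df1 = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"], "C": [10, 20, 30]})
df2 = pd.DataFrame({"A": [1, 1, 2, 4], "B": ["x", "x", "q", "w"], "E": [0.1, 0.1, 0.2, 0.4]})

# An outer merge keeps every (A, B) pair from either frame; columns that have
# no value on one side are filled with NaN.
merged_df = pd.merge(df1, df2, on=["A", "B"], how="outer")
rows_before = len(merged_df)

# Keep the first row for each (A, B) pair.
deduped = merged_df.drop_duplicates(subset=["A", "B"])
rows_after = len(deduped)

# Optionally drop rows that are missing a value from either side.
complete = deduped.dropna()

print(rows_before, rows_after, len(complete))
```

Whether `dropna()` is appropriate depends on whether rows that appear in only one dataframe should be kept; in this sketch it leaves only the pairs present in both.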

EdChum answered Sep 20 '22 17:09