I am trying to merge/join two csv's, based on a unique city/country/state column combination, using Pandas. However, when I try to do this using an outer join, I am getting extra columns when instead I would prefer to have the "right" side of my join overwrite the columns in the "left" side of the join. Any suggestions?
Here is my attempt, with an example:
These are my csv's:
My "left" csv file:
| city | country | state | pop | lat | long |
|--------------+---------+-------+----------+---------+---------|
| beijing | cn | 22 | 456 | 456 | 456 |
| buenos aires | ar | 7 | 13076300 | -34.613 | -58.377 |
| mexico city | mx | 9 | 123 | 123 | 123 |
My "right" csv file:
| city | country | state | pop | lat | long |
|-------------+---------+-------+----------+-----------+------------|
| adamsville | us | al | 4400 | 33.60575 | -86.97465 |
| alabaster | us | al | 32707 | 33.219442 | -86.823907 |
| beijing | cn | 22 | 11716620 | 39.907 | 116.397 |
| mexico city | mx | 9 | 12294193 | 19.428 | -99.128 |
and I want this result:
| city | country | state | pop | lat | long |
|--------------+---------+-------+----------+-----------+------------|
| adamsville | us | al | 4400 | 33.60575 | -86.97465 |
| alabaster | us | al | 32707 | 33.219442 | -86.823907 |
| beijing | cn | 22 | 11716620 | 39.907 | 116.397 |
| buenos aires | ar | 7 | 13076300 | -34.613 | -58.377 |
| mexico city | mx | 9 | 12294193 | 19.428 | -99.128 |
Note that mexico city and beijing are considered matches, based on their city, country, and state columns. Also note that on these matching rows, each column from my "left" csv is overwritten by the matching column from my "right" csv.
So here is my attempt using Pandas and dataframes:
left = pd.read_csv('left.csv')
right = pd.read_csv('right.csv')
result = pd.merge(left, right, on=['city', 'country', 'state'], how='outer')
Unfortunately, here is my result:
| city | country | state | pop_x | lat_x | long_x | pop_y | lat_y | long_y |
|--------------+---------+-------+----------+-----------+------------+----------+-----------+------------|
| adamsville | us | al | 4400 | 33.60575 | -86.97465 | 4400 | 33.60575 | -86.97465 |
| alabaster | us | al | 32707 | 33.219442 | -86.823907 | 32707 | 33.219442 | -86.823907 |
| albertville | us | al | | 34.26313 | -86.21066 | | 34.26313 | -86.21066 |
| beijing | cn | 22 | 456 | 456 | 456 | 11716620 | 39.907 | 116.397 |
| buenos aires | ar | 7 | 13076300 | -34.613 | -58.377 | 13076300 | -34.613 | -58.377 |
| mexico city | mx | 9 | 123 | 123 | 123 | 12294193 | 19.428 | -99.128 |
| mumbai | in | 16 | 12691836 | 19.073 | 72.883 | 12691836 | 19.073 | 72.883 |
| shanghai | cn | 23 | 22315474 | 31.222 | 121.458 | 22315474 | 31.222 | 121.458 |
As shown above, the columns that are not being used for the join, and which have the same name in both dataframes, are renamed with a _x suffix for the "left" dataframe and a _y suffix for the "right" dataframe.
Is there a simple way to make the columns from the "right" dataframe overwrite the columns from the "left" dataframe when matched?
Although there seem to be similar questions already out there, I still can't seem to find an answer. For example, I tried implementing the solution based on this question:
left = pd.read_csv('left.csv')
right = pd.read_csv('right.csv')
left = left.set_index(['city','country','state'])
right = right.set_index(['city','country','state'])
left.update(right)
But update only performs left joins, so the resulting dataframe only has the rows from the left dataframe and is missing cities like adamsville and alabaster above.
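A minimal runnable check of that limitation, using small toy frames in place of the csv files (only a pop column, and state kept as strings, for brevity):

```python
import pandas as pd

# Toy stand-ins for the "left" and "right" csv files
left = pd.DataFrame({'city': ['beijing', 'buenos aires'],
                     'country': ['cn', 'ar'],
                     'state': ['22', '7'],
                     'pop': [456, 13076300]})
right = pd.DataFrame({'city': ['adamsville', 'beijing'],
                      'country': ['us', 'cn'],
                      'state': ['al', '22'],
                      'pop': [4400, 11716620]})

left = left.set_index(['city', 'country', 'state'])
right = right.set_index(['city', 'country', 'state'])
left.update(right)

# The matching row is overwritten with the right-side value...
assert left.loc[('beijing', 'cn', '22'), 'pop'] == 11716620
# ...but right-only rows like adamsville never show up
assert ('adamsville', 'us', 'al') not in left.index
```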
Since the column names for both dataframes are the same, you could stack them and then do a drop_duplicates or a groupby.
For example:
result = pd.concat([left, right]).reset_index(drop=True)
result.drop_duplicates(['city','country','state'], keep='first', inplace=True)
or:
df_stacked = pd.concat([left, right]).reset_index()
result = df_stacked.groupby(['city','country','state']).first()
Calling first takes the first non-null value for each column within each group, so it will take the values from the "left" df over the "right" df because we're stacking the "left" df on top of the "right" df and resetting the index.
Using groupby will allow you to perform more complex selects on the aggregated records if you don't want to just take the first or last record.
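For instance, here is a small self-contained sketch of the stacking approach, with toy frames standing in for the csv files (only a pop column, and pd.concat(..., ignore_index=True) in place of the reset_index call):

```python
import pandas as pd

# Toy stand-ins for the "left" and "right" csv files
left = pd.DataFrame({'city': ['beijing', 'buenos aires'],
                     'country': ['cn', 'ar'],
                     'state': ['22', '7'],
                     'pop': [456, 13076300]})
right = pd.DataFrame({'city': ['adamsville', 'beijing'],
                      'country': ['us', 'cn'],
                      'state': ['al', '22'],
                      'pop': [4400, 11716620]})

# "left" is stacked on top, so its rows come first within each group
df_stacked = pd.concat([left, right], ignore_index=True)
result = df_stacked.groupby(['city', 'country', 'state']).first()

assert result.loc[('beijing', 'cn', '22'), 'pop'] == 456      # left value wins
assert result.loc[('adamsville', 'us', 'al'), 'pop'] == 4400  # right-only row kept
```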
EDIT:
Just realized you want the "right" df to overwrite the "left" df. In that case:
df_stacked = pd.concat([right, left]).reset_index()
result = df_stacked.groupby(['city','country','state']).first()
This methodology only works if the "left" and "right" dataframes don't themselves contain duplicate records to start with.
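A quick check of the right-overwrites-left variant, again with toy frames standing in for the csv files (only a pop column, for brevity):

```python
import pandas as pd

# Toy stand-ins for the "left" and "right" csv files
left = pd.DataFrame({'city': ['beijing', 'buenos aires'],
                     'country': ['cn', 'ar'],
                     'state': ['22', '7'],
                     'pop': [456, 13076300]})
right = pd.DataFrame({'city': ['adamsville', 'beijing'],
                      'country': ['us', 'cn'],
                      'state': ['al', '22'],
                      'pop': [4400, 11716620]})

# "right" is stacked on top, so its rows win within each matching group
df_stacked = pd.concat([right, left], ignore_index=True)
result = df_stacked.groupby(['city', 'country', 'state']).first()

assert result.loc[('beijing', 'cn', '22'), 'pop'] == 11716620      # right overwrites left
assert result.loc[('buenos aires', 'ar', '7'), 'pop'] == 13076300  # left-only row kept
```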
And for the record, to get to the csv solution in the example above, we can perform the following:
result = result.reset_index()
# sort by descending population; where populations are equal (or NaN), sort by ascending city name
result = result.sort_values(['pop', 'city'], ascending=[False, True])
result.drop('index', axis=1, inplace=True)
result.to_csv('result.csv', index=False)