I am new to Python’s Pandas. I want to combine several Excel sheets by a common ID. Besides, there it is a one-to-many relationship.
Here are the inputs:
df1:
ID | Name |
---|---|
3763058 | Andi |
3763077 | Mark |
df2:
ID | Tag |
---|---|
3763058 | item1 |
3763058 | item2 |
3763058 | item3 |
3763077 | item4 |
3763077 | item5 |
3763077 | item6 |
I would now like to merge the two pandas data frames df1 and df2 into the following output (the column tag is merged in a single column per ID):
ID | Name | Tag |
---|---|---|
3763058 | Andi | item1, item2, item3 |
3763077 | Mark | item4, item5, item6 |
Could anybody please help me with this?
Cheers, Andi
You can use DataFrame. apply() for concatenate multiple column values into a single column, with slightly less typing and more scalable when you want to join multiple columns .
To start, you may use this template to concatenate your column values (for strings only): df['New Column Name'] = df['1st Column Name'] + df['2nd Column Name'] + ... Notice that the plus symbol ('+') is used to perform the concatenation.
Pandas has optimized operations based on indices, allowing for faster lookup or merging tables based on indices. In the following example we merge the reviews table with the listings table, first using a column to merge on, then using the index. Even when having to set the index, merging on indices is faster.
Both join and merge can be used to combines two dataframes but the join method combines two dataframes on the basis of their indexes whereas the merge method is more versatile and allows us to specify columns beside the index to join on for both dataframes.
You can use first groupby
with join
:
df2 = df2.groupby('ID')['Tag'].apply(', '.join).reset_index()
print (df2)
ID Tag
0 3763058 item1, item2, item3
1 3763077 item_4, item_5, item_6
Then is possible use merge
, especially if df1
has more columns:
df = pd.merge(df1, df2, on='ID', how='left')
print (df)
ID Name Tag
0 3763058 Andi item1, item2, item3
1 3763077 Mark item_4, item_5, item_6
Solution with map
, if need add only one column:
df2 = df2.groupby('ID')['Tag'].apply(', '.join).reset_index()
df2['Name'] = df2['ID'].map(df1.set_index('ID')['Name'])
print (df2)
ID Tag Name
0 3763058 item1, item2, item3 Andi
1 3763077 item_4, item_5, item_6 Mark
If important position of Name
column use insert
:
df2 = df2.groupby('ID')['Tag'].apply(', '.join).reset_index()
df2.insert(1, 'Name', df2['ID'].map(df1.set_index('ID')['Name']))
print (df2)
ID Name Tag
0 3763058 Andi item1, item2, item3
1 3763077 Mark item_4, item_5, item_6
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With