
Merging pandas columns (one-to-many)

I am new to Python's pandas. I want to combine several Excel sheets by a common ID. The relationship between the sheets is one-to-many.

Here are the inputs:

df1:

ID Name
3763058 Andi
3763077 Mark

df2:

ID Tag
3763058 item1
3763058 item2
3763058 item3
3763077 item4
3763077 item5
3763077 item6

I would now like to merge the two pandas data frames df1 and df2 into the following output (the Tag values are combined into a single entry per ID):

ID Name Tag
3763058 Andi item1, item2, item3
3763077 Mark item4, item5, item6

Could anybody please help me with this?

Cheers, Andi

asked Jun 30 '17 08:06 by Andi Maier


People also ask

How do I combine multiple columns into one in pandas?

You can use DataFrame.apply() to concatenate the values of multiple columns into a single column. It takes slightly less typing and scales better when you want to join many columns.
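As a rough sketch of the apply() approach, assuming every selected column holds strings. The frame and its column names (first, last, city) are made up for illustration and do not come from the question:

```python
import pandas as pd

# Hypothetical example frame; names and values are invented for illustration.
people = pd.DataFrame({
    "first": ["Andi", "Mark"],
    "last": ["Maier", "Smith"],
    "city": ["Berlin", "Leeds"],
})

# axis=1 passes one row at a time (as a Series) to the function,
# which joins that row's values into a single string.
cols = ["first", "last", "city"]
people["combined"] = people[cols].apply(lambda row: " ".join(row), axis=1)

print(people["combined"].tolist())
```

Adding another column to the result only requires extending the cols list.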

How do I combine column values in pandas?

To start, you may use this template to concatenate your column values (for strings only): df['New Column Name'] = df['1st Column Name'] + df['2nd Column Name'] + ... Notice that the plus symbol ('+') is used to perform the concatenation.
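A small sketch of that template, using the df1 data from the question. The Label column name and the " - " separator are assumptions chosen for the example:

```python
import pandas as pd

df = pd.DataFrame({"ID": [3763058, 3763077], "Name": ["Andi", "Mark"]})

# "+" concatenates element-wise, but only between strings,
# so the integer ID column is converted with astype(str) first.
df["Label"] = df["ID"].astype(str) + " - " + df["Name"]

print(df["Label"].tolist())
```

Without the astype(str) conversion, adding an integer column to a string column raises an error, which is why the template is described as "for strings only".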

How is pandas merge so fast?

pandas has optimized operations based on indices, which allows for faster lookups and merges when tables are joined on their indices. Merging on the index can be faster than merging on a column, in some cases even when the index has to be set first.

What is the difference between join() and merge() in pandas?

Both join and merge can be used to combine two DataFrames. The join method combines them on the basis of their indexes, whereas the merge method is more versatile and lets you specify columns, besides the index, to join on for both DataFrames.
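To make the difference concrete, here is a minimal sketch that produces the same table both ways. It assumes the tags have already been collapsed to one row per ID; the variable names tags, merged and joined are only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [3763058, 3763077], "Name": ["Andi", "Mark"]})
tags = pd.DataFrame({
    "ID": [3763058, 3763077],
    "Tag": ["item1, item2, item3", "item4, item5, item6"],
})

# merge: the key column is named explicitly.
merged = df1.merge(tags, on="ID", how="left")

# join: aligns on the index of the other frame,
# so ID is moved into the index first and restored afterwards.
joined = df1.set_index("ID").join(tags.set_index("ID"), how="left").reset_index()

print(merged)
print(joined)
```

Both calls give the columns ID, Name and Tag; merge is usually the shorter choice when the key is an ordinary column.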


1 Answer

First, use groupby with join to collapse the tags of each ID into a single string:

df2 = df2.groupby('ID')['Tag'].apply(', '.join).reset_index()
print(df2)
        ID                  Tag
0  3763058  item1, item2, item3
1  3763077  item4, item5, item6
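For anyone who wants to run this step on its own, here is a self-contained sketch that builds df2 from the data in the question. The result is stored under a new name, tags, which is an assumption made here so that the original df2 stays untouched:

```python
import pandas as pd

df2 = pd.DataFrame({
    "ID": [3763058, 3763058, 3763058, 3763077, 3763077, 3763077],
    "Tag": ["item1", "item2", "item3", "item4", "item5", "item6"],
})

# One row per ID: the tags of each group are joined into a single string.
# reset_index() turns the group key back into an ordinary ID column.
tags = df2.groupby("ID")["Tag"].apply(", ".join).reset_index()

print(tags)
```

Note that str.join raises a TypeError if any tag is not a string (for example a number or NaN), so a column with mixed values would need to be cleaned or converted first.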

Then you can use merge, which is especially convenient if df1 has more columns:

import pandas as pd

df = pd.merge(df1, df2, on='ID', how='left')
print(df)
        ID  Name                  Tag
0  3763058  Andi  item1, item2, item3
1  3763077  Mark  item4, item5, item6
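The how='left' argument matters when some IDs in df1 have no tags at all. The sketch below illustrates this with one extra row (ID 9999999, "Eve") that is invented for the example and is not part of the question's data:

```python
import pandas as pd

# The third row is made up to show an ID without any tags.
df1 = pd.DataFrame({
    "ID": [3763058, 3763077, 9999999],
    "Name": ["Andi", "Mark", "Eve"],
})
df2 = pd.DataFrame({
    "ID": [3763058, 3763058, 3763058, 3763077, 3763077, 3763077],
    "Tag": ["item1", "item2", "item3", "item4", "item5", "item6"],
})

tags = df2.groupby("ID")["Tag"].apply(", ".join).reset_index()

# A left merge keeps every row of df1; IDs without tags get NaN in Tag.
left = pd.merge(df1, tags, on="ID", how="left")

# An inner merge keeps only IDs that appear in both frames.
inner = pd.merge(df1, tags, on="ID", how="inner")

print(left)
print(len(left), len(inner))
```

If the missing tags should show up as empty strings instead of NaN, left["Tag"].fillna("") is one way to do that.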

Alternatively, use map if you only need to add one column:

df2 = df2.groupby('ID')['Tag'].apply(', '.join).reset_index()
df2['Name'] = df2['ID'].map(df1.set_index('ID')['Name'])
print(df2)
        ID                  Tag  Name
0  3763058  item1, item2, item3  Andi
1  3763077  item4, item5, item6  Mark
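The same map idea also works in the other direction. This is a variation, not the code above: it maps the joined tags onto df1 instead of mapping names onto df2, which keeps every row of df1 and gives the column order ID, Name, Tag directly. The names tag_by_id and result are assumptions for this sketch:

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [3763058, 3763077], "Name": ["Andi", "Mark"]})
df2 = pd.DataFrame({
    "ID": [3763058, 3763058, 3763058, 3763077, 3763077, 3763077],
    "Tag": ["item1", "item2", "item3", "item4", "item5", "item6"],
})

# Without reset_index() the result is a Series indexed by ID,
# which map() can use as a lookup table.
tag_by_id = df2.groupby("ID")["Tag"].apply(", ".join)

result = df1.copy()
result["Tag"] = result["ID"].map(tag_by_id)

print(result)
```

IDs in df1 that have no entry in tag_by_id would get NaN in the Tag column.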

If the position of the Name column matters, use insert:

df2 = df2.groupby('ID')['Tag'].apply(', '.join).reset_index()
df2.insert(1, 'Name', df2['ID'].map(df1.set_index('ID')['Name']))
print(df2)
        ID  Name                  Tag
0  3763058  Andi  item1, item2, item3
1  3763077  Mark  item4, item5, item6
answered Sep 28 '22 07:09 by jezrael