Merging pandas columns (one-to-many)

Tags: python, pandas, excel

I am new to Python's pandas. I want to combine several Excel sheets by a common ID, and the sheets are in a one-to-many relationship.

Here are the inputs:

df1:

ID	Name
3763058	Andi
3763077	Mark

df2:

ID	Tag
3763058	item1
3763058	item2
3763058	item3
3763077	item4
3763077	item5
3763077	item6

I would now like to merge the two pandas data frames df1 and df2 into the following output (the Tag values are combined into a single entry per ID):

ID	Name	Tag
3763058	Andi	item1, item2, item3
3763077	Mark	item4, item5, item6

Could anybody please help me with this?

Cheers, Andi

948

asked Jun 30 '17 08:06

Andi Maier

1 Answer

You can first use groupby with join to collapse the tags into one string per ID:

df2 = df2.groupby('ID')['Tag'].apply(', '.join).reset_index()
print(df2)
        ID                  Tag
0  3763058  item1, item2, item3
1  3763077  item4, item5, item6
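
One caveat on the groupby step: ', '.join only accepts strings, so a missing or numeric Tag raises a TypeError. The following is a minimal sketch, not part of the original answer, that drops missing tags and casts the rest to str before joining. The helper name join_tags and the two extra rows (one missing tag, one numeric tag) are hypothetical, added only to show the edge case.

```python
import pandas as pd

# Data from the question, plus two hypothetical rows for ID 3763077:
# one missing tag (None) and one numeric tag (6).
df2 = pd.DataFrame({
    'ID': [3763058, 3763058, 3763058, 3763077, 3763077, 3763077, 3763077],
    'Tag': ['item1', 'item2', 'item3', 'item4', 'item5', None, 6],
})

def join_tags(tags):
    # Hypothetical helper: drop missing values, cast the rest to str, then join.
    return ', '.join(tags.dropna().astype(str))

tags = df2.groupby('ID')['Tag'].agg(join_tags).reset_index()
print(tags)
```

If every Tag is already a non-missing string, the plain ', '.join from the answer should be enough.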

Then you can use merge, which is especially convenient if df1 has more columns:

import pandas as pd

df = pd.merge(df1, df2, on='ID', how='left')
print(df)
        ID  Name                  Tag
0  3763058  Andi  item1, item2, item3
1  3763077  Mark  item4, item5, item6
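
To illustrate what how='left' does here, the sketch below (an illustration under stated assumptions, not part of the original answer) adds a hypothetical third person, Eva with ID 3763099, who has no tags. The left merge keeps her row with a missing Tag, which is then replaced by an empty string. The validate='one_to_one' argument is an optional safety check that makes merge raise an error if either side still contains duplicate IDs.

```python
import pandas as pd

# Data from the question; ID 3763099 / 'Eva' is a hypothetical row without tags.
df1 = pd.DataFrame({
    'ID': [3763058, 3763077, 3763099],
    'Name': ['Andi', 'Mark', 'Eva'],
})
df2 = pd.DataFrame({
    'ID': [3763058, 3763058, 3763058, 3763077, 3763077, 3763077],
    'Tag': ['item1', 'item2', 'item3', 'item4', 'item5', 'item6'],
})

# One row per ID, tags joined into a single string.
tags = df2.groupby('ID')['Tag'].apply(', '.join).reset_index()

# how='left' keeps every row of df1; validate checks that IDs are unique on both sides.
merged = df1.merge(tags, on='ID', how='left', validate='one_to_one')

# IDs without any tag come back as NaN; replace them with an empty string.
merged['Tag'] = merged['Tag'].fillna('')
print(merged)
```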

An alternative with map, if you only need to add one column:

df2 = df2.groupby('ID')['Tag'].apply(', '.join).reset_index()
df2['Name'] = df2['ID'].map(df1.set_index('ID')['Name'])
print(df2)
        ID                  Tag  Name
0  3763058  item1, item2, item3  Andi
1  3763077  item4, item5, item6  Mark

If the position of the Name column matters, use insert:

df2 = df2.groupby('ID')['Tag'].apply(', '.join).reset_index()
df2.insert(1, 'Name', df2['ID'].map(df1.set_index('ID')['Name']))
print(df2)
        ID  Name                  Tag
0  3763058  Andi  item1, item2, item3
1  3763077  Mark  item4, item5, item6
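
The map idea can also be turned around: instead of mapping names onto the grouped df2, map the joined tags onto df1. The sketch below is a variation under that assumption, not the code from the answer; the names tags_by_id and result are illustrative. It leaves df1 untouched and produces the column order ID, Name, Tag without needing insert.

```python
import pandas as pd

# Data from the question.
df1 = pd.DataFrame({'ID': [3763058, 3763077], 'Name': ['Andi', 'Mark']})
df2 = pd.DataFrame({
    'ID': [3763058, 3763058, 3763058, 3763077, 3763077, 3763077],
    'Tag': ['item1', 'item2', 'item3', 'item4', 'item5', 'item6'],
})

# Series indexed by ID, holding one joined string per ID.
tags_by_id = df2.groupby('ID')['Tag'].apply(', '.join)

# Look up each ID of df1 in that Series; assign returns a new frame.
result = df1.assign(Tag=df1['ID'].map(tags_by_id))
print(result)
```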

132

answered Sep 28 '22 07:09

jezrael
