How can I combine two dataframes based on a column of lists in Pandas?

Tags: python, merge, pandas, dataframe

import pandas as pd

Reproducible setup

I have two dataframes:

df=\
pd.DataFrame.from_dict({'A':['xy','yx','zy','zz'],
                        'B':[[1, 3],[4, 3, 5],[3],[2, 6]]})

df2=\
pd.DataFrame.from_dict({'B':[1,3,4,5,6],
                        'C':['pq','rs','pr','qs','sp']})

df looks like:

    A          B
0  xy     [1, 3]
1  yx  [4, 3, 5]
2  zy        [3]
3  zz     [2, 6]

df2 looks like:

   B   C
0  1  pq
1  3  rs
2  4  pr
3  5  qs
4  6  sp

Aim

I would like to combine these two to form res:

res=\
pd.DataFrame.from_dict({'A':['xy','yx','zy','zz'],
                        'C':['pq','pr','rs','sp']})

    A   C
0  xy  pq
1  yx  pr
2  zy  rs
3  zz  sp

The row with xy in df has the list [1, 3]. df2 has a row with value 1 in column B, and column C is pq in that row, so I combine xy with pq. The same logic applies to the next two rows. For the last row, whose list is [2, 6], there is no 2 in column B of df2, so I fall back to the next value in the list, 6, which gives sp.
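To pin the rule down, here is a loop-based sketch of the intended behaviour. It is only a specification of the expected result, not the vectorised solution being asked for. The helper name `first_match` and the `lookup` dict are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['xy', 'yx', 'zy', 'zz'],
                   'B': [[1, 3], [4, 3, 5], [3], [2, 6]]})
df2 = pd.DataFrame({'B': [1, 3, 4, 5, 6],
                    'C': ['pq', 'rs', 'pr', 'qs', 'sp']})

# Hypothetical helper: walk the list in order and return the label of the
# first value that appears in df2['B'], or None if nothing matches.
def first_match(values, lookup):
    for v in values:
        if v in lookup:
            return lookup[v]
    return None

# Plain dict from B value to C label (assumes B is unique in df2).
lookup = dict(zip(df2['B'], df2['C']))

expected = pd.DataFrame({'A': df['A'],
                         'C': [first_match(b, lookup) for b in df['B']]})
```
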

Question

How can I achieve this without iterating through the dataframe?

A very similar post in Spanish SO, which inspired this post.


asked Dec 25 '20 00:12

zabop

1 Answer

You can explode "B" into separate rows, then merge on "B" and drop duplicates.

Big thanks to Asish M. in the comments for pointing out a potential bug with the ordering.

(df.explode('B')
   .merge(df2, on='B', how='left')
   .dropna(subset=['C'])
   .drop_duplicates('A'))

    A  B   C
0  xy  1  pq
2  yx  4  pr
5  zy  3  rs
7  zz  6  sp
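For anyone who wants to run this end to end, here is a self-contained sketch of the same chain applied to the frames from the question. The final column selection and `reset_index` are additions, only there to make the result line up with the `res` frame the question asks for.

```python
import pandas as pd

df = pd.DataFrame({'A': ['xy', 'yx', 'zy', 'zz'],
                   'B': [[1, 3], [4, 3, 5], [3], [2, 6]]})
df2 = pd.DataFrame({'B': [1, 3, 4, 5, 6],
                    'C': ['pq', 'rs', 'pr', 'qs', 'sp']})

out = (
    df.explode('B')                      # one row per list element, list order kept
      .merge(df2, on='B', how='left')    # left merge keeps the exploded row order
      .dropna(subset=['C'])              # discard elements with no match in df2 (the 2)
      .drop_duplicates('A')              # keep the first surviving match per 'A'
      [['A', 'C']]                       # added: drop the helper column 'B'
      .reset_index(drop=True)            # added: renumber rows 0..3 like res
)

print(out)
```

Note that `drop_duplicates('A')` relies on the values in "A" being unique per original row, as they are in the question's data.
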

Ideally, the following should have worked:

df.explode('B').merge(df2).drop_duplicates('A')

However, pandas (version 1.2dev at the time of writing) does not preserve the order of the left keys on the default inner merge, which is a bug; see GH18776.

In the meantime, we can use the workaround of a left merge as shown above.
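If you would rather not depend on merge ordering at all, one possible alternative (not the approach above, just a sketch) is to look the exploded values up with `Series.map` and take the first non-null hit per original row. It assumes the values in `df2['B']` are unique and that the index of `df` is unique; the names `lookup`, `matched` and `first_hit` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['xy', 'yx', 'zy', 'zz'],
                   'B': [[1, 3], [4, 3, 5], [3], [2, 6]]})
df2 = pd.DataFrame({'B': [1, 3, 4, 5, 6],
                    'C': ['pq', 'rs', 'pr', 'qs', 'sp']})

# Series keyed by B value giving the C label (assumes B is unique in df2).
lookup = df2.set_index('B')['C']

# Explode the lists; the index repeats the original row label for each element.
# Values missing from the lookup map to NaN.
matched = df['B'].explode().map(lookup)

# GroupBy.first() returns the first non-null entry in each group,
# i.e. the first list element that has a match in df2.
first_hit = matched.groupby(level=0).first()

# assign aligns on the index, so each label lands on its original row.
res = df[['A']].assign(C=first_hit)
print(res)
```

Because the lookup never reorders rows, the result depends only on the order of the elements inside each list.
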


answered Sep 28 '22 07:09

cs95
