
Pandas DataFrame filter rows using another DataFrame Column

Tags: python, pandas, dataframe

import pandas as pd
import numpy as np

dict1 = {'col1': ['A', 'A', 'A', 'A', 'A','B', 'B', 'B', 'B', 'B' ], 
       'col2':[2, 2, 2, 3, 3, 2, 2, 3, 3 , 3], 
       'col3':[0.7, 0.8, 0.9, 0.95, 0.85, 0.65, 0.75, 0.45, 0.55, 0.75 ],
       'col4':[100,200,300,400,500,600,700,800,900,1000]}
df1 = pd.DataFrame(data=dict1)
df1

dict2 = {'col1': ['A', 'B' ], 
       'col2':[0.75, 0.65], 
       'col3':[1000, 2000 ],
       'col4':[0.8, 0.9]}
df2 = pd.DataFrame(data=dict2)
df2

What is the fastest way to filter df1 using df2, keeping only the rows where df1['col3'] >= df2['col2'] for matching col1 values?

Intended outcome

>>> df1
  col1  col2  col3  col4
1    A     2  0.80   200
2    A     2  0.90   300
3    A     3  0.95   400
4    A     3  0.85   500
5    B     2  0.65   600
6    B     2  0.75   700
9    B     3  0.75  1000

My attempt gave the following error

>>> df1= df1[df1['col3'] >= float(df2[df2['col1']==df1['col1']]['col2'].values[0])]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/burcak/anaconda3/lib/python3.7/site-packages/pandas/core/ops/common.py", line 64, in new_method
    return method(self, other)
  File "/home/burcak/anaconda3/lib/python3.7/site-packages/pandas/core/ops/__init__.py", line 521, in wrapper
    raise ValueError("Can only compare identically-labeled Series objects")
ValueError: Can only compare identically-labeled Series objects


asked Sep 18 '20 00:09

burcak

2 Answers

I would use merge:

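# df2's col2 is suffixed to 'col21' on merge; filter on it, then drop the helper column.
# drop(columns=...) replaces the positional axis argument, which recent pandas versions no longer accept.
out = df1.merge(df2[['col1', 'col2']], on='col1', suffixes=('', '1')).query('col3 >= col21').drop(columns='col21')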

out
Out[15]: 
  col1  col2  col3  col4
1    A     2  0.80   200
2    A     2  0.90   300
3    A     3  0.95   400
4    A     3  0.85   500
5    B     2  0.65   600
6    B     2  0.75   700
9    B     3  0.75  1000

Or use reindex:

out = df1[df1['col3'] >= df2.set_index('col1')['col2'].reindex(df1['col1']).values]
Out[19]: 
  col1  col2  col3  col4
1    A     2  0.80   200
2    A     2  0.90   300
3    A     3  0.95   400
4    A     3  0.85   500
5    B     2  0.65   600
6    B     2  0.75   700
9    B     3  0.75  1000

You could also use map:

df1.loc[df1.col3 >= df1.col1.map(df2.set_index("col1").col2)]
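
For reference, here is a self-contained sketch of the map approach on the question's data. The names thresholds, mask and result are only illustrative, and the sketch assumes each col1 value appears once in df2; any df1 row whose col1 is missing from df2 would get NaN as its threshold and be dropped.

```python
import pandas as pd

df1 = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'col2': [2, 2, 2, 3, 3, 2, 2, 3, 3, 3],
    'col3': [0.7, 0.8, 0.9, 0.95, 0.85, 0.65, 0.75, 0.45, 0.55, 0.75],
    'col4': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
})
df2 = pd.DataFrame({
    'col1': ['A', 'B'],
    'col2': [0.75, 0.65],
    'col3': [1000, 2000],
    'col4': [0.8, 0.9],
})

# Lookup Series: col1 label -> threshold taken from df2['col2']
# (assumes col1 is unique in df2)
thresholds = df2.set_index('col1')['col2']

# map() returns a Series that keeps df1's index, so the comparison
# is between identically-labeled Series
mask = df1['col3'] >= df1['col1'].map(thresholds)

result = df1[mask]
print(result)
```

Because the mapped Series shares df1's index, this avoids the "identically-labeled Series" error from the question without building a merged frame.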


answered Oct 11 '22 20:10

BENY

My method is similar to @Ben_Yo's merge answer. It takes more lines of code, but is perhaps a little more straightforward.

You simply:

1. Merge the column in and create a new dataframe s.
2. Turn the dataframe s into a boolean series that is True or False according to the condition, which in this case is s['col3'] >= s['col2'].
3. Finally, pass s to df1; the result excludes the rows where the boolean series s is False:

s = pd.merge(df1[['col1', 'col3']], df2[['col1', 'col2']], how='left', on='col1')
s = s['col3'] >= s['col2']
df1[s]
Out[1]: 
  col1  col2  col3  col4
1    A     2  0.80   200
2    A     2  0.90   300
3    A     3  0.95   400
4    A     3  0.85   500
5    B     2  0.65   600
6    B     2  0.75   700
9    B     3  0.75  1000
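
One caveat with df1[s]: the merged frame gets a fresh 0..n-1 index, so the boolean series lines up with df1 only when df1 also has the default index, as it does in the question. The following is a minimal sketch, assuming a hypothetical df1 with a non-default index and unique col1 values in df2, that passes the mask as a NumPy array so rows are selected by position instead of by label.

```python
import pandas as pd

# Hypothetical data with a non-default index on df1
df1 = pd.DataFrame(
    {'col1': ['A', 'A', 'B', 'B'], 'col3': [0.7, 0.8, 0.65, 0.45]},
    index=[10, 11, 12, 13],
)
df2 = pd.DataFrame({'col1': ['A', 'B'], 'col2': [0.75, 0.65]})

# A left merge keeps df1's row order; with unique col1 in df2 it also
# keeps the row count, but the result index is reset to 0..n-1
s = pd.merge(df1[['col1', 'col3']], df2[['col1', 'col2']], how='left', on='col1')

# Convert the mask to a plain array so it is applied positionally,
# independent of df1's index labels
mask = (s['col3'] >= s['col2']).to_numpy()

result = df1[mask]
print(result)
```

This relies on the merged frame having exactly one row per row of df1, in the same order; duplicate col1 values in df2 would break that assumption.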

answered Oct 11 '22 21:10

David Erickson
