import pandas as pd
import numpy as np
dict1 = {'col1': ['A', 'A', 'A', 'A', 'A','B', 'B', 'B', 'B', 'B' ],
'col2':[2, 2, 2, 3, 3, 2, 2, 3, 3 , 3],
'col3':[0.7, 0.8, 0.9, 0.95, 0.85, 0.65, 0.75, 0.45, 0.55, 0.75 ],
'col4':[100,200,300,400,500,600,700,800,900,1000]}
df1 = pd.DataFrame(data=dict1)
df1
dict2 = {'col1': ['A', 'B' ],
'col2':[0.75, 0.65],
'col3':[1000, 2000 ],
'col4':[0.8, 0.9]}
df2 = pd.DataFrame(data=dict2)
df2
In fastest way how to filter df1 using df2, depending on df1['col3'] >= df2['col2'] for equal col1s?
Intended outcome
>>> df1
col1 col2 col3 col4
1 A 2 0.80 200
2 A 2 0.90 300
3 A 3 0.95 400
4 A 3 0.85 500
5 B 2 0.65 600
6 B 2 0.75 700
9 B 3 0.75 1000
My attempt gave the following error
>>> df1= df1[df1['col3'] >= float(df2[df2['col1']==df1['col1']]['col2'].values[0])]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/burcak/anaconda3/lib/python3.7/site-packages/pandas/core/ops/common.py", line 64, in new_method
return method(self, other)
File "/home/burcak/anaconda3/lib/python3.7/site-packages/pandas/core/ops/__init__.py", line 521, in wrapper
raise ValueError("Can only compare identically-labeled Series objects")
ValueError: Can only compare identically-labeled Series objects
Use the syntax new_DataFrame = DataFrame[(DataFrame[column]==criteria1) operator (DataFrame[column2]==criteria2)] , where operator is & or | , to filter a pandas. DataFrame by multiple columns.
Filter Rows by Condition You can use df[df["Courses"] == 'Spark'] to filter rows by a condition in pandas DataFrame. Not that this expression returns a new DataFrame with selected rows.
I will do merge
out = df1.merge(df2[['col1','col2']], on = 'col1', suffixes = ('','1')).query('col3>=col21').drop('col21',1)
out
Out[15]:
col1 col2 col3 col4
1 A 2 0.80 200
2 A 2 0.90 300
3 A 3 0.95 400
4 A 3 0.85 500
5 B 2 0.65 600
6 B 2 0.75 700
9 B 3 0.75 1000
Or reindex
out = df1[df1['col3'] >= df2.set_index('col1')['col2'].reindex(df1['col1']).values]
Out[19]:
col1 col2 col3 col4
1 A 2 0.80 200
2 A 2 0.90 300
3 A 3 0.95 400
4 A 3 0.85 500
5 B 2 0.65 600
6 B 2 0.75 700
9 B 3 0.75 1000
You could also use map
:
df1.loc[df1.col3 >= df1.col1.map(df2.set_index("col1").col2)]
My method would be similar to @Ben_Yo 's merge answer, but more lines of code, but perhaps a little more straightforward.
You simply:
s
s
into a boolean series that returns True
or False
according to the condition, which in this case is s['col3'] >= s['col2']
s
to df1
, and the outcome will exclude rows that returned False
in the boolean series s
:s = pd.merge(df1[['col1', 'col3']], df2[['col1', 'col2']], how='left', on='col1')
s = s['col3'] >= s['col2']
df1[s]
Out[1]:
col1 col2 col3 col4
1 A 2 0.80 200
2 A 2 0.90 300
3 A 3 0.95 400
4 A 3 0.85 500
5 B 2 0.65 600
6 B 2 0.75 700
9 B 3 0.75 1000
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With