As input I have two dataframes:
data1 = [{'code':100}, {'code':120}, {'code':110}]
data1 = pd.DataFrame(data1)
code
0 100
1 120
2 110
data2 = [{'category':1, 'l_bound':99, 'r_bound':105},{'category':2, 'l_bound':107, 'r_bound':110},{'category':3, 'l_bound':117, 'r_bound':135}]
data2 = pd.DataFrame(data2)
category l_bound r_bound
0 1 99 105
1 2 107 110
2 3 117 135
I want to get in the end the following dataframe, with additional column in the first dataframe as a category number if the code lies in the corresponding interval:
code category
0 100 1
1 120 3
2 110 2
Intervals are random and the original dataframes are pretty big. Looping with itertuples is too slow. Any pythonic solutions?
Recreate the dataset:
import pandas as pd
data1 = [{'code':100}, {'code':120}, {'code':113}]
data2 = [{'category':1, 'l_bound':99, 'r_bound':105},
{'category':2, 'l_bound':107, 'r_bound':110},
{'category':3, 'l_bound':117, 'r_bound':135}]
data1 = pd.DataFrame(data1)
data2 = pd.DataFrame(data2)
@cᴏʟᴅsᴘᴇᴇᴅ answer (prefered), follow duplicate link:
idx = pd.IntervalIndex.from_arrays(data2['l_bound'], data2['r_bound'], closed='both')
category = data2.loc[idx.get_indexer(data1.code), 'category']
data1['category'] = category.values
Here is a different approach. Create a map with value in range and categories.
# Create a map
d = {i:k for k,v in data2.set_index('category').to_dict('i').items()
for i in range(v['l_bound'],v['r_bound']+1)}
# Use map to add new column
data1['category'] = data1.code.map(d)
Finally
print(data1)
Returns:
code category
0 100 1.0
1 120 3.0
2 113 NaN
If you want int we can do this:
data1.code.map(d).fillna(-1).astype(int) # -1 meaning no match
And we get:
code category
0 100 1
1 120 3
2 113 -1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With