Suppose I have dataframe df1 which includes two columns - A & B. Value of A represents the lower range and value of B represents the upper range.
A B
10.5 20.5
30.5 40.5
50.5 60.5
I've another dataframe which includes two columns - C & D containing a different range of numbers.
C D
12.34 15.90
13.68 19.13
33.5 35.60
35.12 38.76
50.6 59.1
Now I want to list all the pairs from df2 that fall under the groups (between the lower and upper range) in the df1.
Final output should be like this -
Key Values
(10.5, 20.5) [(12.34, 15.90), (13.68, 19.13)]
(30.5, 40.5) [(33.5, 35.60), (35.12, 38.76)]
(50.5, 60.5) [(50.6, 59.1)]
The solution should be efficient as I have 5000 groups of range and 85000 numbers from different range.
The range is the difference between the smallest and highest numbers in a list or set. To find the range, first put all the numbers in order. Then subtract (take away) the lowest number from the highest. The answer gives you the range of the list.
This is because range() returns a range object. This range object is an object that stores a range of numbers. It is thus not a list or tuple and has nothing to do with them. The range object is iterable, though.
Using Python range() Python comes with a direct function range() which creates a sequence of numbers from start to stop values and print each item in the sequence.
To join pairs of list elements in Python: Use the range() class with a step argument or 2. Use a list comprehension to iterate over the range of indexes. On each iteration, join the list item at the current index with the item at the next index.
It is not blazing fast (~ 30 secs) on my computer) but could easily be accelerated with the multiprocessing
package if you have multiple cores.
Generating data :
def get_fake(n):
df = pd.DataFrame(np.random.rand(n * 2).reshape(-1, 2))
df.loc[:, 1] += 1
return df
df1 = get_fake(200)
df2 = get_fake(90000)
Then for the processing part :
from collections import defaultdict
result = defaultdict(list)
for index, start, stop in df1.itertuples():
subdf = df2[(start < df2.iloc[:, 0]) & (df2.iloc[:, 1] < stop)]
result[(start, stop)] += subdf.values.tolist()
Result is a dict but could easily be converted to a Series if necessary.
It will be easy if you use interval index
i.e
idx = pd.IntervalIndex.from_arrays(df['A'],df['B'])
keys = df.values.tolist()
values = df2.groupby(df.loc[idx.get_indexer(df2['C'])].index).apply(lambda x : x.values)
new_df = pd.DataFrame({'key': keys , 'value': values})
key value
0 [10.5, 20.5] [[12.34, 15.9], [13.68, 19.13]]
1 [30.5, 40.5] [[33.5, 35.6], [35.12, 38.76]]
2 [50.5, 60.5] [[50.6, 59.1]]
Accessing data based on interval index will give you the keys so you can groupby and aggregate i.e
df.loc[idx.get_indexer(df2['C'])]
A B
0 10.5 20.5
0 10.5 20.5
1 30.5 40.5
1 30.5 40.5
2 50.5 60.5
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With