Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to list all the pairs of numbers which fall under a group of range?

Suppose I have dataframe df1 which includes two columns - A & B. Value of A represents the lower range and value of B represents the upper range.

  A     B
10.5  20.5
30.5  40.5
50.5  60.5

I've another dataframe which includes two columns - C & D containing a different range of numbers.

  C     D
12.34  15.90
13.68  19.13
33.5   35.60
35.12  38.76
50.6   59.1

Now I want to list all the pairs from df2 that fall under the groups (between the lower and upper range) in the df1.

Final output should be like this -

     Key                Values
(10.5, 20.5)  [(12.34, 15.90), (13.68, 19.13)]
(30.5, 40.5)  [(33.5, 35.60), (35.12, 38.76)]
(50.5, 60.5)  [(50.6, 59.1)]

The solution should be efficient as I have 5000 groups of range and 85000 numbers from different range.

like image 820
Abdullah Al Imran Avatar asked Jun 09 '18 11:06

Abdullah Al Imran


People also ask

How do you list a range?

The range is the difference between the smallest and highest numbers in a list or set. To find the range, first put all the numbers in order. Then subtract (take away) the lowest number from the highest. The answer gives you the range of the list.

Is Range () a list?

This is because range() returns a range object. This range object is an object that stores a range of numbers. It is thus not a list or tuple and has nothing to do with them. The range object is iterable, though.

Which function creates a range of numbers in the form of list?

Using Python range() Python comes with a direct function range() which creates a sequence of numbers from start to stop values and print each item in the sequence.

How do you pair elements in a list in Python?

To join pairs of list elements in Python: Use the range() class with a step argument or 2. Use a list comprehension to iterate over the range of indexes. On each iteration, join the list item at the current index with the item at the next index.


2 Answers

It is not blazing fast (~ 30 secs) on my computer) but could easily be accelerated with the multiprocessing package if you have multiple cores.

Generating data :

def get_fake(n):
    df = pd.DataFrame(np.random.rand(n * 2).reshape(-1, 2))
    df.loc[:, 1] += 1
    return df

df1 = get_fake(200)
df2 = get_fake(90000)

Then for the processing part :

from collections import defaultdict
result = defaultdict(list)
for index, start, stop in df1.itertuples():
    subdf = df2[(start < df2.iloc[:, 0]) & (df2.iloc[:, 1] < stop)]
    result[(start, stop)] += subdf.values.tolist()

Result is a dict but could easily be converted to a Series if necessary.

like image 122
Jacquot Avatar answered Nov 01 '22 12:11

Jacquot


It will be easy if you use interval index i.e

idx = pd.IntervalIndex.from_arrays(df['A'],df['B'])
keys = df.values.tolist()
values = df2.groupby(df.loc[idx.get_indexer(df2['C'])].index).apply(lambda x : x.values)

new_df = pd.DataFrame({'key': keys , 'value': values})

          key                            value
0  [10.5, 20.5]  [[12.34, 15.9], [13.68, 19.13]]
1  [30.5, 40.5]   [[33.5, 35.6], [35.12, 38.76]]
2  [50.5, 60.5]                   [[50.6, 59.1]]

Accessing data based on interval index will give you the keys so you can groupby and aggregate i.e

df.loc[idx.get_indexer(df2['C'])]
     A     B
0  10.5  20.5
0  10.5  20.5
1  30.5  40.5
1  30.5  40.5
2  50.5  60.5
like image 37
Bharath Avatar answered Nov 01 '22 11:11

Bharath