How to list all the pairs of numbers which fall under a group of range?

Tags: performance, python, pandas, data-science

Suppose I have a dataframe df1 with two columns, A and B. The value of A is the lower bound of a range and the value of B is the upper bound.

  A      B
10.5   20.5
30.5   40.5
50.5   60.5

I have another dataframe, df2, with two columns, C and D, containing a different set of ranges.

  C     D
12.34  15.90
13.68  19.13
33.5   35.60
35.12  38.76
50.6   59.1

Now I want to list all the pairs from df2 that fall within each group in df1, i.e. between its lower and upper bound.

The final output should look like this:

     Key                Values
(10.5, 20.5)  [(12.34, 15.90), (13.68, 19.13)]
(30.5, 40.5)  [(33.5, 35.60), (35.12, 38.76)]
(50.5, 60.5)  [(50.6, 59.1)]

The solution should be efficient, as I have 5000 range groups and 85000 numbers from different ranges.

asked Jun 09 '18 11:06

Abdullah Al Imran

2 Answers

It is not blazing fast (about 30 seconds on my computer), but it could easily be accelerated with the multiprocessing package if you have multiple cores.

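Generating data:

import numpy as np
import pandas as pd

def get_fake(n):
    # n rows: column 0 is uniform in [0, 1), column 1 is shifted to [1, 2),
    # so every lower bound is below its upper bound
    df = pd.DataFrame(np.random.rand(n * 2).reshape(-1, 2))
    df.loc[:, 1] += 1
    return df

df1 = get_fake(200)    # ranges
df2 = get_fake(90000)  # pairs to classify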

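Then for the processing part:

from collections import defaultdict

result = defaultdict(list)
# itertuples() yields (index, column 0, column 1) for each range in df1
for index, start, stop in df1.itertuples():
    # keep the df2 rows that lie strictly inside (start, stop)
    subdf = df2[(start < df2.iloc[:, 0]) & (df2.iloc[:, 1] < stop)]
    result[(start, stop)] += subdf.values.tolist()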

The result is a dict, but it could easily be converted to a Series if necessary.
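
As an illustration of that last step, here is a minimal sketch that runs the same loop on the sample data from the question and then turns the dict into a Series. The level names "lower" and "upper" and the choice of a MultiIndex are assumptions made for this example, not part of the original answer.

```python
from collections import defaultdict

import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"A": [10.5, 30.5, 50.5], "B": [20.5, 40.5, 60.5]})
df2 = pd.DataFrame({"C": [12.34, 13.68, 33.5, 35.12, 50.6],
                    "D": [15.90, 19.13, 35.60, 38.76, 59.1]})

# Same loop as in the answer: collect the df2 rows strictly inside each range.
result = defaultdict(list)
for index, start, stop in df1.itertuples():
    subdf = df2[(start < df2.iloc[:, 0]) & (df2.iloc[:, 1] < stop)]
    result[(start, stop)] += subdf.values.tolist()

# One Series entry per range; the (lower, upper) tuples become a two-level index.
# The level names are hypothetical and chosen only for readability.
index = pd.MultiIndex.from_tuples(list(result.keys()), names=["lower", "upper"])
series = pd.Series(list(result.values()), index=index)
print(series)
```

Each value of the Series is a list of [C, D] pairs, so ranges with no matches show up as empty lists rather than being dropped.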

answered Nov 01 '22 12:11

Jacquot

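This is easy with an IntervalIndex. Here df is the frame of ranges (df1 in the question):

import pandas as pd

idx = pd.IntervalIndex.from_arrays(df['A'], df['B'])
keys = df.values.tolist()
# get_indexer maps each C value to the position of the interval that contains it
values = df2.groupby(df.loc[idx.get_indexer(df2['C'])].index).apply(lambda x: x.values)

new_df = pd.DataFrame({'key': keys, 'value': values})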

          key                            value
0  [10.5, 20.5]  [[12.34, 15.9], [13.68, 19.13]]
1  [30.5, 40.5]   [[33.5, 35.6], [35.12, 38.76]]
2  [50.5, 60.5]                   [[50.6, 59.1]]

Indexing df with the positions returned by the interval index gives the matching range for each row of df2, so you can group by it and aggregate:

df.loc[idx.get_indexer(df2['C'])]
     A     B
0  10.5  20.5
0  10.5  20.5
1  30.5  40.5
1  30.5  40.5
2  50.5  60.5
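
The lookup above only tests column C, and get_indexer returns -1 for a value that lies in no interval. A possible variant, sketched under the assumption that the ranges in df1 do not overlap, checks both C and D and skips pairs that match no range. The last two rows of df2 below are hypothetical and were added only to exercise those cases. Note that IntervalIndex is right-closed by default, so a value equal to an upper bound counts as inside.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [10.5, 30.5, 50.5], "B": [20.5, 40.5, 60.5]})
# First five rows are from the question; the last two are hypothetical:
# (18.0, 25.0) straddles an upper bound and (70.0, 71.0) lies outside every range.
df2 = pd.DataFrame({"C": [12.34, 13.68, 33.5, 35.12, 50.6, 18.0, 70.0],
                    "D": [15.90, 19.13, 35.60, 38.76, 59.1, 25.0, 71.0]})

# Assumes non-overlapping ranges; intervals are right-closed by default.
idx = pd.IntervalIndex.from_arrays(df1["A"], df1["B"])

# Position of the df1 range containing each endpoint; -1 means "no range".
pos_c = idx.get_indexer(df2["C"])
pos_d = idx.get_indexer(df2["D"])

# Keep a pair only if both endpoints fall inside the same range.
inside = (pos_c != -1) & (pos_c == pos_d)
matched = df2[inside]

# Start with an empty list per range so unmatched ranges are still listed.
result = {(float(a), float(b)): [] for a, b in zip(df1["A"], df1["B"])}
keys = list(result)
for pos, rows in matched.groupby(pos_c[inside]):
    result[keys[pos]] = rows.values.tolist()

print(result)
```

Because each endpoint is located with a single vectorized lookup instead of one boolean mask per range, this should scale better than a Python loop over 5000 ranges, though that has not been timed here.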

answered Nov 01 '22 11:11

Bharath
