I am struggling to find the lowest value within a range defined by values in another column. The range is always delimited by two identical values in the boo column (1-1, 2-2), as shown in the example below. The values in the boo column are not known in advance (so I cannot build a list and compare against it), because they are computed a few steps earlier in the code.
Dataframe example:
      foo         boo
15    36.377949
16    42.489706   1
17    41.223734
18    32.281779   0
19    22.888312   2
20    12.847996
21     6.876954
22   -23.872935   1
23   -31.858878
24   -39.404905   3
25   -47.724924   2
26    -4.8161051  3
The preferred output is a new dataframe column:
      foo         boo  min
15    36.377949
16    42.489706   1
17    41.223734
18    32.281779   0
19    22.888312   2
20    12.847996
21     6.876954
22   -23.872935   1    -23.872935
23   -31.858878
24   -39.404905   3
25   -47.724924   2    -47.724924
26    -4.8161051  3    -47.724924
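For convenience, the example input frame can be reconstructed like this (values copied from the tables above):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "foo": [36.377949, 42.489706, 41.223734, 32.281779, 22.888312,
                12.847996, 6.876954, -23.872935, -31.858878, -39.404905,
                -47.724924, -4.8161051],
        "boo": [np.nan, 1, np.nan, 0, 2, np.nan, np.nan, 1, np.nan, 3, 2, 3],
    },
    index=range(15, 27),
)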
I know how to solve this using a basic for loop (not leveraging Pandas functions and speed), so I would like to keep this at the dataframe/Pandas/Numpy level, if possible. Is there a way to do it using Pandas/Numpy? Any comments, suggestions and help are appreciated!
I have tried to implement both suggested methods (and they work for small data sets!), however the execution time is not good with larger datasets. I use dataframes with 1.5 - 2.5 million rows, which would take "forever" (judging by the growth of the execution times). function1 is the one using find_min_in_range (from Pedro Rocha) and function2 is the one using a for loop (from mozway).
[Chart: execution times for 10000 to 200000 rows; func1 adjusted; iterating the whole df took 4-5 minutes]
I've done another round of testing using the provided solutions. It seems like there is significant improvement.
Below are test results with more metrics (number of ranges).
One of the runs finished in 172 seconds [1725410 rows and 204954 ranges], but I am unsure whether it will work in our use case - because we are executing the script on 1 vCPU machines - but I am going to try it ^^ The number of ranges is always less than 25 % of the total row count.
[Charts: execution times with a lower number of ranges and with a higher number of ranges]
However, I am unable to reproduce Option 2 from Pedro. Code below (copy-paste of the generator and Option 2):
import random
import numpy as np
import pandas as pd

num_rows = 2000000
num_ranges = 10000
foo_values = [random.uniform(-100, 100) for i in range(num_rows)]
boo_values = [i for i in range(num_ranges)]
boo_values.extend([i for i in range(num_ranges)])
a = np.empty(num_rows - len(boo_values))
a[:] = np.nan
boo_values.extend(a)
random.shuffle(boo_values)
df = pd.DataFrame({"foo": foo_values, "boo": boo_values})

df["min"] = np.nan
idx = df[df.groupby('boo').cumcount() == 1].index
df.loc[idx,"min"] = df.loc[idx].apply(lambda row: df.loc[range(*df[df.boo == row["boo"]].index[[0,-1]]+[0,1]),"foo"].min(), axis=1)
Error:
File "/.../option2.py", line 96, in <lambda>
df.loc[idx,"min"] = df.loc[idx].apply(lambda row: df.loc[range(*df[df.boo == row["boo"]].index[[0,-1]]+[0,1]),"foo"].min(), axis=1)
File "/.../venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5069, in __getitem__
result = getitem(key)
IndexError: index 0 is out of bounds for axis 0 with size 0
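A quick check I would run here (a hypothetical diagnostic, not from the original post) is whether the lambda ever receives a boo value for which the equality lookup comes back empty, since an empty .index would raise exactly this IndexError:
# Hypothetical sanity check: count boo values selected by the cumcount
# filter whose equality lookup returns no matching rows at all.
bad = [b for b in df.loc[idx, "boo"] if df[df.boo == b].empty]
print(len(bad))  # anything > 0 would explain the empty-index failure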
Also, I am thinking about closing this question and creating a new one focused on performance enhancements, with the provided data sample and a more precise description.
I am sharing the way I benchmark the different approaches, using timeit:
from timeit import default_timer as timer
import pandas as pd
import numpy as np

# Skipped loading the dataframe; it can be mocked via read_parquet from the given example file
times = []
for i in range(10000, 210000, 10000):
    df = input_df[:i].copy()
    start = timer()
    # here goes the code of one or more solution/s
    ...
    f = {'x': round(timer() - start, 1)}
    times.append(f)

# Print out the execution times
for t in times:
    print(t['x'])
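Alternatively, for a fully self-contained run, input_df can be mocked with the same kind of generator Pedro uses in his answer below (the sizes here are assumptions for a quick benchmark, not my real data):
import random
import numpy as np
import pandas as pd

num_rows, num_ranges = 210000, 4000                    # assumed sizes, not the real dataset
boo_values = list(range(num_ranges)) * 2               # paired open/close range markers
boo_values += [np.nan] * (num_rows - len(boo_values))  # pad the remaining rows with NaN
random.shuffle(boo_values)
input_df = pd.DataFrame({
    "foo": [random.uniform(-100, 100) for _ in range(num_rows)],
    "boo": boo_values,
})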
You cannot fully vectorize this as the ranges can overlap, but you can limit the loop to the unique values in boo:
for v in df['boo'].dropna().unique():
    m = df['boo'].eq(v)
    # keep rows from the first occurrence of v (m.cummax) up to the
    # last occurrence (cummax on the reversed mask, realigned by index)
    s = df.loc[m.cummax() & m[::-1].cummax(), 'foo']
    if len(s) > 1:  # skip open ranges with a single marker
        df.loc[s.index[-1], 'min'] = s.min()
print(df)
Output:
foo boo min
15 36.377949 NaN NaN
16 42.489706 1.0 NaN
17 41.223734 NaN NaN
18 32.281779 0.0 NaN
19 22.888312 2.0 NaN
20 12.847996 NaN NaN
21 6.876954 NaN NaN
22 -23.872935 1.0 -23.872935
23 -31.858878 NaN NaN
24 -39.404905 3.0 NaN
25 -47.724924 2.0 -47.724924
26 -4.816105 3.0 -47.724924
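To make the masking step concrete, here is what it produces for v = 1 on the example column (a small sketch using a 0-based index; the intermediate values follow directly from the code above):
import pandas as pd

boo = pd.Series([None, 1, None, 0, 2, None, None, 1, None, 3, 2, 3])
m = boo.eq(1)
# m.cummax() turns True at the first 1 and stays True; the reversed cummax,
# realigned by index, is True up to and including the last 1. Their AND
# marks exactly the closed span between the two markers.
mask = m.cummax() & m[::-1].cummax()
print(mask.tolist())
# [False, True, True, True, True, True, True, True, False, False, False, False]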
This code first finds the indices of the occurrences: if the value in the "boo" column is NaN or is the first occurrence, it returns NaN; otherwise it returns the min value in the range.
Option 1:
import numpy as np
import pandas as pd

def find_min_in_range(row):
    # indices of the first and last occurrence of this boo value
    idx = df[df.boo == row["boo"]].index
    if row.name == idx[0]:
        return np.nan
    else:
        # idx + [0, 1] shifts only the end bound, giving range(first, last + 1)
        return df.loc[range(*(idx) + [0, 1]), "foo"].min()

df["min"] = np.nan
idx = [df.boo[df.boo == i].index.max() for i in df.boo.dropna().unique()]
df.loc[idx, "min"] = df.loc[idx].apply(find_min_in_range, axis=1)
Option 2: find idx using groupby + cumcount, also using a lambda function. Better performance if you have a lot of open ranges (a single occurrence in the "boo" column, like range 0 in the example):
import pandas as pd
import numpy as np

df["min"] = np.nan
# the second occurrence of each boo value is the row that closes a range
idx = df[df.groupby('boo').cumcount() == 1].index
df.loc[idx, "min"] = df.loc[idx].apply(
    lambda row: df.loc[range(*df[df.boo == row["boo"]].index[[0, -1]] + [0, 1]), "foo"].min(),
    axis=1,
)
Option 3: using OOP + threading (note: the workers are threading.Thread instances, not separate processes). Improves performance for large datasets:
import pandas as pd
import numpy as np
from threading import Thread

class MinFinder():
    def __init__(self, df):
        self.df = df
        self.range_idx = self.get_range_idx()
        self.min_values = {}

    def get_range_idx(self):
        # map each boo value to the array of row indices where it occurs
        df_tmp = self.df.reset_index().dropna(subset="boo")
        df_tmp.boo = df_tmp.boo.astype(int)
        df_tmp = df_tmp.set_index("boo")
        return df_tmp.groupby(df_tmp.index)["index"].unique().to_dict()

    def find_min_in_range(self, r):
        idx = self.range_idx[int(r)]
        # store the min of the closed span under the index of the closing row
        self.min_values[idx.max()] = self.df.loc[range(*df[df.boo == r].index[[0, -1]] + [0, 1]), "foo"].min()

finder = MinFinder(df)
t = []
for r in df[df.groupby('boo').cumcount() == 1]["boo"].values:
    t.append(Thread(target=finder.find_min_in_range, args=(r,)))
[thread.start() for thread in t]
[thread.join() for thread in t]
df["min"] = np.nan
df.loc[finder.min_values.keys(), "min"] = [v for v in finder.min_values.values()]
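A caveat worth noting (my observation, not from the answer itself): since these workers are threads within one interpreter, they share the GIL, so the speedup comes from overlapping pandas work rather than true parallelism, and on the 1 vCPU machines mentioned above the gain may largely disappear.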
Output:
|    | foo      | boo | min      |
|----|----------|-----|----------|
| 0  | 36.3779  | nan | nan      |
| 1  | 42.4897  | 1   | nan      |
| 2  | 41.2237  | nan | nan      |
| 3  | 32.2818  | 0   | nan      |
| 4  | 22.8883  | 2   | nan      |
| 5  | 12.848   | nan | nan      |
| 6  | 6.87695  | nan | nan      |
| 7  | -23.8729 | 1   | -23.8729 |
| 8  | -31.8589 | nan | nan      |
| 9  | -39.4049 | 3   | nan      |
| 10 | -47.7249 | 2   | -47.7249 |
| 11 | -4.81611 | 3   | -47.7249 |
Testing:
Option 1 summary: I've tested the code with 2M rows and 10k ranges and the test finished after 6:40 min, while for 2M rows and 1k ranges it takes 42.80 secs.
Option 3 summary: I've tested the code with 2M rows and 10k ranges and the test finished after 4:20 min, while for 2M rows and 1k ranges it takes 31.40 secs.
Tested on a MacBook Pro, M1 Pro, 32 GB.
Option 1 results:
| Rows    | Ranges | Time (secs) |
|---------|--------|-------------|
| 50000   | 100    | 0.13        |
| 150000  | 1100   | 3.70        |
| 250000  | 2100   | 10.20       |
| 350000  | 3100   | 22.00       |
| 450000  | 4100   | 38.00       |
| 550000  | 5100   | 58.20       |
| 650000  | 6100   | 84.00       |
| 750000  | 7100   | 114.50      |
| 850000  | 8100   | 145.16      |
| 950000  | 9100   | 180.70      |
| 2000000 | 10000  | 400.00      |
| 2000000 | 1000   | 42.80       |
Option 3 results:
| Rows    | Ranges | Time (secs) |
|---------|--------|-------------|
| 50000   | 100    | 0.22        |
| 150000  | 1100   | 3.92        |
| 250000  | 2100   | 10.68       |
| 350000  | 3100   | 20.44       |
| 450000  | 4100   | 33.45       |
| 550000  | 5100   | 49.26       |
| 650000  | 6100   | 70.19       |
| 750000  | 7100   | 87.79       |
| 850000  | 8100   | 116.37      |
| 950000  | 9100   | 143.00      |
| 2000000 | 10000  | 260.00      |
| 2000000 | 1000   | 31.40       |
Code to create the test df:
import random
import numpy as np
import pandas as pd

num_rows = 2000000
num_ranges = 10000
foo_values = [random.uniform(-100, 100) for i in range(num_rows)]
# each range id appears twice: an opening and a closing marker
boo_values = [i for i in range(num_ranges)]
boo_values.extend([i for i in range(num_ranges)])
# pad the remaining rows with NaN, then shuffle the marker positions
a = np.empty(num_rows - len(boo_values))
a[:] = np.nan
boo_values.extend(a)
random.shuffle(boo_values)
df = pd.DataFrame({"foo": foo_values, "boo": boo_values})