 

Pandas - Find the lowest value in range defined by certain values?

I am struggling to find the lowest value within a range defined by values in another column. A range is always delimited by two equal values in the boo column (1-1, 2-2), as also shown in the image below. The values in the boo column are not known in advance (so I cannot make a list and compare against it), because they are calculated a few steps earlier in the code.

[Image: ranges explained]

Dataframe example

    foo           boo
15  36.377949      
16  42.489706     1
17  41.223734      
18  32.281779     0
19  22.888312     2   
20  12.847996      
21   6.876954      
22 -23.872935     1    
23 -31.858878         
24 -39.404905     3   
25 -47.724924     2
26 -4.8161051     3
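
For reproduction, the example frame can be built like this (values copied from the listing above; the empty boo cells are NaN):

import numpy as np
import pandas as pd

# example data from the listing above; empty boo cells become NaN
df = pd.DataFrame(
    {
        "foo": [36.377949, 42.489706, 41.223734, 32.281779, 22.888312,
                12.847996, 6.876954, -23.872935, -31.858878, -39.404905,
                -47.724924, -4.8161051],
        "boo": [np.nan, 1, np.nan, 0, 2, np.nan, np.nan, 1, np.nan, 3, 2, 3],
    },
    index=range(15, 27),
)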

The preferred output is a new dataframe column:

    foo           boo    min
15  36.377949      
16  42.489706     1
17  41.223734      
18  32.281779     0      
19  22.888312     2   
20  12.847996      
21   6.876954      
22 -23.872935     1      -23
23 -31.858878         
24 -39.404905     3   
25 -47.724924     2      -47
26 -4.8161051     3      -47

I know how to solve this with a basic for loop (without leveraging Pandas functions and their speed), so I would like to keep this at the dataframe/Pandas/Numpy level, if possible.
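
For reference, the baseline loop I mean looks roughly like this (a sketch of the idea, not my exact code):

import numpy as np

# Walk the rows: remember where each boo value is first seen; when it appears
# again, take the min of foo over the enclosed (inclusive) label slice.
first_seen = {}
df["min"] = np.nan
for idx, boo in df["boo"].dropna().items():
    if boo in first_seen:
        df.loc[idx, "min"] = df.loc[first_seen[boo]:idx, "foo"].min()
    else:
        first_seen[boo] = idx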

Is there a way to do it using Pandas/Numpy? Any comments, suggestions and help are appreciated!


EDIT

I have tried to implement both suggested methods (and they work for small datasets!), however the execution time is not good with larger ones. I use dataframes with 1.5-2.5 million rows, which would take "forever" (extrapolating from the growth of the execution times).

function1 is the one using find_min_in_range (from Pedro Rocha) and function2 is the one using the for loop (from mozway).

  • x axis is the number of rows in the dataframe
  • y axis is the execution time in seconds
  • I have tested from 10000 to 200000 rows
  • with my "usual" dataframe, none of the given solutions finished
  • using an adjusted func1, iterating over the whole df took 4-5 minutes

[Chart: func1 vs func2 vs func1 adjusted]

EDIT2

I've done another round of testing using the provided solutions. There seems to be a significant improvement.

Below are test results with more metrics (number of ranges).

  • Pedro's Option 1 and Option 3 seem to be the best performers
  • the number of ranges in our data samples seems to grow linearly
  • only Pedro's Option 3 was able to finish finding the min on our "whole" data sample (1725410 rows and 204954 ranges, in 172 seconds), but I am unsure whether it will work in our use case - because we are executing the script on 1 vCPU machines - but I am gonna try it ^^
  • here is the sample with one of the largest numbers of ranges in our data, if you would like to use our data (it can be loaded using the read_parquet() method - see the snippet after this list)
  • the number of ranges seems to always be less than 25 % of the total row count
  • ranges are always closed (range 0 is the exception and the only unclosed range)
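
Loading the sample is a one-liner (the filename here is a placeholder for the attached file):

import pandas as pd

# hypothetical filename - substitute the parquet file shared above
input_df = pd.read_parquet("sample_high_ranges.parquet")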

[Charts: lower number of ranges / higher number of ranges]

However, I am unable to reproduce Pedro's Option 2. Code below (copy-pasted from the generator and Option 2):

import random
import numpy as np
import pandas as pd

num_rows = 2000000
num_ranges = 10000

foo_values = [random.uniform(-100, 100) for i in range(num_rows)]
boo_values = [i for i in range(num_ranges)]
boo_values.extend([i for i in range(num_ranges)])
a = np.empty(num_rows - len(boo_values))
a[:] = np.nan
boo_values.extend(a)

random.shuffle(boo_values)

df = pd.DataFrame({"foo": foo_values, "boo": boo_values})

df["min"] = np.nan
idx = df[df.groupby('boo').cumcount() == 1].index
df.loc[idx,"min"] = df.loc[idx].apply(lambda row: df.loc[range(*df[df.boo == row["boo"]].index[[0,-1]]+[0,1]),"foo"].min(), axis=1)

Error

  File "/.../option2.py", line 96, in <lambda>
    df.loc[idx,"min"] = df.loc[idx].apply(lambda row: df.loc[range(*df[df.boo == row["boo"]].index[[0,-1]]+[0,1]),"foo"].min(), axis=1)
  File "/.../venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5069, in __getitem__
    result = getitem(key)
IndexError: index 0 is out of bounds for axis 0 with size 0

I am also thinking about closing this question and creating a new one focused on performance enhancements, with the provided data sample and a more precise description.

EDIT3

I am sharing the way I benchmark the different approaches.

  • using a lighter version of this as input data (lower number of ranges - approx. 200k)
  • measuring execution time using timeit
  • copy&pasting the results from the terminal into GSheet and creating a chart
from timeit import default_timer as timer
import pandas as pd
import numpy as np

# Skipped loading the dataframe; it can be mocked using read_parquet on the given example file

times = []
for i in range(10000, 210000, 10000):
  df = input_df[:i].copy()
  f = {}
  start = timer()
  # here is the code of one or more solution/s
  ...
  f['x'] = round(timer() - start, 1)

  times.append(f)

# Print out execution times
for t in times:
  print(t['x'])
asked Sep 01 '25 by Filip Niko

2 Answers

You cannot fully vectorize this as the rows are overlapping, but you can limit the loop to the unique values in boo:

for v in df['boo'].dropna().unique():
    m = df['boo'].eq(v)
    # keep the rows between (and including) the two occurrences of v
    s = df.loc[m.cummax() & m[::-1].cummax(), 'foo']
    if len(s) > 1:  # skip open ranges with a single occurrence
        df.loc[s.index[-1], 'min'] = s.min()
print(df)

Output:

          foo  boo        min
15  36.377949  NaN        NaN
16  42.489706  1.0        NaN
17  41.223734  NaN        NaN
18  32.281779  0.0        NaN
19  22.888312  2.0        NaN
20  12.847996  NaN        NaN
21   6.876954  NaN        NaN
22 -23.872935  1.0 -23.872935
23 -31.858878  NaN        NaN
24 -39.404905  3.0        NaN
25 -47.724924  2.0 -47.724924
26  -4.816105  3.0 -47.724924
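
To see why the mask works, here is the boolean trick in isolation (illustrative values):

import pandas as pd

m = pd.Series([False, True, False, False, True, False])
# cummax() is True from the first True onward; on the reversed series it is
# True up to (and including) the last True, so ANDing keeps the closed span.
print((m.cummax() & m[::-1].cummax()).tolist())
# [False, True, True, True, True, False]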
answered Sep 04 '25 by mozway


This code first finds the indices of the occurrences; if the value in the "boo" column is NaN or is the first occurrence, it returns NaN, otherwise it returns the min value in the range.

Option 1:

import numpy as np
import pandas as pd

def find_min_in_range(row):
    # index positions of both occurrences of this row's boo value
    idx = df[df.boo == row["boo"]].index
    if row.name == idx[0]:
        # first occurrence: the range is not closed
        return np.nan
    else:
        # idx + [0, 1] turns [first, last] into range(first, last + 1)
        return df.loc[range(*(idx + [0, 1])), "foo"].min()

df["min"] = np.nan
# for each boo value, the index of its last occurrence (the row closing the range)
idx = [df.boo[df.boo == i].index.max() for i in df.boo.dropna().unique()]
df.loc[idx, "min"] = df.loc[idx].apply(find_min_in_range, axis=1)

Option 2: find idx using groupby + cumcount, also using a lambda function. Better performance if you have a lot of open ranges (a single occurrence in the "boo" column, like range 0 in the example).

import pandas as pd
import numpy as np

df["min"] = np.nan
idx = df[df.groupby('boo').cumcount() == 1].index
df.loc[idx,"min"] = df.loc[idx].apply(lambda row: df.loc[range(*df[df.boo == row["boo"]].index[[0,-1]]+[0,1]),"foo"].min(), axis=1)

Option 3: using OOP + threading. Improves performance for large datasets.

import pandas as pd
import numpy as np
from threading import Thread

class MinFinder():
    def __init__(self, df):
        self.df = df
        self.range_idx = self.get_range_idx()
        self.min_values = {}

    def get_range_idx(self):
        # map each boo value to the row indices of its occurrences
        df_tmp = self.df.reset_index().dropna(subset="boo")
        df_tmp.boo = df_tmp.boo.astype(int)
        df_tmp = df_tmp.set_index("boo")
        return df_tmp.groupby(df_tmp.index)["index"].unique().to_dict()

    def find_min_in_range(self, r):
        idx = self.range_idx[int(r)]
        # store the min of the closed range under its closing row index
        self.min_values[idx.max()] = self.df.loc[
            range(*(self.df[self.df.boo == r].index[[0, -1]] + [0, 1])), "foo"
        ].min()

finder = MinFinder(df)

# one thread per closed range (boo values with a second occurrence)
t = []
for r in df[df.groupby('boo').cumcount() == 1]["boo"].values:
    t.append(Thread(target=finder.find_min_in_range, args=(r,)))
[thread.start() for thread in t]
[thread.join() for thread in t]

df["min"] = np.nan
df.loc[list(finder.min_values.keys()), "min"] = list(finder.min_values.values())
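
A note on the design: all threads share the same dataframe, and each thread writes a distinct key into min_values, so the collection step is safe under the GIL. Since pandas holds the GIL for much of its work, the gain over Option 1 likely comes from overlapping the per-range lookups rather than from full parallelism, and may shrink on a 1 vCPU machine.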

Output:

      foo       boo   min
0     36.3779   nan   nan
1     42.4897   1     nan
2     41.2237   nan   nan
3     32.2818   0     nan
4     22.8883   2     nan
5     12.848    nan   nan
6     6.87695   nan   nan
7    -23.8729   1     -23.8729
8    -31.8589   nan   nan
9    -39.4049   3     nan
10   -47.7249   2     -47.7249
11   -4.81611   3     -47.7249

Testing:

Option 1 summary: I've tested the code with 2M rows and 10k ranges and the test finished after 6:40 min, while for 2M rows and 1k ranges it takes 42.80 secs.

Option 3 summary: I've tested the code with 2M rows and 10k ranges and the test finished after 4:20 min, while for 2M rows and 1k ranges it takes 31.40 secs.

Tested on a MacBook Pro, M1 Pro, 32 GB.

Option 1 results:

Rows      Ranges   Time (secs)
50000     100      0.13
150000    1100     3.70
250000    2100     10.20
350000    3100     22.00
450000    4100     38.00
550000    5100     58.20
650000    6100     84.00
750000    7100     114.50
850000    8100     145.16
950000    9100     180.70
2000000   10000    400.00
2000000   1000     42.80

Option 3 results:

Rows      Ranges   Time (secs)
50000     100      0.22
150000    1100     3.92
250000    2100     10.68
350000    3100     20.44
450000    4100     33.45
550000    5100     49.26
650000    6100     70.19
750000    7100     87.79
850000    8100     116.37
950000    9100     143.00
2000000   10000    260.00
2000000   1000     31.40

Code to create the test df:

import random
import numpy as np
import pandas as pd

num_rows = 2000000
num_ranges = 10000

foo_values = [random.uniform(-100, 100) for i in range(num_rows)]
boo_values = [i for i in range(num_ranges)]
boo_values.extend([i for i in range(num_ranges)])
a = np.empty(num_rows - len(boo_values))
a[:] = np.nan
boo_values.extend(a)

random.shuffle(boo_values)

df = pd.DataFrame({"foo": foo_values, "boo": boo_values})
answered Sep 04 '25 by rocha-p