I am struggling to find the lowest value within a range defined by values in another column. The range is always delimited by two identical values in the boo column (1-1, 2-2), as shown in the example below. The values in the boo column are not known in advance (so I cannot build a list and compare against it), because they are computed a few steps earlier in the code.
Dataframe example:
      foo         boo
15    36.377949
16    42.489706   1
17    41.223734
18    32.281779   0
19    22.888312   2
20    12.847996
21     6.876954
22   -23.872935   1
23   -31.858878
24   -39.404905   3
25   -47.724924   2
26    -4.8161051  3
The preferred output is a new dataframe column:
      foo         boo  min
15    36.377949
16    42.489706   1
17    41.223734
18    32.281779   0
19    22.888312   2
20    12.847996
21     6.876954
22   -23.872935   1    -23.872935
23   -31.858878
24   -39.404905   3
25   -47.724924   2    -47.724924
26    -4.8161051  3    -47.724924
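For convenience, the example input frame can be reconstructed like this (values copied from the tables above):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "foo": [36.377949, 42.489706, 41.223734, 32.281779, 22.888312,
                12.847996, 6.876954, -23.872935, -31.858878, -39.404905,
                -47.724924, -4.8161051],
        "boo": [np.nan, 1, np.nan, 0, 2, np.nan, np.nan, 1, np.nan, 3, 2, 3],
    },
    index=range(15, 27),
)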
I know how to solve this using a basic for loop (not leveraging Pandas functions and speed), so I would like to keep this at the dataframe/Pandas/Numpy level, if possible. Is there a way to do it using Pandas/Numpy? Any comments, suggestions and help are appreciated!
I have tried to implement both suggested methods (and they work for small data sets!), however the execution time is not good with larger datasets. I use dataframes with 1.5 - 2.5 million rows, which would take "forever" (judging by the growth of the execution times). function1 is the one using find_min_in_range (from Pedro Rocha) and function2 is the one using a for loop (from mozway).
[Chart: execution times for 10000 to 200000 rows; func1 adjusted; iterating the whole df took 4-5 minutes]
I've done another round of testing using the provided solutions. It seems like there is significant improvement.
Below are test results with more metrics (number of ranges).
One of the runs finished in 172 seconds [1725410 rows and 204954 ranges], but I am unsure whether it will work in our use case - because we are executing the script on 1 vCPU machines - but I am going to try it ^^ The number of ranges is always less than 25 % of the total row count.
[Charts: execution times with a lower number of ranges and with a higher number of ranges]
However, I am unable to reproduce Option 2 from Pedro. Code below (copy-paste of the generator and Option 2):
import random
import numpy as np
import pandas as pd

num_rows = 2000000
num_ranges = 10000
foo_values = [random.uniform(-100, 100) for i in range(num_rows)]
boo_values = [i for i in range(num_ranges)]
boo_values.extend([i for i in range(num_ranges)])
a = np.empty(num_rows - len(boo_values))
a[:] = np.nan
boo_values.extend(a)
random.shuffle(boo_values)
df = pd.DataFrame({"foo": foo_values, "boo": boo_values})

df["min"] = np.nan
idx = df[df.groupby('boo').cumcount() == 1].index
df.loc[idx,"min"] = df.loc[idx].apply(lambda row: df.loc[range(*df[df.boo == row["boo"]].index[[0,-1]]+[0,1]),"foo"].min(), axis=1)
Error:
File "/.../option2.py", line 96, in <lambda>
df.loc[idx,"min"] = df.loc[idx].apply(lambda row: df.loc[range(*df[df.boo == row["boo"]].index[[0,-1]]+[0,1]),"foo"].min(), axis=1)
File "/.../venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5069, in __getitem__
result = getitem(key)
IndexError: index 0 is out of bounds for axis 0 with size 0
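A quick check I would run here (a hypothetical diagnostic, not from the original post) is whether the lambda ever receives a boo value for which the equality lookup comes back empty, since an empty .index would raise exactly this IndexError:
# Hypothetical sanity check: count boo values selected by the cumcount
# filter whose equality lookup returns no matching rows at all.
bad = [b for b in df.loc[idx, "boo"] if df[df.boo == b].empty]
print(len(bad))  # anything > 0 would explain the empty-index failure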
Also, I am thinking about closing this question and creating a new one focused on performance enhancements, with the provided data sample and a more precise description.
I am sharing the way I benchmark the different approaches, using timeit:
from timeit import default_timer as timer
import pandas as pd
import numpy as np

# Skipped loading the dataframe; it can be mocked via read_parquet from the given example file
times = []
for i in range(10000, 210000, 10000):
    df = input_df[:i].copy()
    start = timer()
    # here goes the code of one or more solution/s
    ...
    f = {'x': round(timer() - start, 1)}
    times.append(f)

# Print out the execution times
for t in times:
    print(t['x'])
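Alternatively, for a fully self-contained run, input_df can be mocked with the same kind of generator Pedro uses in his answer below (the sizes here are assumptions for a quick benchmark, not my real data):
import random
import numpy as np
import pandas as pd

num_rows, num_ranges = 210000, 4000                    # assumed sizes, not the real dataset
boo_values = list(range(num_ranges)) * 2               # paired open/close range markers
boo_values += [np.nan] * (num_rows - len(boo_values))  # pad the remaining rows with NaN
random.shuffle(boo_values)
input_df = pd.DataFrame({
    "foo": [random.uniform(-100, 100) for _ in range(num_rows)],
    "boo": boo_values,
})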
You cannot fully vectorize this as the ranges can overlap, but you can limit the loop to the unique values in boo:
for v in df['boo'].dropna().unique():
    m = df['boo'].eq(v)
    # keep rows from the first occurrence of v (m.cummax) up to the
    # last occurrence (cummax on the reversed mask, realigned by index)
    s = df.loc[m.cummax() & m[::-1].cummax(), 'foo']
    if len(s) > 1:  # skip open ranges with a single marker
        df.loc[s.index[-1], 'min'] = s.min()
print(df)
Output:
foo boo min
15 36.377949 NaN NaN
16 42.489706 1.0 NaN
17 41.223734 NaN NaN
18 32.281779 0.0 NaN
19 22.888312 2.0 NaN
20 12.847996 NaN NaN
21 6.876954 NaN NaN
22 -23.872935 1.0 -23.872935
23 -31.858878 NaN NaN
24 -39.404905 3.0 NaN
25 -47.724924 2.0 -47.724924
26 -4.816105 3.0 -47.724924
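To make the masking step concrete, here is what it produces for v = 1 on the example column (a small sketch using a 0-based index; the intermediate values follow directly from the code above):
import pandas as pd

boo = pd.Series([None, 1, None, 0, 2, None, None, 1, None, 3, 2, 3])
m = boo.eq(1)
# m.cummax() turns True at the first 1 and stays True; the reversed cummax,
# realigned by index, is True up to and including the last 1. Their AND
# marks exactly the closed span between the two markers.
mask = m.cummax() & m[::-1].cummax()
print(mask.tolist())
# [False, True, True, True, True, True, True, True, False, False, False, False]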
This code first finds the indices of the occurrences: if the value in the "boo" column is NaN or is the first occurrence, it returns NaN; otherwise it returns the min value in the range.
Option 1:
import numpy as np
import pandas as pd

def find_min_in_range(row):
    # indices of the first and last occurrence of this boo value
    idx = df[df.boo == row["boo"]].index
    if row.name == idx[0]:
        return np.nan
    else:
        # idx + [0, 1] shifts only the end bound, giving range(first, last + 1)
        return df.loc[range(*(idx) + [0, 1]), "foo"].min()

df["min"] = np.nan
idx = [df.boo[df.boo == i].index.max() for i in df.boo.dropna().unique()]
df.loc[idx, "min"] = df.loc[idx].apply(find_min_in_range, axis=1)
Option 2: find idx using groupby + cumcount, also using a lambda function. Better performance if you have a lot of open ranges (a single occurrence in the "boo" column, like range 0 in the example):
import pandas as pd
import numpy as np

df["min"] = np.nan
# the second occurrence of each boo value is the row that closes a range
idx = df[df.groupby('boo').cumcount() == 1].index
df.loc[idx, "min"] = df.loc[idx].apply(
    lambda row: df.loc[range(*df[df.boo == row["boo"]].index[[0, -1]] + [0, 1]), "foo"].min(),
    axis=1,
)
Option 3: using OOP + threading (note: the workers are threading.Thread instances, not separate processes). Improves performance for large datasets:
import pandas as pd
import numpy as np
from threading import Thread

class MinFinder():
    def __init__(self, df):
        self.df = df
        self.range_idx = self.get_range_idx()
        self.min_values = {}

    def get_range_idx(self):
        # map each boo value to the array of row indices where it occurs
        df_tmp = self.df.reset_index().dropna(subset="boo")
        df_tmp.boo = df_tmp.boo.astype(int)
        df_tmp = df_tmp.set_index("boo")
        return df_tmp.groupby(df_tmp.index)["index"].unique().to_dict()

    def find_min_in_range(self, r):
        idx = self.range_idx[int(r)]
        # store the min of the closed span under the index of the closing row
        self.min_values[idx.max()] = self.df.loc[range(*df[df.boo == r].index[[0, -1]] + [0, 1]), "foo"].min()

finder = MinFinder(df)
t = []
for r in df[df.groupby('boo').cumcount() == 1]["boo"].values:
    t.append(Thread(target=finder.find_min_in_range, args=(r,)))
[thread.start() for thread in t]
[thread.join() for thread in t]
df["min"] = np.nan
df.loc[finder.min_values.keys(), "min"] = [v for v in finder.min_values.values()]
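A caveat worth noting (my observation, not from the answer itself): since these workers are threads within one interpreter, they share the GIL, so the speedup comes from overlapping pandas work rather than true parallelism, and on the 1 vCPU machines mentioned above the gain may largely disappear.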
Output:
|    | foo      | boo | min      |
|----|----------|-----|----------|
| 0  | 36.3779  | nan | nan      |
| 1  | 42.4897  | 1   | nan      |
| 2  | 41.2237  | nan | nan      |
| 3  | 32.2818  | 0   | nan      |
| 4  | 22.8883  | 2   | nan      |
| 5  | 12.848   | nan | nan      |
| 6  | 6.87695  | nan | nan      |
| 7  | -23.8729 | 1   | -23.8729 |
| 8  | -31.8589 | nan | nan      |
| 9  | -39.4049 | 3   | nan      |
| 10 | -47.7249 | 2   | -47.7249 |
| 11 | -4.81611 | 3   | -47.7249 |
Testing:
Option 1 summary: I've tested the code with 2M rows and 10k ranges and the test finished after 6:40 min, while for 2M rows and 1k ranges it takes 42.80 secs.
Option 3 summary: I've tested the code with 2M rows and 10k ranges and the test finished after 4:20 min, while for 2M rows and 1k ranges it takes 31.40 secs.
Tested on a MacBook Pro, M1 Pro, 32 GB.
Option 1 results:
| Rows    | Ranges | Time (secs) |
|---------|--------|-------------|
| 50000   | 100    | 0.13        |
| 150000  | 1100   | 3.70        |
| 250000  | 2100   | 10.20       |
| 350000  | 3100   | 22.00       |
| 450000  | 4100   | 38.00       |
| 550000  | 5100   | 58.20       |
| 650000  | 6100   | 84.00       |
| 750000  | 7100   | 114.50      |
| 850000  | 8100   | 145.16      |
| 950000  | 9100   | 180.70      |
| 2000000 | 10000  | 400.00      |
| 2000000 | 1000   | 42.80       |
Option 3 results:
| Rows    | Ranges | Time (secs) |
|---------|--------|-------------|
| 50000   | 100    | 0.22        |
| 150000  | 1100   | 3.92        |
| 250000  | 2100   | 10.68       |
| 350000  | 3100   | 20.44       |
| 450000  | 4100   | 33.45       |
| 550000  | 5100   | 49.26       |
| 650000  | 6100   | 70.19       |
| 750000  | 7100   | 87.79       |
| 850000  | 8100   | 116.37      |
| 950000  | 9100   | 143.00      |
| 2000000 | 10000  | 260.00      |
| 2000000 | 1000   | 31.40       |
Code to create the test df:
import random
import numpy as np
import pandas as pd

num_rows = 2000000
num_ranges = 10000
foo_values = [random.uniform(-100, 100) for i in range(num_rows)]
# each range id appears twice: an opening and a closing marker
boo_values = [i for i in range(num_ranges)]
boo_values.extend([i for i in range(num_ranges)])
# pad the remaining rows with NaN, then shuffle the marker positions
a = np.empty(num_rows - len(boo_values))
a[:] = np.nan
boo_values.extend(a)
random.shuffle(boo_values)
df = pd.DataFrame({"foo": foo_values, "boo": boo_values})