I have a dataframe as follows:
Slot  Time  Last  Next
1     9:30        9:37
2     9:35  9:32  9:40
3     9:40  9:37  9:52
4     9:45  9:41  9:47
5     9:50  9:47  10:00
What I want to do here is to create two new columns 'min' and 'max', such that 'min' outputs the last possible slot with time < last; and 'max' outputs the last possible slot with time < next.
The desired output here should be:
df['min'] = [NaN,1,2,3,4]
and
df['max'] = [2,2,5,4,5]
I tried something along the lines of
    for index, row in df.iterrows():
        row['min'] = df[df['Time'] < row['Last']]['Slot']
but got an empty list. Any help is greatly appreciated. Thanks!
Firstly, I converted the time columns to datetime format. Otherwise the values are compared as strings, character by character, so for example '10:00' sorts before '9:50':
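    import numpy as np
    import pandas as pd

    # df_ is the original dataframe; the conversion is done on a copy
    df = df_.copy()
    df.loc[:, 'Time':'Next'] = (df.loc[:, 'Time':'Next']
                                .apply(pd.to_datetime, errors='coerce'))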
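To illustrate why the conversion matters, here is a minimal sketch comparing the same three values first as strings and then as parsed times. The sample series `times` is made up for the illustration, and `pd.to_timedelta` is used here instead of `pd.to_datetime` only to keep the example date-free.

```python
import pandas as pd

# Hypothetical sample values, not taken from the question's dataframe
times = pd.Series(['9:30', '9:50', '10:00'])

# As strings the comparison is lexicographic: '10:00' < '9:50' because '1' < '9'
as_strings = (times < '9:50').tolist()

# After parsing, the comparison is chronological
as_deltas = pd.to_timedelta(times + ':00')
as_times = (as_deltas < pd.Timedelta('0 days 09:50:00')).tolist()

print(as_strings)  # [True, False, True]
print(as_times)    # [True, False, False]
```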
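For the min column you can do:

    # For each 'Last' value, find the index label of the last row whose Time is earlier
    min_vals = [(df['Time'] < x)[::-1].idxmax()
                if any(df['Time'] < x) else np.nan for x in df['Last']]
    # reindex maps the NaN label to NaN; recent pandas versions raise a
    # KeyError when .loc is given a list containing a missing label
    df_['min'] = df['Slot'].reindex(min_vals).values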
And for the max:

    # Same lookup, this time against the 'Next' column
    max_vals = [(df['Time'] < x)[::-1].idxmax()
                if any(df['Time'] < x) else np.nan for x in df['Next']]
    df_['max'] = df['Slot'].reindex(max_vals).values
Which would give you:

    print(df_)

       Slot  Time  Last   Next  min  max
    0     1  9:30     -   9:37  NaN    2
    1     2  9:35  9:32   9:40  1.0    2
    2     3  9:40  9:37   9:52  2.0    5
    3     4  9:45  9:41   9:47  3.0    4
    4     5  9:50  9:47  10:00  4.0    5
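If the Time column is sorted in ascending order, as it is in the question's data, the same lookup could also be done without a Python-level loop over the rows. The following is only a sketch of that alternative using `np.searchsorted`, not the method from the answer above; the helpers `to_minutes` and `last_slot_before` are hypothetical names introduced for the illustration, and the sortedness of Time is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Slot': [1, 2, 3, 4, 5],
    'Time': ['9:30', '9:35', '9:40', '9:45', '9:50'],
    'Last': [None, '9:32', '9:37', '9:41', '9:47'],
    'Next': ['9:37', '9:40', '9:52', '9:47', '10:00'],
})

def to_minutes(value):
    # 'H:MM' -> minutes since midnight; anything that is not a string becomes NaN
    if not isinstance(value, str):
        return np.nan
    hours, minutes = value.split(':')
    return int(hours) * 60 + int(minutes)

# Assumption: times is sorted ascending, which searchsorted requires
times = df['Time'].map(to_minutes).to_numpy(dtype=float)
slots = df['Slot'].to_numpy()

def last_slot_before(targets):
    # With side='left', pos is the number of times strictly less than each target
    pos = np.searchsorted(times, targets, side='left')
    valid = ~np.isnan(targets) & (pos > 0)
    out = np.full(len(targets), np.nan)
    out[valid] = slots[pos[valid] - 1]
    return out

df['min'] = last_slot_before(df['Last'].map(to_minutes).to_numpy(dtype=float))
df['max'] = last_slot_before(df['Next'].map(to_minutes).to_numpy(dtype=float))

print(df['min'].tolist())  # [nan, 1.0, 2.0, 3.0, 4.0]
print(df['max'].tolist())  # [2.0, 2.0, 5.0, 4.0, 5.0]
```

Because each lookup is a binary search, this scales roughly as n log n rather than n squared, but it gives wrong answers if Time is not sorted.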
I tried this,

    x = []
    y = []
    for index, row in df.iterrows():
        # Slots whose Time is earlier than this row's Last / Next
        t = df[df['Time'] < row['Last']]['Slot'].values
        s = df[df['Time'] < row['Next']]['Slot'].values
        # Keep the last matching slot, or NaN when nothing matches
        if len(t) == 0:
            x.append(np.nan)
        else:
            x.append(t[-1])
        if len(s) == 0:
            y.append(np.nan)
        else:
            y.append(s[-1])
    df['min'] = x
    df['max'] = y
    print(df)
Output:

       Slot                Time                Last                Next  min  max
    0     1 2018-11-30 09:30:00                 NaT 2018-11-30 09:37:00  NaN    2
    1     2 2018-11-30 09:35:00 2018-11-30 09:32:00 2018-11-30 09:40:00  1.0    2
    2     3 2018-11-30 09:40:00 2018-11-30 09:37:00 2018-11-30 09:52:00  2.0    5
    3     4 2018-11-30 09:45:00 2018-11-30 09:41:00 2018-11-30 09:47:00  3.0    4
    4     5 2018-11-30 09:50:00 2018-11-30 09:47:00 2018-11-30 10:00:00  4.0    5
Note: This is not the idiomatic pandas way to solve it. Since you attempted a loop, I have shown how to make the loop approach work, but it will be slow on large dataframes.
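One detail worth spelling out about loops over `iterrows`: each `row` is a temporary Series built for that iteration, so adding a new key to it does not add a column to the dataframe. That is why results are better collected in a list and assigned once after the loop. A minimal sketch, using a made-up three-row frame and a placeholder computation (`Slot * 10`) purely for illustration:

```python
import pandas as pd

# Hypothetical toy frame, not the question's data
df = pd.DataFrame({'Slot': [1, 2, 3]})

for index, row in df.iterrows():
    row['min'] = row['Slot'] * 10  # only changes the temporary Series

lost = 'min' in df.columns  # the dataframe has no 'min' column

# Collect the values, then assign the whole column once
values = []
for index, row in df.iterrows():
    values.append(row['Slot'] * 10)
df['min'] = values

print(lost)                # False
print(df['min'].tolist())  # [10, 20, 30]
```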
This is an occasion when numba can be helpful in providing an efficient solution. This is an explicit for loop, but JIT-compiled for performance.
    import numpy as np
    import pandas as pd
    from numba import njit

    # convert to timedelta
    time_cols = ['Time','Last','Next']
    df[time_cols] = (df[time_cols] + ':00').apply(pd.to_timedelta)

    # define loopy algorithm
    @njit
    def get_idx(times, comps, slots):
        n = len(times)
        res = np.empty(n)
        for i in range(n):
            mycomp = comps[i]
            if mycomp != mycomp:  # True only when mycomp is NaN
                res[i] = np.nan
            else:
                # scan backwards for the last time strictly less than mycomp
                for j in range(n, 0, -1):
                    if times[j-1] < mycomp:
                        res[i] = slots[j-1]
                        break
                else:
                    # loop ended without a break: no earlier time exists
                    res[i] = np.nan
        return res

    # extract timedeltas as seconds
    arr = df[time_cols].apply(lambda x: x.dt.total_seconds()).values

    # apply logic
    df['min'] = get_idx(arr[:, 0], arr[:, 1], df['Slot'].values)
    df['max'] = get_idx(arr[:, 0], arr[:, 2], df['Slot'].values)
Result

    print(df)

       Slot     Time     Last     Next  min  max
    0     1 09:30:00      NaT 09:37:00  NaN  2.0
    1     2 09:35:00 09:32:00 09:40:00  1.0  2.0
    2     3 09:40:00 09:37:00 09:52:00  2.0  5.0
    3     4 09:45:00 09:41:00 09:47:00  3.0  4.0
    4     5 09:50:00 09:47:00 10:00:00  4.0  5.0
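The loop logic relies on two idioms that are easy to miss: NaN is the only float not equal to itself, and the `else` branch of a `for` loop runs only when the loop finishes without hitting `break`. The sketch below shows the same logic in plain Python, without the `@njit` decorator, so it can be stepped through in a debugger. The function name `get_idx_plain` and the hard-coded arrays (times as minutes since midnight) are assumptions made for the illustration.

```python
import numpy as np

def get_idx_plain(times, comps, slots):
    n = len(times)
    res = np.empty(n)
    for i in range(n):
        comp = comps[i]
        if comp != comp:  # NaN is the only value not equal to itself
            res[i] = np.nan
            continue
        # scan from the last row backwards for a time strictly less than comp
        for j in range(n - 1, -1, -1):
            if times[j] < comp:
                res[i] = slots[j]
                break
        else:
            # reached only when the inner loop never hit break
            res[i] = np.nan
    return res

# The question's data expressed as minutes since midnight
times = np.array([570.0, 575.0, 580.0, 585.0, 590.0])
last = np.array([np.nan, 572.0, 577.0, 581.0, 587.0])
nxt = np.array([577.0, 580.0, 592.0, 587.0, 600.0])
slots = np.array([1, 2, 3, 4, 5])

min_col = get_idx_plain(times, last, slots)
max_col = get_idx_plain(times, nxt, slots)

print(min_col.tolist())  # [nan, 1.0, 2.0, 3.0, 4.0]
print(max_col.tolist())  # [2.0, 2.0, 5.0, 4.0, 5.0]
```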
Performance benchmarking

You can see massive performance improvements for larger dataframes:

    def nix(df):
        min_vals = [(df['Time'] < x)[::-1].idxmax()
                    if any(df['Time'] < x) else np.nan for x in df['Last']]
        df['min'] = df.loc[min_vals,'Slot'].values
        max_vals = [(df['Time'] < x)[::-1].idxmax()
                    if any(df['Time'] < x) else np.nan for x in df['Next']]
        df.loc[:,'max'] = df.loc[max_vals,'Slot'].values
        return df

    def jpp(df):
        arr = df[time_cols].apply(lambda x: x.dt.total_seconds()).values
        df['min'] = get_idx(arr[:, 0], arr[:, 1], df['Slot'].values)
        df['max'] = get_idx(arr[:, 0], arr[:, 2], df['Slot'].values)
        return df

    df = pd.concat([df]*1000, ignore_index=True)

    %timeit nix(df.copy())  # 8.85 s per loop
    %timeit jpp(df.copy())  # 5.02 ms per loop
Related: Efficiently return the index of the first value satisfying condition in array.