Finding last possible index value to satisfy filtering requirements

I have a dataframe as follows:

Slot      Time     Last      Next
1         9:30               9:37
2         9:35     9:32      9:40
3         9:40     9:37      9:52
4         9:45     9:41      9:47
5         9:50     9:47      10:00

I want to create two new columns, 'min' and 'max'. For each row, 'min' should hold the last slot whose Time is earlier than that row's Last, and 'max' should hold the last slot whose Time is earlier than that row's Next.

The desired output here should be:

df['min'] = [NaN,1,2,3,4]

and

df['max'] = [2,2,5,4,5]

I tried something along the lines of

for index, row in df.iterrows():
    row['min'] = df[df['Time'] < row['Last']]['Slot']

but got an empty list. Any help is greatly appreciated. Thanks!

Adrian Y asked Nov 30 '18 08:11



3 Answers

First, I converted the time columns to datetime. Otherwise the values are compared as strings, character by character, so '10:00' would sort before '9:37':

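import numpy as np
import pandas as pd

df = df_.copy()  # df_ is the original dataframe; work on a copy
time_cols = ['Time', 'Last', 'Next']
df[time_cols] = df[time_cols].apply(pd.to_datetime, errors='coerce')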
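To illustrate why the conversion matters, here is a minimal sketch (the variable names are illustrative and not part of the answer). It compares the same two times once as raw strings and once as parsed timedeltas:

```python
import pandas as pd

# As plain strings the comparison is lexicographic: '1' < '9', so '10:00' sorts first
as_strings = '10:00' < '9:37'

# Parsed as timedeltas, the values compare in chronological order
as_times = pd.to_timedelta('10:00:00') < pd.to_timedelta('9:37:00')

print(as_strings, as_times)
```

The string comparison claims 10:00 is earlier than 9:37, while the parsed comparison does not.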

For the min column you can do:

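# for each 'Last' value, the index label of the last row with an earlier Time
min_vals = [(df['Time'] < x)[::-1].idxmax()
            if (df['Time'] < x).any() else np.nan for x in df['Last']]
# reindex rather than .loc: recent pandas raises KeyError on the NaN labels
df_['min'] = df['Slot'].reindex(min_vals).values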

And for the max column:

max_vals = [(df['Time'] < x)[::-1].idxmax()
            if (df['Time'] < x).any() else np.nan for x in df['Next']]
df_['max'] = df['Slot'].reindex(max_vals).values

Which would give you:

print(df_)

    Slot  Time  Last   Next  min  max
0     1  9:30     -   9:37  NaN    2
1     2  9:35  9:32   9:40  1.0    2
2     3  9:40  9:37   9:52  2.0    5
3     4  9:45  9:41   9:47  3.0    4
4     5  9:50  9:47  10:00  4.0    5
yatu answered Oct 29 '22 06:10


I tried this,

x = []
y = []
for index, row in df.iterrows():
    # slots whose Time is strictly earlier than this row's Last / Next
    t = df[df['Time'] < row['Last']]['Slot'].values
    s = df[df['Time'] < row['Next']]['Slot'].values
    if len(t) == 0:
        x.append(np.nan)
    else:
        x.append(t[-1])

    if len(s) == 0:
        y.append(np.nan)
    else:
        y.append(s[-1])
df['min'] = x
df['max'] = y
print(df)

Output:

   Slot                Time                Last                Next  min  max
0     1 2018-11-30 09:30:00                 NaT 2018-11-30 09:37:00  NaN    2
1     2 2018-11-30 09:35:00 2018-11-30 09:32:00 2018-11-30 09:40:00  1.0    2
2     3 2018-11-30 09:40:00 2018-11-30 09:37:00 2018-11-30 09:52:00  2.0    5
3     4 2018-11-30 09:45:00 2018-11-30 09:41:00 2018-11-30 09:47:00  3.0    4
4     5 2018-11-30 09:50:00 2018-11-30 09:47:00 2018-11-30 10:00:00  4.0    5

Note: This is not the idiomatic pandas way to solve it. Since you attempted a loop, I kept the loop-based approach to show the idea; expect it to be slow on large dataframes.
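For comparison, a loop-free version can be sketched with `np.searchsorted`. This is a different technique from the loop above, and it rests on one assumption that holds in the sample data but is not stated in the question: the Time column is sorted in ascending order. The helper name `last_slot_before` is hypothetical.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the blank 'Last' entry becomes NaT
df = pd.DataFrame({
    'Slot': [1, 2, 3, 4, 5],
    'Time': ['9:30:00', '9:35:00', '9:40:00', '9:45:00', '9:50:00'],
    'Last': [None, '9:32:00', '9:37:00', '9:41:00', '9:47:00'],
    'Next': ['9:37:00', '9:40:00', '9:52:00', '9:47:00', '10:00:00'],
})
time_cols = ['Time', 'Last', 'Next']
for col in time_cols:
    df[col] = pd.to_timedelta(df[col])

def last_slot_before(times, bounds, slots):
    """Slot of the last time strictly earlier than each bound.

    Hypothetical helper; assumes `times` is sorted ascending.
    """
    # side='left' gives the number of entries strictly less than each bound
    counts = np.searchsorted(times, bounds, side='left')
    result = np.full(len(bounds), np.nan)
    # skip missing bounds and bounds with no earlier time
    valid = (counts > 0) & ~np.isnan(bounds)
    result[valid] = slots[counts[valid] - 1]
    return result

# work on plain float seconds; NaT becomes NaN
secs = df[time_cols].apply(lambda s: s.dt.total_seconds()).to_numpy()
slots = df['Slot'].to_numpy()
df['min'] = last_slot_before(secs[:, 0], secs[:, 1], slots)
df['max'] = last_slot_before(secs[:, 0], secs[:, 2], slots)
print(df[['Slot', 'min', 'max']])
```

If Time is not sorted, the binary search no longer corresponds to "last row in the dataframe", so this sketch would not apply as written.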

Mohamed Thasin ah answered Oct 29 '22 05:10


This is an occasion when numba can be helpful in providing an efficient solution. This is an explicit for loop, but JIT-compiled for performance.

import numpy as np
import pandas as pd
from numba import njit

# convert to timedelta
time_cols = ['Time','Last','Next']
df[time_cols] = (df[time_cols] + ':00').apply(pd.to_timedelta)

# define loopy algorithm
@njit
def get_idx(times, comps, slots):
    n = len(times)
    res = np.empty(n)
    for i in range(n):
        mycomp = comps[i]
        if mycomp != mycomp:  # NaN check: NaN is the only value not equal to itself
            res[i] = np.nan
        else:
            for j in range(n, 0, -1):
                if times[j-1] < mycomp:
                    res[i] = slots[j-1]
                    break
            else:
                res[i] = np.nan
    return res

# extract timedeltas as seconds    
arr = df[time_cols].apply(lambda x: x.dt.total_seconds()).values

# apply logic    
df['min'] = get_idx(arr[:, 0], arr[:, 1], df['Slot'].values)
df['max'] = get_idx(arr[:, 0], arr[:, 2], df['Slot'].values)

Result

print(df)

   Slot     Time     Last     Next  min  max
0     1 09:30:00      NaT 09:37:00  NaN  2.0
1     2 09:35:00 09:32:00 09:40:00  1.0  2.0
2     3 09:40:00 09:37:00 09:52:00  2.0  5.0
3     4 09:45:00 09:41:00 09:47:00  3.0  4.0
4     5 09:50:00 09:47:00 10:00:00  4.0  5.0

Performance benchmarking

On a larger dataframe (the sample repeated 1,000 times, so 5,000 rows), the numba version takes milliseconds where the list-comprehension approach takes seconds:

def nix(df):
    min_vals = [(df['Time'] < x)[::-1].idxmax()
                if any(df['Time'] < x) else np.nan for x in df['Last']]
    # reindex rather than .loc: recent pandas raises KeyError on the NaN labels
    df['min'] = df['Slot'].reindex(min_vals).values

    max_vals = [(df['Time'] < x)[::-1].idxmax()
                if any(df['Time'] < x) else np.nan for x in df['Next']]
    df['max'] = df['Slot'].reindex(max_vals).values
    return df

def jpp(df):
    arr = df[time_cols].apply(lambda x: x.dt.total_seconds()).values
    df['min'] = get_idx(arr[:, 0], arr[:, 1], df['Slot'].values)
    df['max'] = get_idx(arr[:, 0], arr[:, 2], df['Slot'].values)
    return df

df = pd.concat([df]*1000, ignore_index=True)

%timeit nix(df.copy())  # 8.85 s per loop
%timeit jpp(df.copy())  # 5.02 ms per loop
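If numba is not available, the same "last position with an earlier time" logic can be sketched in plain NumPy with broadcasting. This is an alternative I am adding for illustration, not part of the benchmark above; the helper name `last_slot_before_bc` is hypothetical, and the sample times are written as minutes after midnight to keep the block self-contained.

```python
import numpy as np

def last_slot_before_bc(times, bounds, slots):
    """Hypothetical helper: slot at the last position j with times[j] < bounds[i]."""
    # mask[i, j] is True where times[j] < bounds[i]; NaN bounds give all-False rows
    mask = times[None, :] < bounds[:, None]
    # argmax on the column-reversed mask locates the last True in each row
    last = mask.shape[1] - 1 - mask[:, ::-1].argmax(axis=1)
    # rows with no True at all get NaN instead of a slot
    return np.where(mask.any(axis=1), slots[last], np.nan)

# Times from the question as minutes after midnight (9:30 -> 570, 10:00 -> 600)
times = np.array([570., 575., 580., 585., 590.])
last_col = np.array([np.nan, 572., 577., 581., 587.])
next_col = np.array([577., 580., 592., 587., 600.])
slots = np.array([1, 2, 3, 4, 5])

min_vals = last_slot_before_bc(times, last_col, slots)
max_vals = last_slot_before_bc(times, next_col, slots)
print(min_vals, max_vals)
```

Unlike a binary search, this does not require Time to be sorted, but it builds an n-by-n boolean mask, so memory grows quadratically with the number of rows.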

Related: Efficiently return the index of the first value satisfying condition in array.

jpp answered Oct 29 '22 05:10