Use columns 1 and 2 to populate column 3

Question

I’m a Python newbie and have the following Pandas dataframe - I’m trying to write code that populates the ‘signal’ column as it is below:

Days	long_entry_flag	long_exit_flag	signal
1	FALSE	TRUE
2	FALSE	FALSE
3	TRUE	FALSE	1
4	TRUE	FALSE	1
5	FALSE	FALSE	1
6	TRUE	FALSE	1
7	TRUE	FALSE	1
8	FALSE	TRUE
9	FALSE	TRUE
10	TRUE	FALSE	1
11	TRUE	FALSE	1
12	TRUE	FALSE	1
13	FALSE	FALSE	1
14	FALSE	TRUE
15	FALSE	FALSE
16	FALSE	TRUE
17	TRUE	FALSE	1
18	TRUE	FALSE	1
19	FALSE	FALSE	1
20	FALSE	FALSE	1
21	FALSE	TRUE
22	FALSE	FALSE
23	FALSE	FALSE

My pseudocode version would take the following steps

Look down the [‘long_entry_flag’] column until entry condition is True (day 3 initially)
Then we enter ‘1’ into [‘signal’] column every day until exit condition is True [‘long_exit_flag’]==True on day 8
Then we look back to [‘long_entry_flag’] column to wait for the next entry condition (occurs on day 10)
And again we enter ‘1’ into [‘signal’] column every day until exit condition is True (day 14)
etc.

What are some ways to populate the ‘signal’ column rapidly if possible (using vectorisation?)?

This is a subset of a large dataframe with tens of thousands of rows, and it is one of many dataframes being analysed in sequence.

ayorgo · Accepted Answer

You can do

# Assuming we're starting from the "outside"
inside = False
for ix, row in df.iterrows():
    inside = (not row['long_exit_flag']
              if inside
              else row['long_entry_flag']
                  and not row['long_exit_flag']) # [True, True] case
    df.at[ix, 'signal'] = 1 if inside else np.nan

which is going to give you exactly the output you posted.

Being inspired by @jezrael's answer, I created a slightly more performant version of the above while still trying to keep it as neat as I could:

# Same assumption of starting from the "outside"
df.at[0, 'signal'] = df.at[0, 'long_entry_flag']
for ix in df.index[1:]:
    df.at[ix, 'signal'] = (not df.at[ix, 'long_exit_flag']
                           if df.at[ix - 1, 'signal']
                           else df.at[ix, 'long_entry_flag']
                               and not df.at[ix, 'long_exit_flag']) # [True, True] case

# Adjust to match the requested output exactly
df['signal'] = df['signal'].replace([True, False], [1, np.nan])

jezrael · Answer

For improving performance, use a Numba solution:

arr = df[['long_exit_flag','long_entry_flag']].values

@jit
def f(A):
    inside = False
    out = np.ones(len(A), dtype=float)
    for i in range(len(arr)):
        inside = not A[i, 0] if inside else A[i, 1]
        if not inside:
            out[i] = np.nan
    return out

df['signal'] = f(arr)

Performance:

#[21000 rows x 5 columns]
df = pd.concat([df] * 1000, ignore_index=True)

In [189]: %%timeit
     ...: inside = False
     ...: for ix, row in df.iterrows():
     ...:     inside = not row['long_exit_flag'] if inside else row['long_entry_flag']
     ...:     df.at[ix, 'signal'] = 1 if inside else np.nan
     ...:
1.58 s ± 9.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [190]: %%timeit
     ...: arr = df[['long_exit_flag','long_entry_flag']].values
     ...:
     ...: @jit
     ...: def f(A):
     ...:     inside = False
     ...:     out = np.ones(len(A), dtype=float)
     ...:     for i in range(len(arr)):
     ...:         inside = not A[i, 0] if inside else A[i, 1]
     ...:         if not inside:
     ...:             out[i] = np.nan
     ...:     return out
     ...:
     ...: df['signal'] = f(arr)
     ...:
171 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [200]: %%timeit
     ...: df['d'] = np.where(~df['long_exit_flag'],df['long_entry_flag'] | df['long_exit_flag'],np.nan)
     ...:
     ...: df['new_select']= np.where(df['d']==0, np.select([df['d'].shift()==0, df['d'].shift()==1],[1,1], np.nan), df['d'])
     ...:
2.4 ms ± 561 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

You can also use NumPy for shifting, and also @Dark code is simplifying it:

In [222]: %%timeit
     ...: d = np.where(~df['long_exit_flag'].values,  df['long_entry_flag'].values | df['long_exit_flag'].values, np.nan)
     ...: shifted = np.insert(d[:-1], 0, np.nan)
     ...: m = (shifted==0) | (shifted==1)
     ...: df['signal1']= np.select([d!=0, m], [d, 1], np.nan)
     ...:
590 µs ± 35.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

You can also check Does iterrows have performance issues? for the general order of precedence for performance of various operations in Pandas.

Use columns 1 and 2 to populate column 3

Tags:

python

pandas

Baz

2 Answers

ayorgo

jezrael

Recent Activity

Donate For Us

Use columns 1 and 2 to populate column 3

Tags:

python

pandas

Baz

2 Answers

ayorgo

jezrael

Related questions

Recent Activity

Donate For Us