I’m a Python newbie and have the following Pandas dataframe - I’m trying to write code that populates the ‘signal’ column as it is below:
Days | long_entry_flag | long_exit_flag | signal |
---|---|---|---|
1 | FALSE | TRUE | |
2 | FALSE | FALSE | |
3 | TRUE | FALSE | 1 |
4 | TRUE | FALSE | 1 |
5 | FALSE | FALSE | 1 |
6 | TRUE | FALSE | 1 |
7 | TRUE | FALSE | 1 |
8 | FALSE | TRUE | |
9 | FALSE | TRUE | |
10 | TRUE | FALSE | 1 |
11 | TRUE | FALSE | 1 |
12 | TRUE | FALSE | 1 |
13 | FALSE | FALSE | 1 |
14 | FALSE | TRUE | |
15 | FALSE | FALSE | |
16 | FALSE | TRUE | |
17 | TRUE | FALSE | 1 |
18 | TRUE | FALSE | 1 |
19 | FALSE | FALSE | 1 |
20 | FALSE | FALSE | 1 |
21 | FALSE | TRUE | |
22 | FALSE | FALSE | |
23 | FALSE | FALSE |
My pseudocode version would take the following steps
What are some ways to populate the ‘signal’ column rapidly if possible (using vectorisation?)?
This is a subset of a large dataframe with tens of thousands of rows, and it is one of many dataframes being analysed in sequence.
You can do
# Assuming we're starting from the "outside"
inside = False
for ix, row in df.iterrows():
inside = (not row['long_exit_flag']
if inside
else row['long_entry_flag']
and not row['long_exit_flag']) # [True, True] case
df.at[ix, 'signal'] = 1 if inside else np.nan
which is going to give you exactly the output you posted.
Being inspired by @jezrael's answer, I created a slightly more performant version of the above while still trying to keep it as neat as I could:
# Same assumption of starting from the "outside"
df.at[0, 'signal'] = df.at[0, 'long_entry_flag']
for ix in df.index[1:]:
df.at[ix, 'signal'] = (not df.at[ix, 'long_exit_flag']
if df.at[ix - 1, 'signal']
else df.at[ix, 'long_entry_flag']
and not df.at[ix, 'long_exit_flag']) # [True, True] case
# Adjust to match the requested output exactly
df['signal'] = df['signal'].replace([True, False], [1, np.nan])
For improving performance, use a Numba solution:
arr = df[['long_exit_flag','long_entry_flag']].values
@jit
def f(A):
inside = False
out = np.ones(len(A), dtype=float)
for i in range(len(arr)):
inside = not A[i, 0] if inside else A[i, 1]
if not inside:
out[i] = np.nan
return out
df['signal'] = f(arr)
Performance:
#[21000 rows x 5 columns]
df = pd.concat([df] * 1000, ignore_index=True)
In [189]: %%timeit
...: inside = False
...: for ix, row in df.iterrows():
...: inside = not row['long_exit_flag'] if inside else row['long_entry_flag']
...: df.at[ix, 'signal'] = 1 if inside else np.nan
...:
1.58 s ± 9.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [190]: %%timeit
...: arr = df[['long_exit_flag','long_entry_flag']].values
...:
...: @jit
...: def f(A):
...: inside = False
...: out = np.ones(len(A), dtype=float)
...: for i in range(len(arr)):
...: inside = not A[i, 0] if inside else A[i, 1]
...: if not inside:
...: out[i] = np.nan
...: return out
...:
...: df['signal'] = f(arr)
...:
171 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [200]: %%timeit
...: df['d'] = np.where(~df['long_exit_flag'],df['long_entry_flag'] | df['long_exit_flag'],np.nan)
...:
...: df['new_select']= np.where(df['d']==0, np.select([df['d'].shift()==0, df['d'].shift()==1],[1,1], np.nan), df['d'])
...:
2.4 ms ± 561 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can also use NumPy for shifting, and also @Dark code is simplifying it:
In [222]: %%timeit
...: d = np.where(~df['long_exit_flag'].values, df['long_entry_flag'].values | df['long_exit_flag'].values, np.nan)
...: shifted = np.insert(d[:-1], 0, np.nan)
...: m = (shifted==0) | (shifted==1)
...: df['signal1']= np.select([d!=0, m], [d, 1], np.nan)
...:
590 µs ± 35.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can also check Does iterrows have performance issues? for the general order of precedence for performance of various operations in Pandas.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With