Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Use columns 1 and 2 to populate column 3

Tags:

python

pandas

I’m a Python newbie and have the following Pandas dataframe - I’m trying to write code that populates the ‘signal’ column as it is below:

Days long_entry_flag long_exit_flag signal
1 FALSE TRUE
2 FALSE FALSE
3 TRUE FALSE 1
4 TRUE FALSE 1
5 FALSE FALSE 1
6 TRUE FALSE 1
7 TRUE FALSE 1
8 FALSE TRUE
9 FALSE TRUE
10 TRUE FALSE 1
11 TRUE FALSE 1
12 TRUE FALSE 1
13 FALSE FALSE 1
14 FALSE TRUE
15 FALSE FALSE
16 FALSE TRUE
17 TRUE FALSE 1
18 TRUE FALSE 1
19 FALSE FALSE 1
20 FALSE FALSE 1
21 FALSE TRUE
22 FALSE FALSE
23 FALSE FALSE

My pseudocode version would take the following steps

  1. Look down the [‘long_entry_flag’] column until entry condition is True (day 3 initially)
  2. Then we enter ‘1’ into [‘signal’] column every day until exit condition is True [‘long_exit_flag’]==True on day 8
  3. Then we look back to [‘long_entry_flag’] column to wait for the next entry condition (occurs on day 10)
  4. And again we enter ‘1’ into [‘signal’] column every day until exit condition is True (day 14)
  5. etc.

What are some ways to populate the ‘signal’ column rapidly if possible (using vectorisation?)?

This is a subset of a large dataframe with tens of thousands of rows, and it is one of many dataframes being analysed in sequence.

like image 373
Baz Avatar asked Dec 11 '18 10:12

Baz


2 Answers

You can do

# Assuming we're starting from the "outside"
inside = False
for ix, row in df.iterrows():
    inside = (not row['long_exit_flag']
              if inside
              else row['long_entry_flag']
                  and not row['long_exit_flag']) # [True, True] case
    df.at[ix, 'signal'] = 1 if inside else np.nan

which is going to give you exactly the output you posted.


Being inspired by @jezrael's answer, I created a slightly more performant version of the above while still trying to keep it as neat as I could:

# Same assumption of starting from the "outside"
df.at[0, 'signal'] = df.at[0, 'long_entry_flag']
for ix in df.index[1:]:
    df.at[ix, 'signal'] = (not df.at[ix, 'long_exit_flag']
                           if df.at[ix - 1, 'signal']
                           else df.at[ix, 'long_entry_flag']
                               and not df.at[ix, 'long_exit_flag']) # [True, True] case

# Adjust to match the requested output exactly
df['signal'] = df['signal'].replace([True, False], [1, np.nan])
like image 119
ayorgo Avatar answered Sep 27 '22 18:09

ayorgo


For improving performance, use a Numba solution:

arr = df[['long_exit_flag','long_entry_flag']].values

@jit
def f(A):
    inside = False
    out = np.ones(len(A), dtype=float)
    for i in range(len(arr)):
        inside = not A[i, 0] if inside else A[i, 1]
        if not inside:
            out[i] = np.nan
    return out

df['signal'] = f(arr)

Performance:

#[21000 rows x 5 columns]
df = pd.concat([df] * 1000, ignore_index=True)

In [189]: %%timeit
     ...: inside = False
     ...: for ix, row in df.iterrows():
     ...:     inside = not row['long_exit_flag'] if inside else row['long_entry_flag']
     ...:     df.at[ix, 'signal'] = 1 if inside else np.nan
     ...:
1.58 s ± 9.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [190]: %%timeit
     ...: arr = df[['long_exit_flag','long_entry_flag']].values
     ...:
     ...: @jit
     ...: def f(A):
     ...:     inside = False
     ...:     out = np.ones(len(A), dtype=float)
     ...:     for i in range(len(arr)):
     ...:         inside = not A[i, 0] if inside else A[i, 1]
     ...:         if not inside:
     ...:             out[i] = np.nan
     ...:     return out
     ...:
     ...: df['signal'] = f(arr)
     ...:
171 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [200]: %%timeit
     ...: df['d'] = np.where(~df['long_exit_flag'],df['long_entry_flag'] | df['long_exit_flag'],np.nan)
     ...:
     ...: df['new_select']= np.where(df['d']==0, np.select([df['d'].shift()==0, df['d'].shift()==1],[1,1], np.nan), df['d'])
     ...:
2.4 ms ± 561 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

You can also use NumPy for shifting, and also @Dark code is simplifying it:

In [222]: %%timeit
     ...: d = np.where(~df['long_exit_flag'].values,  df['long_entry_flag'].values | df['long_exit_flag'].values, np.nan)
     ...: shifted = np.insert(d[:-1], 0, np.nan)
     ...: m = (shifted==0) | (shifted==1)
     ...: df['signal1']= np.select([d!=0, m], [d, 1], np.nan)
     ...:
590 µs ± 35.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

You can also check Does iterrows have performance issues? for the general order of precedence for performance of various operations in Pandas.

like image 34
jezrael Avatar answered Sep 27 '22 18:09

jezrael