filter for specific sequences involving multiple columns and surrounding rows

Question

I have data that looks like this:

enter image description here

It's standard financial price data (open, high, low, close).

In addition, I run some calculations. 'major_check' occasionally returns 1 or 2 (which 'minor_check' will then also return). 'minor_check' also returns 1 or 2, but more frequently. the rest is filled with 0 or NaN.

I'd like to test for specific patterns:

Whenever there is a 2 in 'major_check', I want to see if I can find a 21212 pattern in 'minor_check', with 21 preceding the central 2 and 12 following it.
If there is a 1 in 'major_check', I'd like to find a 12121 pattern in 'minor_check'

I highlighted a 21212 pattern in the screenshot to give a better idea on what I am looking for.

Once the 21212 or 12121 patterns are found, I'll check if specific rules applied on open/high/low/close (corresponding to the 5 rows constituting the pattern) are met or not.

Of course, one could naively iterate through the dataframe but this doesn't sound like the Pythonic way to do it.

I didn't manage to find a good way to do this, since a 21212 pattern can have some 0s inside it

OCa · Accepted Answer

As this answer by Timeless looked surprisingly complex, here is a quite simpler one.

Method:

Temporarily remove rows that are empty of test results (effectively skipping NaN and None),
Search for patterns
- either with numpy.where and pandas.shift to check for patterns row-wise (faster),
- or preferrably with pandas.rolling -probably faster, more compact, but still readable.
Finally report the findings into the original df.

You haven't specified how to flag the findings. Here they get marked as a True in two new columns, one for each pattern, appended to the original dataframe, for whatever use you would plan for them. They are called "hit1" and "hit2".

Input data
No text input data in your post, so until then I came up with my own. It is designed to produce one hit for each pattern:

"minor test" 12121 + "major test" 1 at index 14
"minor test" 21212 + "major test" 2 at index 12

import pandas as pd
import numpy as np

# Start dataframe
df = pd.DataFrame({'minor': [None,2,   1,   None,None,2,   None,2,   None,None,None,
                             1,   2,   None,1,   None,None,2,   None,1,   2,   None,None],
                   'major': [None,None,None,None,None,2,   None,None,None,None,None,
                             None,2,   None,1,   None,None,None,   None,None,None,None,None]})
df

    minor  major
0     NaN    NaN
1     2.0    NaN
2     1.0    NaN
3     NaN    NaN
4     NaN    NaN
5     2.0    2.0
6     NaN    NaN
7     2.0    NaN
8     NaN    NaN
9     NaN    NaN
10    NaN    NaN
11    1.0    NaN
12    2.0    2.0
13    NaN    NaN
14    1.0    1.0
15    NaN    NaN
16    NaN    NaN
17    2.0    NaN
18    NaN    NaN
19    1.0    NaN
20    2.0    NaN
21    NaN    NaN
22    NaN    NaN

Locate hits
Skip rows without test results

# Remove rows without minor test result
df1 = df.dropna(axis=0,subset='minor').copy()
# No reset_index because we'll use it to report back to df.

# Patterns of 'minor test' 
minor_pat1 = [1,2,1,2,1]
minor_pat2 = [2,1,2,1,2]

Pattern search:

alternative 1: .shift()and np.where

# Deploy shift columns, looking 2 values backwards and 2 forward
for i in range(-2,3):              # i in [-2,-1, 0, 1, 2]
    df1[i] = df1['minor'].shift(i) # create a column named i

# Test for both patterns and major test value
df1['hit1'] = np.where(df1['major']==1, # case 12121
                       df1[list(range(-2,3))].eq(minor_pat1).all(axis=1),
                       False)
df1['hit2'] = np.where(df1['major']==2, # case 21212
                       df1[list(range(-2,3))].eq(minor_pat2).all(axis=1),
                       False)
df1 # Temporary dataframe with findings located

    minor  major   -2   -1    0    1    2   hit1   hit2
1     2.0    NaN  2.0  1.0  2.0  NaN  NaN  False  False
2     1.0    NaN  2.0  2.0  1.0  2.0  NaN  False  False
5     2.0    2.0  1.0  2.0  2.0  1.0  2.0  False  False
7     2.0    NaN  2.0  1.0  2.0  2.0  1.0  False  False
11    1.0    NaN  1.0  2.0  1.0  2.0  2.0  False  False
12    2.0    2.0  2.0  1.0  2.0  1.0  2.0  False   True
14    1.0    1.0  1.0  2.0  1.0  2.0  1.0   True  False
17    2.0    NaN  2.0  1.0  2.0  1.0  2.0  False  False
19    1.0    NaN  NaN  2.0  1.0  2.0  1.0  False  False
20    2.0    NaN  NaN  NaN  2.0  1.0  2.0  False  False

alternative 2: one-liner with rolling, preferred:

.rolling() was designed for that purpose exactly.
Just too bad they haven't implemented .rolling().eq() yet (list of window functions).
This is why we must resort to apply .eq() from inside a lambda function.

df1['hit1'] = (df1['major']==1) & (df1['minor']
                                   .rolling(window=5, center=True)
                                   .apply(lambda x : x.eq(minor_pat1).all()))

df1['hit2'] = (df1['major']==2) & (df1['minor']
                                   .rolling(window=5, center=True)
                                   .apply(lambda x : x.eq(minor_pat2).all()))

    minor  major   hit1   hit2
1     2.0    NaN  False  False
2     1.0    NaN  False  False
5     2.0    2.0  False  False
7     2.0    NaN  False  False
11    1.0    NaN  False  False
12    2.0    2.0  False   True
14    1.0    1.0   True  False
17    2.0    NaN  False  False
19    1.0    NaN  False  False
20    2.0    NaN  False  False

Finally report back to original df

df.loc[df1.index,'hit1'] = df1['hit1']
df.loc[df1.index,'hit2'] = df1['hit2']
df

    minor  major   hit1   hit2
0     NaN    NaN    NaN    NaN
1     2.0    NaN  False  False
2     1.0    NaN  False  False
3     NaN    NaN    NaN    NaN
4     NaN    NaN    NaN    NaN
5     2.0    2.0  False  False
6     NaN    NaN    NaN    NaN
7     2.0    NaN  False  False
8     NaN    NaN    NaN    NaN
9     NaN    NaN    NaN    NaN
10    NaN    NaN    NaN    NaN
11    1.0    NaN  False  False
12    2.0    2.0  False   True
13    NaN    NaN    NaN    NaN
14    1.0    1.0   True  False
15    NaN    NaN    NaN    NaN
16    NaN    NaN    NaN    NaN
17    2.0    NaN  False  False
18    NaN    NaN    NaN    NaN
19    1.0    NaN  False  False
20    2.0    NaN  False  False
21    NaN    NaN    NaN    NaN
22    NaN    NaN    NaN    NaN

Timeless · Answer

A proposition using a hacky rolling :

def patfinder(df, pat):
    def fn(ser):
        d = dict(index=ser.index, dtype="int")
        return ser.mask(ser.eq(0)).eq(pd.Series([*pat], **d)).all()

    groups = ((center:=df["minor_check"].dropna().astype("int")
               .rolling(5, center=True).apply(fn).eq(1))
               .to_frame("m").join(center.mask(~center)
                        .cumsum().rename("g")))
    
    window = (groups.mask(~center)
               .ffill(limit=2).bfill(limit=2).fillna(False)
               .reindex(df.index, fill_value=False))
        
    return center & df["major_check"].eq(int(pat[2])), window

Test (on this df):

pat = "21212" # or "12121"

found, window = df.pipe(patfinder, pat)

if not df.loc[found].empty:
    print(df.loc[window["m"]].join(window["g"]))
    # print(df.loc[window["m"]]) # without counter-groups
    
          time  open  high  low  close  minor_check  major_check    g
20  02/03/2022  1.02  1.02 1.02   1.02         2.00          NaN 1.00
23  07/03/2022  1.01  1.02 1.00   1.01         1.00          NaN 1.00
25  09/03/2022  1.03  1.03 1.02   1.02         2.00         2.00 1.00
28  14/03/2022  1.03  1.03 1.03   1.03         1.00          NaN 1.00
31  17/03/2022  1.04  1.04 1.03   1.03         2.00          NaN 1.00

[5 rows x 8 columns]

filter for specific sequences involving multiple columns and surrounding rows

Tags:

python

pattern-matching

pandas

dataframe

pandas-rolling

Annathema

2 Answers

OCa

Timeless

Recent Activity

Donate For Us

filter for specific sequences involving multiple columns and surrounding rows

Tags:

python

pattern-matching

pandas

dataframe

pandas-rolling

Annathema

2 Answers

OCa

Timeless

Related questions

Recent Activity

Donate For Us