
Looking for a sequential pattern with condition

I have a df like this:

  Id  Event SeqNo
   1    A    1
   1    B    2
   1    C    3
   1    ABD  4
   1    A    5
   1    C    6
   1    A    7
   1    CDE  8
   1    D    9
   1    B    10 
   1    ABD  11
   1    D    12
   1    B    13
   1    CDE  14
   1    A    15

I am looking for the pattern "ABD followed by CDE without an event B in between". For example, the output for this df would be:

 Id  Event SeqNo
 1    ABD  4
 1    A    5
 1    C    6
 1    A    7
 1    CDE  8

This pattern can occur multiple times for a single Id, and I want to find the list of all such Ids and their respective counts (if possible).

No_body asked Feb 06 '19 17:02


2 Answers

Here's a vectorized one with some scaling trickery and leveraging convolution to find the required pattern -

import numpy as np
import pandas as pd

# Get the col in context and scale it to the three strings to form an ID array
# (1 for 'ABD', 2 for 'B', 3 for 'CDE', 0 for anything else)
a = df['Event']
id_ar = (a=='ABD') + 2*(a=='B') + 3*(a=='CDE')

# Mask of those specific strings and hence extract the corresponding masked df
# (.copy() so that adding a column below does not raise SettingWithCopyWarning)
mask = id_ar>0
df1 = df[mask].copy()

# Get pattern col with 1s at places with the pattern found, 0s elsewhere
df1['Pattern'] = (np.convolve(id_ar[mask],[9,1],'same')==28).astype(int)

# Groupby Id col and sum the pattern col for final output
out = df1.groupby(['Id'])['Pattern'].sum()

The convolution part might be a bit tricky. The idea is that id_ar holds the values 1, 2 and 3 for the strings 'ABD', 'B' and 'CDE' (and 0 for everything else, which the mask drops). We are looking for a 1 immediately followed by a 3, and convolving with the kernel [9,1] gives 1*1 + 3*9 = 28 for the window that has 'ABD' and then 'CDE'. Hence, we look for a convolution sum of 28 for the match. For 'ABD' followed by 'B' and then 'CDE', the sums are different (1*1 + 2*9 = 19, then 2*1 + 3*9 = 29), so that case is filtered out.
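To make the arithmetic concrete, here is a minimal sketch on a hand-written code array. The array `codes` is a made-up example (not taken from the question's data); it stands for an already filtered sequence B, ABD, CDE, B, ABD, B, CDE.

```python
import numpy as np

# Hypothetical filtered sequence: B, ABD, CDE, B, ABD, B, CDE
# encoded as 1 = 'ABD', 2 = 'B', 3 = 'CDE'
codes = np.array([2, 1, 3, 2, 1, 2, 3])

# With kernel [9, 1] and mode 'same', position n holds
# 9*codes[n] + codes[n-1] (the missing value before position 0 counts as 0)
conv = np.convolve(codes, [9, 1], 'same')
print(conv)        # [18 11 28 21 11 19 29]

# Only "1 then 3" (ABD then CDE) sums to 28; the hit sits on the CDE row
hits = conv == 28
print(hits.sum())  # 1
```

The second ABD is followed by B before CDE, which yields 19 and 29 rather than 28, so it is not counted.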

Sample run -

1) Input dataframe :

In [377]: df
Out[377]: 
   Id Event SeqNo
0   1     A     1
1   1     B     2
2   1     C     3
3   1   ABD     4
4   1     B     5
5   1     C     6
6   1     A     7
7   1   CDE     8
8   1     D     9
9   1     B    10
10  1   ABD    11
11  1     D    12
12  1     B    13
13  2     A     1
14  2     B     2
15  2     C     3
16  2   ABD     4
17  2     A     5
18  2     C     6
19  2     A     7
20  2   CDE     8
21  2     D     9
22  2     B    10
23  2   ABD    11
24  2     D    12
25  2     B    13
26  2   CDE    14
27  2     A    15

2) Intermediate filtered output (in column Pattern, a 1 marks the CDE row that completes the required pattern) :

In [380]: df1
Out[380]: 
   Id Event SeqNo  Pattern
1   1     B     2        0
3   1   ABD     4        0
4   1     B     5        0
7   1   CDE     8        0
9   1     B    10        0
10  1   ABD    11        0
12  1     B    13        0
14  2     B     2        0
16  2   ABD     4        0
20  2   CDE     8        1
22  2     B    10        0
23  2   ABD    11        0
25  2     B    13        0
26  2   CDE    14        0

3) Final output (pattern count per Id) :

In [381]: out
Out[381]: 
Id
1    0
2    1
Name: Pattern, dtype: int64
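
One case the sample run does not exercise: the convolution runs over the filtered rows of all Ids at once, so an ABD that is the last filtered row of one Id, followed by a CDE that is the first filtered row of the next Id, would also sum to 28. The following is a sketch of one possible guard, not part of the code above; the data and the names `same_id` and `Unguarded` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Id 1 ends with ABD and Id 2 starts with CDE
df = pd.DataFrame({
    'Id':    [1, 1, 2, 2, 2],
    'Event': ['B', 'ABD', 'CDE', 'ABD', 'CDE'],
    'SeqNo': [1, 2, 1, 2, 3],
})

a = df['Event']
id_ar = ((a == 'ABD') + 2 * (a == 'B') + 3 * (a == 'CDE')).to_numpy()
mask = id_ar > 0
df1 = df[mask].copy()

hit = np.convolve(id_ar[mask], [9, 1], 'same') == 28

# Keep a hit only if the previous filtered row belongs to the same Id
same_id = (df1['Id'] == df1['Id'].shift()).to_numpy()

df1['Unguarded'] = hit.astype(int)
df1['Pattern'] = (hit & same_id).astype(int)

print(df1.groupby('Id')['Unguarded'].sum())  # Id 2 is credited with 2 matches
print(df1.groupby('Id')['Pattern'].sum())    # Id 2 has 1 match
```

This assumes the df is sorted by Id and then by SeqNo, so that each Id forms one contiguous block.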
Divakar answered Nov 14 '22 13:11


My solution is based on the assumption that anything other than ABD, CDE and B is irrelevant to our problem, so I get rid of those rows first with a filtering operation.

Then, what I want to know is whether there is an ABD followed by a CDE without a B in between. I shift the Event column by one row (note that one row of the filtered df doesn't have to be one step in units of SeqNo).

Then I check every row of the new df for Event == ABD and Event_1_Step == CDE, which means there wasn't a B in between, though possibly other events like A or C, or nothing at all. This gives me a boolean for every row; if I sum them up, I get the count.

Finally, I have to make sure all of this is done at Id level, so I use .groupby.
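
The steps above can also be sketched as a loop-free count per Id. This is one possible reading of the description, not the only one; the Id 2 rows and the names `key`, `nxt`, `is_match` and `counts` are made up for illustration.

```python
import pandas as pd

# The question's example for Id 1, plus a hypothetical Id 2 with a B in between
df = pd.DataFrame({
    'Id':    [1] * 15 + [2] * 3,
    'Event': ['A', 'B', 'C', 'ABD', 'A', 'C', 'A', 'CDE', 'D', 'B',
              'ABD', 'D', 'B', 'CDE', 'A', 'ABD', 'B', 'CDE'],
    'SeqNo': list(range(1, 16)) + [1, 2, 3],
})

df = df.sort_values(['Id', 'SeqNo'])

# Step 1: keep only the events that can start, end or break the pattern
key = df[df['Event'].isin(['ABD', 'CDE', 'B'])]

# Step 2: next relevant event, shifted within each Id
nxt = key.groupby('Id')['Event'].shift(-1)

# Step 3: ABD directly followed by CDE among the relevant events
is_match = (key['Event'] == 'ABD') & (nxt == 'CDE')

# Step 4: count the matches per Id
counts = is_match.groupby(key['Id']).sum()
print(counts)
```

Ids that contain none of ABD, CDE or B do not appear in `counts` at all, since their rows are dropped in step 1.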

IMPORTANT: This solution assumes that your df is sorted by Id first and then by SeqNo. If it is not, please sort it first.

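import pandas as pd

df = pd.read_csv("path/to/file.csv")

# Keep only the relevant events; .copy() avoids SettingWithCopyWarning below
df2 = df[df["Event"].isin(["ABD", "CDE", "B"])].copy()

# Next relevant event and its SeqNo, shifted within each Id so that the
# last row of one Id is never paired with the first row of the next Id
df2["Event_1_Step"] = df2.groupby("Id")["Event"].shift(-1)
df2["SeqNo_1_Step"] = df2.groupby("Id")["SeqNo"].shift(-1)

for id_, id_df in df2.groupby("Id"):
    # Rows where ABD is directly followed by CDE among the relevant events
    hits = id_df[(id_df["Event"] == "ABD") & (id_df["Event_1_Step"] == "CDE")]
    print(id_, len(hits))  # Id and its match count
    for _, row in hits.iterrows():
        # Full stretch of the original df from the ABD row to the CDE row
        print(df[(df["Id"] == id_) & df["SeqNo"].between(row["SeqNo"], row["SeqNo_1_Step"])])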
D_Serg answered Nov 14 '22 11:11