pandas - Selecting pair of consecutive rows matching criteria

I have a dataframe that looks like this

>>> a_df
    state
1    A
2    B
3    A
4    B
5    C

I'd like to return all runs of consecutive rows that match a given sequence. For instance, if the sequence is ['A', 'B'], then every row whose state is A and is immediately followed by a B should be returned, together with that B row. In the above example:

>>> cons_criteria(a_df, ['A', 'B'])
    state
1    A
2    B
3    A
4    B

Or if the chosen array is ['A', 'B', 'C'], then the output should be

>>> cons_criteria(a_df, ['A', 'B', 'C'])
    state
3    A
4    B
5    C

I decided to do this by storing the current state, as well as the next state:

>>> df2 = a_df.copy()
>>> df2['state_0'] = a_df['state']
>>> df2['state_1'] = a_df['state'].shift(-1)

Now I can match on state_0 and state_1. But this returns only the first row of each matching pair, not the row that follows it:

>>> df2[(df2['state_0'] == 'A') & (df2['state_1'] == 'B')]
    state
1    A
3    A

How should I fix the logic here so that all the consecutive rows are returned? Is there a better way to approach this in pandas?

user1496984 asked Dec 14 '16 04:12

1 Answer

I'd use a function like this

import numpy as np
import pandas as pd

def match_slc(s, seq):
    # convert to a list first; this makes zip faster
    l = s.values.tolist()
    # number of elements in the sequence
    k = len(seq)
    # nothing can match an empty sequence or a series shorter than it
    if k == 0 or len(l) < k:
        return s.iloc[0:0]
    # 2-D array of rolling windows: row i holds l[i], ..., l[i + k - 1]
    a = np.array(list(zip(*[l[i:] for i in range(k)])))
    # positions of the windows in which all k values match seq
    p = np.arange(len(a))[(a == seq).all(1)]
    # p marks the start of each match; add the following k - 1
    # positions so that every row of every match is selected
    slc = np.unique(np.hstack([p + i for i in range(k)]))
    return s.iloc[slc]

Demonstration:

s = pd.Series(list('ABABC'))

print(match_slc(s, list('ABC')), '\n')
print(match_slc(s, list('AB')), '\n')

2    A
3    B
4    C
dtype: object 

0    A
1    B
2    A
3    B
dtype: object 
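
The shift-based attempt in the question can also be extended to return every row of each match. The following is only a minimal sketch of that idea, not part of the answer above: `cons_criteria` is the name used in the question, while the `col` parameter and the internal names `starts` and `keep` are assumptions added for illustration.

```python
import pandas as pd


def cons_criteria(df, seq, col="state"):
    """Return the rows of df that belong to a consecutive run matching seq."""
    k = len(seq)
    if k == 0 or len(df) < k:
        return df.iloc[0:0]

    # starts[i] is True when rows i .. i + k - 1 spell out seq.
    starts = pd.Series(True, index=df.index)
    for offset, value in enumerate(seq):
        # shift(-offset) lines up the value `offset` rows ahead with row i;
        # positions that run off the end become NaN and compare as False.
        starts = starts & df[col].shift(-offset).eq(value)

    # Keep a row if a match started at it or up to k - 1 rows before it.
    keep = pd.Series(False, index=df.index)
    for offset in range(k):
        keep = keep | starts.shift(offset, fill_value=False)

    return df[keep]


a_df = pd.DataFrame({"state": list("ABABC")}, index=range(1, 6))
print(cons_criteria(a_df, ["A", "B"]).index.tolist())       # [1, 2, 3, 4]
print(cons_criteria(a_df, ["A", "B", "C"]).index.tolist())  # [3, 4, 5]
```

This version keeps the original index labels and works on the DataFrame directly, at the cost of one `shift` per element of the sequence, so it is best suited to short sequences.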
piRSquared answered Sep 22 '22 23:09
