I have a dataframe that looks like this:
>>> a_df
state
1 A
2 B
3 A
4 B
5 C
What I'd like to do is return all consecutive rows matching a certain sequence. For instance, if the sequence is ['A', 'B'], then the rows whose state is A followed immediately by a B should be returned. In the above example:
>>> cons_criteria(a_df, ['A', 'B'])
state
1 A
2 B
3 A
4 B
Or if the chosen array is ['A', 'B', 'C'], then the output should be:
>>> cons_criteria(a_df, ['A', 'B', 'C'])
state
3 A
4 B
5 C
I decided to do this by storing the current state, as well as the next state:
>>> df2 = a_df.copy()
>>> df2['state_0'] = a_df['state']
>>> df2['state_1'] = a_df['state'].shift(-1)
Now I can match with respect to state_0 and state_1. But this only returns the first row of each match:
>>> df2[(df2['state_0'] == 'A') & (df2['state_1'] == 'B')]
state
1 A
3 A
How should I fix the logic here so that all the consecutive rows are returned? Is there a better way to approach this in pandas?
I'd use a function like this
import numpy as np
import pandas as pd

def match_slc(s, seq):
    # get list, makes zip faster
    l = s.values.tolist()
    # count how many in sequence
    k = len(seq)
    # an empty sequence, or fewer values than the sequence, cannot match
    if k == 0 or len(l) < k:
        return s.iloc[[]]
    # generate numpy array of rolling windows, one row per start position
    a = np.array(list(zip(*[l[i:] for i in range(k)])))
    # positions (0 to len(a) - 1) of the windows in which
    # all k values match the sequence
    p = np.arange(len(a))[(a == seq).all(1)]
    # p tracks the beginning of each match; get all subsequent
    # positions of the match as well
    slc = np.unique(np.hstack([p + i for i in range(k)]))
    return s.iloc[slc]
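To see what the intermediate arrays hold, here is a small walk-through of the same steps on the values 'ABABC' with the sequence 'AB'. It is only an illustration of the function body above, run outside the function; the variable names mirror the ones used there.

```python
import numpy as np

l = list('ABABC')
seq = list('AB')
k = len(seq)

# each row of a is one window of k consecutive values:
# ['A','B'], ['B','A'], ['A','B'], ['B','C']
a = np.array(list(zip(*[l[i:] for i in range(k)])))

# start positions of the windows that equal seq
p = np.arange(len(a))[(a == seq).all(1)]

# expand each start position into the k positions it covers
slc = np.unique(np.hstack([p + i for i in range(k)]))

print(p.tolist())    # start positions of the matches
print(slc.tolist())  # every position that belongs to a match
```

np.unique matters when matches overlap (for example 'AA' inside 'AAA'), since the same position would otherwise be listed twice.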
demonstration
s = pd.Series(list('ABABC'))
print(match_slc(s, list('ABC')), '\n')
print(match_slc(s, list('AB')), '\n')
2 A
3 B
4 C
dtype: object
0 A
1 B
2 A
3 B
dtype: object
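The shift idea from the question can also be made to work on the DataFrame directly. Below is a minimal sketch, not the only way to do it: the function name cons_criteria is taken from the question, while the body and the col parameter are my own assumptions. It first flags the rows where a full match starts, then spreads each flag forward over the rows that match covers.

```python
import pandas as pd

def cons_criteria(df, seq, col="state"):
    # col is a hypothetical parameter naming the column to search
    s = df[col]
    k = len(seq)
    # starts is True on a row when that row and the k - 1 rows after it match seq
    starts = pd.Series(True, index=s.index)
    for offset, value in enumerate(seq):
        starts &= s.shift(-offset).eq(value)
    # spread each start flag forward over the k rows of its match
    mask = pd.Series(False, index=s.index)
    for offset in range(k):
        mask |= starts.shift(offset, fill_value=False)
    return df[mask]

a_df = pd.DataFrame({"state": list("ABABC")}, index=range(1, 6))
print(cons_criteria(a_df, ["A", "B"]))
print(cons_criteria(a_df, ["A", "B", "C"]))
```

On the example frame this keeps rows 1 to 4 for ['A', 'B'] and rows 3 to 5 for ['A', 'B', 'C'], and the original index labels are preserved because the result is selected with a boolean mask.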