Identifying consecutive occurrences of a value in a column of a pandas DataFrame

Tags: pandas

I have a DataFrame df with a column Count. I want to add a new column that contains 1 for every row belonging to a run of two or more consecutive 1s in Count, and 0 otherwise. My desired output is:

Count  New_Value
1      0 
0      0
1      1
1      1
0      0
0      0
1      1
1      1 
1      1
0      0

I am thinking I may need to use itertools, but I have been reading about it and haven't found what I need yet. I would also like the method to work for any number of consecutive occurrences, not just 2. For example, sometimes I need to detect 10 consecutive occurrences; I only use 2 in the example here.

asked Jun 21 '16 01:06

Stefano Potter

2 Answers

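You could label each run of consecutive equal values, take the size of each run, and multiply by Count so that runs of 0 drop out: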

df['consecutive'] = df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count

to get:

   Count  consecutive
0      1            1
1      0            0
2      1            2
3      1            2
4      0            0
5      0            0
6      1            3
7      1            3
8      1            3
9      0            0
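
To see why this works, the one-liner can be broken into its intermediate steps. This is only a sketch: it rebuilds the example frame from the question, and the names changed, run_id and run_len are illustrative, not part of the original expression.

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame({"Count": [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]})

# True wherever a value differs from the previous row.
# The first row is always True because shift() puts NaN there.
changed = df["Count"] != df["Count"].shift()

# A running sum of those flags gives every run of equal values its own id.
run_id = changed.cumsum()

# Length of the run that each row belongs to.
run_len = df["Count"].groupby(run_id).transform("size")

# Multiplying by Count zeroes out the rows that sit in runs of 0.
consecutive = run_len * df["Count"]
```

Here run_id comes out as 1, 2, 3, 3, 4, 4, 5, 5, 5, 6, so rows 2-3 and rows 6-8 each share one group.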

From here you can apply any threshold:

threshold = 2
df['consecutive'] = (df.consecutive >= threshold).astype(int)

to get:

   Count  consecutive
0      1            0
1      0            0
2      1            1
3      1            1
4      0            0
5      0            0
6      1            1
7      1            1
8      1            1
9      0            0

or, in a single step:

(df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
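
Since the question asks for an arbitrary run length, the same idea could be wrapped in a small helper. This is a minimal sketch: flag_runs and its value and threshold parameters are hypothetical names, and it compares against value instead of multiplying by Count so that it is not tied to the column holding only 0s and 1s.

```python
import pandas as pd

def flag_runs(s, value=1, threshold=2):
    """Return 1 where s equals `value` inside a run of at least `threshold` rows, else 0."""
    # Give each run of consecutive equal values its own id
    run_id = (s != s.shift()).cumsum()
    # Length of the run each row belongs to
    run_len = s.groupby(run_id).transform("size")
    return ((s == value) & (run_len >= threshold)).astype(int)

df = pd.DataFrame({"Count": [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]})
df["New_Value"] = flag_runs(df["Count"], threshold=2)
```

With threshold=3 only rows 6 to 8 would be flagged in this example.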

In terms of efficiency, the pandas methods are significantly faster than a Python loop as the problem grows. With the example frame repeated 1,000 times:

df = pd.concat([df for _ in range(1000)])

%timeit (df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
1000 loops, best of 3: 1.47 ms per loop

compared to the itertools loop (after running from itertools import groupby):

%%timeit
l = []
for k, g in groupby(df.Count):
    size = sum(1 for _ in g)
    if k == 1 and size >= 2:
        l = l + [1]*size
    else:
        l = l + [0]*size
pd.Series(l)

10 loops, best of 3: 76.7 ms per loop

answered Sep 20 '22 01:09

Stefan

Not sure if this is optimized, but you can give it a try:

from itertools import groupby
import pandas as pd

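l = []
for k, g in groupby(df.Count):
    size = sum(1 for _ in g)  # length of the current run of equal values
    if k == 1 and size >= 2:
        l.extend([1] * size)  # run of 1s that is long enough: flag every row
    else:
        l.extend([0] * size)

# Assign the list directly so values are matched by position, not by index label
df['new_Value'] = l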

df

   Count  new_Value
0      1          0
1      0          0
2      1          1
3      1          1
4      0          0
5      0          0
6      1          1
7      1          1
8      1          1
9      0          0
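
To use this for run lengths other than 2, the loop could be turned into a function. This is a rough sketch: flag_runs_itertools, value and threshold are made-up names for illustration, and it works on any iterable rather than only a DataFrame column.

```python
from itertools import groupby

def flag_runs_itertools(values, value=1, threshold=2):
    """Return a list with 1 for items inside a run of `value` at least `threshold` long, else 0."""
    flags = []
    for key, group in groupby(values):
        size = sum(1 for _ in group)  # length of the current run
        flag = int(key == value and size >= threshold)
        flags.extend([flag] * size)
    return flags

counts = [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]
flags = flag_runs_itertools(counts, threshold=2)
```

The result is a plain list, so it can be assigned to a column with df['new_Value'] = flags.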

answered Sep 19 '22 01:09

Psidom
