Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to count longest uninterrupted sequence in pandas

Tags:

python

pandas

Let's say I have pd.Series like below

s = pd.Series([False, True, False,True,True,True,False, False])    

0    False
1     True
2    False
3     True
4     True
5     True
6    False
7    False
dtype: bool

I want to know how long is the longest True sequence, in this example, it is 3.

I tried it in a stupid way.

s_list = s.tolist()
count = 0
max_count = 0
for item in s_list:
    if item:
        count +=1
    else:
        if count>max_count:
            max_count = count
        count = 0
print(max_count)

It will print 3, but in a Series of all True, it will print 0

like image 798
Dawei Avatar asked Feb 21 '18 02:02

Dawei


Video Answer


1 Answers

Option 1
Use a the series itself to mask the cumulative sum of the negation. Then use value_counts

(~s).cumsum()[s].value_counts().max()

3

explanation

  1. (~s).cumsum() is a pretty standard way to produce distinct True/False groups

    0    1
    1    1
    2    2
    3    2
    4    2
    5    2
    6    3
    7    4
    dtype: int64
    
  2. But you can see that the group we care about is represented by the 2s and there are four of them. That's because the group is initiated by the first False (which becomes True with (~s)). Therefore, we mask this cumulative sum with the boolean mask we started with.

    (~s).cumsum()[s]
    
    1    1
    3    2
    4    2
    5    2
    dtype: int64
    
  3. Now we see the three 2s pop out and we just have to use a method to extract them. I used value_counts and max.


Option 2
Use factorize and bincount

a = s.values
b = pd.factorize((~a).cumsum())[0]
np.bincount(b[a]).max()

3

explanation
This is a similar explanation as for option 1. The main difference is in how I a found the max. I use pd.factorize to tokenize the values into integers ranging from 0 to the total number of unique values. Given the actual values we had in (~a).cumsum() we didn't strictly need this part. I used it because it's a general purpose tool that could be used on arbitrary group names.

After pd.factorize I use those integer values in np.bincount which accumulates the total number of times each integer is used. Then take the maximum.


Option 3
As stated in the explanation of option 2, this also works:

a = s.values
np.bincount((~a).cumsum()[a]).max()

3
like image 77
piRSquared Avatar answered Sep 23 '22 23:09

piRSquared