Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to count groups of one in a column of pandas DataFrame [duplicate]

In dataframe I have column flag, I want to count groups of 1 in column

df=pd.DataFrame({'flag':[1,1,0,1,0,1,1,0,1,1,1]}) 


df_out=pd.DataFrame({'groups_of_one_count':[4]}) 
like image 772
Edward Avatar asked Dec 08 '22 10:12

Edward


2 Answers

With NumPy without any appending/concatenation, again for performance -

a = df.flag.values
out = (a[1:]>a[:-1]).sum() + (a[0]==1)

Explanation : We look for next element to be greater than the previous one. If satisfied, signals starts of an island/group of 1s. We just get the summation as the final output. For the corner case, when such a group starts from the first element, we capture it separately.

Timings with given sample scaled up by 10000x -

In [64]: df=pd.DataFrame({'flag':[1,1,0,1,0,1,1,0,1,1,1]})

In [65]: df = pd.concat([df]*10000)

# @Quang Hoang's soln
In [66]: %timeit (np.diff(np.append(df.flag.values,0)) == -1).sum()
362 µs ± 26.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# From this post
In [67]: %%timeit   
    ...: a = df.flag.values
    ...: out = (a[1:]>a[:-1]).sum() + (a[0]==1)
191 µs ± 5.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# @jezrael's soln
In [68]: %timeit (df['flag'].ne(df['flag'].shift()) & df['flag'].eq(1)).sum()
1.39 ms ± 8.86 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# @YOBEN_S's soln
In [69]: %timeit df[df.flag.ne(0)].index.to_series().diff().ne(1).sum()
2.92 ms ± 209 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Generic case (when there might be numbers other than just 0s and 1s)

The solution would modify to something along these lines, by getting the mask of number to be searched (1 here) and operating on it -

a = df.flag.values
m = a==1
out = (m[1:] & ~m[:-1]).sum() + m[0]
like image 52
Divakar Avatar answered Dec 10 '22 00:12

Divakar


Idea is compare consecutive groups by ne for not equal with Series.shift and filter only groups with 1:

a = (df['flag'].ne(df['flag'].shift()) & df['flag'].eq(1)).sum()
print (a)
4

df_out=pd.DataFrame({'groups_of_one_count':[a]}) 
print (df_out)
   groups_of_one_count
0                    4

Details:

print (df.assign(consec=df['flag'].ne(df['flag'].shift()),
                 eq1 = df['flag'].eq(1),
                 chained = (df['flag'].ne(df['flag'].shift()) & df['flag'].eq(1))
       ))
    flag  consec    eq1  chained
0      1    True   True     True
1      1   False   True    False
2      0    True  False    False
3      1    True   True     True
4      0    True  False    False
5      1    True   True     True
6      1   False   True    False
7      0    True  False    False
8      1    True   True     True
9      1   False   True    False
10     1   False   True    False
like image 27
jezrael Avatar answered Dec 09 '22 22:12

jezrael