Grouping dataframe based on consecutive occurrence of values

I have a pandas DataFrame with one column that is either true or false (titled 'condition' in the example below). I would like to group the rows by consecutive true or false values. I have tried to use pandas.groupby but haven't succeeded with it, although I think that's down to my lack of understanding. An example of the dataframe can be found below:

import pandas as pd

df = pd.DataFrame({'condition': [1, 1, 0, 0, 1, 1, 1],
                   'H': [2, 7, 1, 6.5, 7, 9, 22],
                   't': [1.1, 1.5, 0.9, 1.6, 1.1, 1.8, 2.0]})
print(df)
index condition   H  t
0          1  2    1.1
1          1  7    1.5
2          0  1    0.9
3          0  6.5  1.6
4          1  7    1.1
5          1  9    1.8
6          1  22   2.0

Ideally the output of the program would be something along the lines of what can be found below. I was thinking of using some sort of 'grouping' method to make it easier to call each set of results but not sure if this is the best method. Any help would be greatly appreciated.

index condition   H  t group
0          1  2    1.1  1
1          1  7    1.5  1
2          0  1    0.9  2
3          0  6.5  1.6  2
4          1  7    1.1  3 
5          1  9    1.8  3
6          1  22   2.0  3     
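
(What I mean by "calling each set of results": once a group column like the one above exists, I imagine each consecutive run could be pulled out with a plain groupby, e.g.:)

# hypothetical usage once a 'group' column has been added to df
for group_id, run in df.groupby('group'):
    print(group_id)
    print(run[['condition', 'H', 't']])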
asked by Rodrigo Prodohl

2 Answers

Since you're dealing with 0/1s, here's another alternative using diff + cumsum -

df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1    
df

       condition     H    t  group
index                             
0              1   2.0  1.1      1
1              1   7.0  1.5      1
2              0   1.0  0.9      2
3              0   6.5  1.6      2
4              1   7.0  1.1      3
5              1   9.0  1.8      3
6              1  22.0  2.0      3
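
To see what's going on, here are the intermediate steps on the example frame (a rough sketch, with the resulting values shown as comments): diff flags every row where condition changes, abs turns the -1 steps into 1, and cumsum counts how many changes have happened so far.

changes = df.condition.diff().abs()  # NaN, 0, 1, 0, 1, 0, 0
groups = changes.cumsum()            # NaN, 0, 1, 1, 2, 2, 2
# fillna(0) handles the leading NaN and + 1 makes the group ids start at 1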

If you don't mind floats, this can be made a little faster.

df['group'] = df.condition.diff().abs().cumsum() + 1
df.loc[0, 'group'] = 1
df

   index  condition     H    t  group
0      0          1   2.0  1.1    1.0
1      1          1   7.0  1.5    1.0
2      2          0   1.0  0.9    2.0
3      3          0   6.5  1.6    2.0
4      4          1   7.0  1.1    3.0
5      5          1   9.0  1.8    3.0
6      6          1  22.0  2.0    3.0

Here's the version with numpy equivalents -

df['group'] = 1
df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
df


       condition     H    t  group
index                             
0              1   2.0  1.1      1
1              1   7.0  1.5      1
2              0   1.0  0.9      2
3              0   6.5  1.6      2
4              1   7.0  1.1      3
5              1   9.0  1.8      3
6              1  22.0  2.0      3

On my machine, here are the timings -

df = pd.concat([df] * 100000, ignore_index=True)

%timeit df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1 
10 loops, best of 3: 25.1 ms per loop

%%timeit
df['group'] = df.condition.diff().abs().cumsum() + 1
df.loc[0, 'group'] = 1

10 loops, best of 3: 23.4 ms per loop

%%timeit
df['group'] = 1
df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1

10 loops, best of 3: 21.4 ms per loop

For comparison, the ne + shift approach from the other answer:

%timeit df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
100 loops, best of 3: 15.8 ms per loop
answered by cs95


Compare the column with its shifted version using ne (!=) and then use cumsum:

df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
print (df)
       condition     H    t  group
index                             
0              1   2.0  1.1      1
1              1   7.0  1.5      1
2              0   1.0  0.9      2
3              0   6.5  1.6      2
4              1   7.0  1.1      3
5              1   9.0  1.8      3
6              1  22.0  2.0      3

Detail:

print (df['condition'].ne(df['condition'].shift()))
index
0     True
1    False
2     True
3    False
4     True
5    False
6    False
Name: condition, dtype: bool
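
Since a boolean Series is summed as 0/1, cumsum bumps the counter at every True, i.e. at the start of each new run, which gives the group ids directly (values shown for the example frame):

print (df['condition'].ne(df['condition'].shift()).cumsum())
# -> 1, 1, 2, 2, 3, 3, 3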

Timings:

df = pd.concat([df]*100000).reset_index(drop=True)


In [54]: %timeit df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
100 loops, best of 3: 12.2 ms per loop

In [55]: %timeit df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1
10 loops, best of 3: 24.5 ms per loop

In [56]: %%timeit
    ...: df['group'] = 1
    ...: df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
    ...: 
10 loops, best of 3: 26.6 ms per loop
answered by jezrael


