
Finding consecutive segments in a pandas data frame

Tags:

python

pandas

I have a pandas.DataFrame with measurements taken at consecutive points in time. At each of these points the system under observation was in a distinct state, so the DataFrame also contains a column with the state of the system at each measurement. State changes are much slower than the measurement interval. As a result, the column indicating the states might look like this (index: state):

    1:  3
    2:  3
    3:  3
    4:  3
    5:  4
    6:  4
    7:  4
    8:  4
    9:  1
    10: 1
    11: 1
    12: 1
    13: 1

Is there an easy way to retrieve the indices of each segment of consecutively equal states? That is, I would like to get something like this:

[[1,2,3,4], [5,6,7,8], [9,10,11,12,13]] 

The result may also be in a form other than plain lists.

The only solution I could think of so far is manually iterating over the rows, finding the segment change points, and reconstructing the indices from these change points, but I hope there is an easier solution.

languitar asked Jan 16 '13 12:01


1 Answer

One-liner:

df.reset_index().groupby('A')['index'].apply(np.array) 

Code for example:

    In [1]: import numpy as np

    In [2]: from pandas import *

    In [3]: df = DataFrame([3]*4+[4]*4+[1]*4, columns=['A'])

    In [4]: df
    Out[4]:
        A
    0   3
    1   3
    2   3
    3   3
    4   4
    5   4
    6   4
    7   4
    8   1
    9   1
    10  1
    11  1

    In [5]: df.reset_index().groupby('A')['index'].apply(np.array)
    Out[5]:
    A
    1    [8, 9, 10, 11]
    3      [0, 1, 2, 3]
    4      [4, 5, 6, 7]

You can also directly access the information from the groupby object:

    In [1]: grp = df.groupby('A')

    In [2]: grp.indices
    Out[2]:
    {1L: array([ 8,  9, 10, 11], dtype=int64),
     3L: array([0, 1, 2, 3], dtype=int64),
     4L: array([4, 5, 6, 7], dtype=int64)}

    In [3]: grp.indices[3]
    Out[3]: array([0, 1, 2, 3], dtype=int64)

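To address the situation that DSM mentioned, where the same state shows up again in a later, separate segment (here a second run of 3s in rows 12 to 15), you could do something like: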

    In [1]: df['block'] = (df.A.shift(1) != df.A).astype(int).cumsum()

    In [2]: df
    Out[2]:
        A  block
    0   3      1
    1   3      1
    2   3      1
    3   3      1
    4   4      2
    5   4      2
    6   4      2
    7   4      2
    8   1      3
    9   1      3
    10  1      3
    11  1      3
    12  3      4
    13  3      4
    14  3      4
    15  3      4
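To see why the shift/compare/cumsum expression numbers the segments, it may help to look at the intermediate steps one by one. The following is only a sketch on a standalone Series; the names `shifted`, `is_start` and `block` are made up for illustration and are not part of the answer's code:

```python
import pandas as pd

# Same states as in the frame above, including the repeated run of 3s.
s = pd.Series([3] * 4 + [4] * 4 + [1] * 4 + [3] * 4, name="A")

shifted = s.shift(1)       # state of the previous row; NaN for the first row
is_start = shifted != s    # True wherever a new segment begins
block = is_start.cumsum()  # running count of segment starts = segment id

# Show the three steps side by side.
print(pd.DataFrame({"A": s, "prev": shifted, "start": is_start, "block": block}))
```

The first row always counts as a segment start, because NaN compares unequal to everything. For the same reason, a state column that itself contains NaN would put every NaN row into its own block.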

Now group by both columns and apply np.array:

    In [77]: df.reset_index().groupby(['A','block'])['index'].apply(np.array)
    Out[77]:
    A  block
    1  3          [8, 9, 10, 11]
    3  1            [0, 1, 2, 3]
       4        [12, 13, 14, 15]
    4  2            [4, 5, 6, 7]
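If plain lists in order of appearance are wanted, as in the question, grouping on the block id alone should be enough, since each block already contains a single state. A minimal sketch, assuming a reasonably recent pandas and the same example data (the variable names `block` and `segments` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": [3] * 4 + [4] * 4 + [1] * 4 + [3] * 4})

# Label each run of equal consecutive values with its own block id.
block = (df["A"] != df["A"].shift()).cumsum()

# Group the index labels by block id. Block ids increase in order of
# appearance, so the sorted groups come out in the original row order.
segments = [idx.tolist() for _, idx in df.index.to_series().groupby(block)]
print(segments)
```

This collects index labels rather than positions, so it should also work when the frame has a non-default index, such as the timestamps of the measurements.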
Zelazny7 answered Sep 20 '22 01:09
