I have a pandas.DataFrame with measurements taken at consecutive points in time. The system under observation had a distinct state at each point in time, so the DataFrame also contains a column with the state of the system at each measurement. State changes are much slower than the measurement interval. As a result, the column indicating the states might look like this (index: state):
    1:  3
    2:  3
    3:  3
    4:  3
    5:  4
    6:  4
    7:  4
    8:  4
    9:  1
    10: 1
    11: 1
    12: 1
    13: 1

Is there an easy way to retrieve the indices of each segment of consecutively equal states? That means I would like to get something like this:
    [[1,2,3,4], [5,6,7,8], [9,10,11,12,13]]

The result might also be in something other than plain lists.
The only solution I could think of so far is manually iterating over the rows, finding segment change points and reconstructing the indices from these change points, but I hope there is an easier solution.
One-liner:
    df.reset_index().groupby('A')['index'].apply(np.array)

Code for example:
    In [1]: import numpy as np

    In [2]: from pandas import *

    In [3]: df = DataFrame([3]*4+[4]*4+[1]*4, columns=['A'])

    In [4]: df
    Out[4]:
        A
    0   3
    1   3
    2   3
    3   3
    4   4
    5   4
    6   4
    7   4
    8   1
    9   1
    10  1
    11  1

    In [5]: df.reset_index().groupby('A')['index'].apply(np.array)
    Out[5]:
    A
    1    [8, 9, 10, 11]
    3      [0, 1, 2, 3]
    4      [4, 5, 6, 7]

You can also directly access the information from the groupby object:
    In [1]: grp = df.groupby('A')

    In [2]: grp.indices
    Out[2]:
    {1L: array([ 8,  9, 10, 11], dtype=int64),
     3L: array([0, 1, 2, 3], dtype=int64),
     4L: array([4, 5, 6, 7], dtype=int64)}

    In [3]: grp.indices[3]
    Out[3]: array([0, 1, 2, 3], dtype=int64)

To address the situation that DSM mentioned (a plain groupby merges non-adjacent runs of the same state into one group), you could do something like:
    In [1]: df['block'] = (df.A.shift(1) != df.A).astype(int).cumsum()

    In [2]: df
    Out[2]:
        A  block
    0   3      1
    1   3      1
    2   3      1
    3   3      1
    4   4      2
    5   4      2
    6   4      2
    7   4      2
    8   1      3
    9   1      3
    10  1      3
    11  1      3
    12  3      4
    13  3      4
    14  3      4
    15  3      4

(Note that df has been extended here with a second run of state 3 at rows 12-15.) Now group by both columns and apply the same function:
    In [77]: df.reset_index().groupby(['A','block'])['index'].apply(np.array)
    Out[77]:
    A  block
    1  3          [8, 9, 10, 11]
    3  1            [0, 1, 2, 3]
       4        [12, 13, 14, 15]
    4  2            [4, 5, 6, 7]
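Putting this together for the original question (1-based index, segments in order of appearance), a minimal sketch: since each run already has a unique block label, grouping by the run label alone is enough, and iterating over the groups yields the segments in index order.

```python
import pandas as pd

# Reproduce the question's data: 1-based index, slowly changing states.
df = pd.DataFrame({'A': [3]*4 + [4]*4 + [1]*5}, index=range(1, 14))

# Label each run of consecutive equal states, then collect the index
# labels of each run. shift() makes the first comparison NaN != 3 -> True,
# so the cumulative sum starts counting at 1.
block = (df['A'] != df['A'].shift()).cumsum()
segments = [list(g.index) for _, g in df.groupby(block)]
print(segments)  # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12, 13]]
```

Because the block labels are monotonically increasing, the groups come back in the same order as the segments appear in the data, with no extra sorting needed.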