I have a pandas.DataFrame with measurements taken at consecutive points in time. Along with each measurement the system under observation had a distinct state at each point in time. Hence, the DataFrame also contains a column with the state of the system at each measurement. State changes are much slower than the measurement interval. As a result, the column indicating the states might look like this (index: state):
1: 3
2: 3
3: 3
4: 3
5: 4
6: 4
7: 4
8: 4
9: 1
10: 1
11: 1
12: 1
13: 1
Is there an easy way to retrieve the indices of each segment of consecutively equal states? That is, I would like to get something like this:
[[1,2,3,4], [5,6,7,8], [9,10,11,12,13]]
The result does not have to be plain lists; any container type would do.
The only solution I have come up with so far is to iterate manually over the rows, find the segment change points, and reconstruct the indices from those change points, but I hope there is an easier way.
One-liner:
df.reset_index().groupby('A')['index'].apply(np.array)
Code for example:
In [1]: import numpy as np

In [2]: from pandas import *

In [3]: df = DataFrame([3]*4+[4]*4+[1]*4, columns=['A'])

In [4]: df
Out[4]:
    A
0   3
1   3
2   3
3   3
4   4
5   4
6   4
7   4
8   1
9   1
10  1
11  1

In [5]: df.reset_index().groupby('A')['index'].apply(np.array)
Out[5]:
A
1    [8, 9, 10, 11]
3      [0, 1, 2, 3]
4      [4, 5, 6, 7]
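For reference, the same session as a self-contained script using current pandas conventions (`import pandas as pd` rather than `from pandas import *`) might look like this:

```python
import numpy as np
import pandas as pd

# Build the example frame: three runs of states 3, 4, 1
df = pd.DataFrame({'A': [3]*4 + [4]*4 + [1]*4})

# Collect the row positions of each state value into an array
groups = df.reset_index().groupby('A')['index'].apply(np.array)

print(groups[3])  # positions where A == 3
```

Note that this groups by state *value*, so it only matches the desired output as long as each state occurs in a single consecutive run, which is the situation addressed further below.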
You can also directly access the information from the groupby object:
In [1]: grp = df.groupby('A')

In [2]: grp.indices
Out[2]:
{1L: array([ 8,  9, 10, 11], dtype=int64),
 3L: array([0, 1, 2, 3], dtype=int64),
 4L: array([4, 5, 6, 7], dtype=int64)}

In [3]: grp.indices[3]
Out[3]: array([0, 1, 2, 3], dtype=int64)
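The `1L` suffixes and `dtype=int64` reprs in that transcript are Python 2 / platform artifacts; on current Python and pandas the same attribute gives a plain dict of NumPy arrays:

```python
import pandas as pd

df = pd.DataFrame({'A': [3]*4 + [4]*4 + [1]*4})
grp = df.groupby('A')

# grp.indices maps each group key to the positional row indices of that group,
# without ever materializing the grouped frames
idx = grp.indices

print(idx[3])  # rows where A == 3
```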
To address the situation that DSM mentioned (the same state recurring in separate, non-adjacent segments, which a plain `groupby('A')` would merge into one group), you could do something like:
In [1]: df['block'] = (df.A.shift(1) != df.A).astype(int).cumsum()

In [2]: df
Out[2]:
    A  block
0   3      1
1   3      1
2   3      1
3   3      1
4   4      2
5   4      2
6   4      2
7   4      2
8   1      3
9   1      3
10  1      3
11  1      3
12  3      4
13  3      4
14  3      4
15  3      4
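The idea behind the `shift`/`cumsum` trick: comparing each value to the previous row marks the start of every new run with `True`, and a cumulative sum over those booleans assigns each run its own integer label. A minimal standalone sketch:

```python
import pandas as pd

# State 3 occurs in two separate runs, so grouping by value alone would merge them
df = pd.DataFrame({'A': [3]*4 + [4]*4 + [1]*4 + [3]*4})

# True wherever the value differs from the previous row (the first row compares
# against NaN and is therefore also True); cumsum turns these change points
# into consecutive run labels 1, 2, 3, ...
df['block'] = (df['A'].shift(1) != df['A']).astype(int).cumsum()

print(df['block'].tolist())
```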
Now group by both columns and apply `np.array`:
In [77]: df.reset_index().groupby(['A','block'])['index'].apply(np.array)
Out[77]:
A  block
1  3        [8, 9, 10, 11]
3  1          [0, 1, 2, 3]
   4      [12, 13, 14, 15]
4  2          [4, 5, 6, 7]
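If the goal is exactly the list-of-segments output from the question, in original order, one sketch is to group by the run label alone (grouping by `['A', 'block']` sorts by state value first):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [3]*4 + [4]*4 + [1]*4 + [3]*4})

# Label each consecutive run of equal states
block = (df['A'].shift(1) != df['A']).astype(int).cumsum()

# Grouping by the run label alone keeps segments in their original order
segments = df.reset_index().groupby(block)['index'].apply(np.array).tolist()

print(segments)
```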