I have a DataFrame 'work' with a non-consecutive index; here is an example:
Index Column1 Column2
4464 10.5 12.7
4465 11.3 12.8
4466 10.3 22.8
5123 11.3 21.8
5124 10.6 22.4
5323 18.6 23.5
I need to extract from this DataFrame new DataFrames containing only rows whose index values are consecutive; in this case my goal is to get
DF_1.index=[4464,4465,4466]
DF_2.index=[5123,5124]
DF_3.index=[5323]
maintaining all the columns.
Can anyone help me?
groupby
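Assuming the example data is already loaded into a DataFrame named df (a minimal setup sketch so the snippets below can be run as-is):

import numpy as np
import pandas as pd

# Rebuild the example DataFrame from the question
df = pd.DataFrame(
    {'Column1': [10.5, 11.3, 10.3, 11.3, 10.6, 18.6],
     'Column2': [12.7, 12.8, 22.8, 21.8, 22.4, 23.5]},
    index=pd.Index([4464, 4465, 4466, 5123, 5124, 5323], name='Index'))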
You can make a perfectly "consecutive" array with
np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
If we subtract this positional array from a monotonically increasing index, the members of each consecutive run all map to the same value, because the index and the positions increase in lockstep within a run. That constant value is a convenient key to group by.
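For the example index, this key evaluates to the following (a quick sketch; only the grouping values matter, not the exact repr):

key = df.index - np.arange(len(df))
# -> [4464, 4464, 4464, 5120, 5120, 5318]: constant within each consecutive run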
# Rows whose (index - position) value is equal land in the same group
list_of_df = [d for _, d in df.groupby(df.index - np.arange(len(df)))]
And print each one to prove it
print(*list_of_df, sep='\n\n')
Column1 Column2
Index
4464 10.5 12.7
4465 11.3 12.8
4466 10.3 22.8
Column1 Column2
Index
5123 11.3 21.8
5124 10.6 22.4
Column1 Column2
Index
5323 18.6 23.5
np.split
You can use np.flatnonzero to identify the positions where the index differences are not equal to 1, and avoid using cumsum and groupby entirely.
list_of_df = np.split(df, np.flatnonzero(np.diff(df.index) != 1) + 1)
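For the example index, the intermediate values behind that one-liner look like this (a sketch of the split positions):

np.diff(df.index)                           # array([  1,   1, 657,   1, 199])
np.flatnonzero(np.diff(df.index) != 1) + 1  # array([3, 5]): split before rows 3 and 5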
Proof
print(*list_of_df, sep='\n\n')
Column1 Column2
Index
4464 10.5 12.7
4465 11.3 12.8
4466 10.3 22.8
Column1 Column2
Index
5123 11.3 21.8
5124 10.6 22.4
Column1 Column2
Index
5323 18.6 23.5
Here is an alternative:
grouper = (~(pd.Series(df.index).diff() == 1)).cumsum().values
dfs = [dfx for _ , dfx in df.groupby(grouper)]
This relies on the fact that within a consecutive run the index difference is 1 (diff == 1); negating that and taking the cumulative sum assigns a distinct group number to each run.
Full example:
import pandas as pd
from io import StringIO
data = '''\
Index Column1 Column2
4464 10.5 12.7
4465 11.3 12.8
4466 10.3 22.8
5123 11.3 21.8
5124 10.6 22.4
5323 18.6 23.5
'''
fileobj = StringIO(data)
df = pd.read_csv(fileobj, sep=r'\s+', index_col='Index')
non_sequence = pd.Series(df.index).diff() != 1
grouper = non_sequence.cumsum().values
dfs = [dfx for _ , dfx in df.groupby(grouper)]
print(dfs[0])
# Column1 Column2
#Index
#4464 10.5 12.7
#4465 11.3 12.8
#4466 10.3 22.8
Another way of seeing it is that we look for the non-sequence points to group by, which might be more readable:
non_sequence = pd.Series(df.index).diff() != 1
grouper = non_sequence.cumsum().values
dfs = [dfx for _ , dfx in df.groupby(grouper)]
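For the example data, the intermediate values are (a sketch; the grouper simply numbers the consecutive runs):

pd.Series(df.index).diff()   # [NaN, 1.0, 1.0, 657.0, 1.0, 199.0]
non_sequence                 # [True, False, False, True, False, True]
grouper                      # array([1, 1, 1, 2, 2, 3])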