Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to split a dataframe based on consecutive index?

I have a DataFrame 'work' with non consecutive index, here is an example:

Index Column1 Column2
4464  10.5    12.7
4465  11.3    12.8
4466  10.3    22.8
5123  11.3    21.8
5124  10.6    22.4
5323  18.6    23.5

I need to extract from this DataFrame new DataFrames containing only rows where the index is consecutive, so in this case my goal is to get

DF_1.index=[4464,4465,4466]
DF_2.index=[5123,5124]
DF_3.index=[5323]

maintaining all the columns.

Can anyone help me?

like image 671
Fabiogio Avatar asked May 22 '19 12:05

Fabiogio


People also ask

How do you split a DataFrame into multiple data frames?

Here, we use the DataFrame. groupby() method for splitting the dataset by rows. The same grouped rows are taken as a single element and stored in a list. This list is the required output which consists of small DataFrames.

How do you split a DataFrame into parts?

Using the iloc() function to split DataFrame in Python We can use the iloc() function to slice DataFrames into smaller DataFrames. The iloc() function allows us to access elements based on the index of rows and columns. Using this function, we can split a DataFrame based on rows or columns.

Can you slice a DataFrame?

Slicing a DataFrame in Pandas includes the following steps:Ensure Python is installed (or install ActivePython) Import a dataset. Create a DataFrame. Slice the DataFrame.


2 Answers

groupby

You can make a perfectly "consecutive" array with

np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

If I were to subtract this from an index that is monotonically increasing, only those index members that were "consecutive" would show up as equal. This is a clever way to establish a key to group by.

list_of_df = [d for _, d in df.groupby(df.index - np.arange(len(df)))]

And print each one to prove it

print(*list_of_df, sep='\n\n')

       Column1  Column2
Index                  
4464      10.5     12.7
4465      11.3     12.8
4466      10.3     22.8

       Column1  Column2
Index                  
5123      11.3     21.8
5124      10.6     22.4

       Column1  Column2
Index                  
5323      18.6     23.5

np.split

You can use np.flatnonzero to identify where the differences are not equal to 1 and avoid using cumsum and groupby

list_of_df = np.split(df, np.flatnonzero(np.diff(df.index) != 1) + 1)

Proof

print(*list_of_df, sep='\n\n')

       Column1  Column2
Index                  
4464      10.5     12.7
4465      11.3     12.8
4466      10.3     22.8

       Column1  Column2
Index                  
5123      11.3     21.8
5124      10.6     22.4

       Column1  Column2
Index                  
5323      18.6     23.5
like image 134
piRSquared Avatar answered Oct 20 '22 09:10

piRSquared


Here is an alternative:

grouper = (~(pd.Series(df.index).diff() == 1)).cumsum().values  
dfs = [dfx for _ , dfx in df.groupby(grouper)]

We use the fact that a continuous difference of 1 equals a sequence (diff == 1).

Full example:

import pandas as pd

data = '''\
Index Column1 Column2
4464  10.5    12.7
4465  11.3    12.8
4466  10.3    22.8
5123  11.3    21.8
5124  10.6    22.4
5323  18.6    23.5
'''

fileobj = pd.compat.StringIO(data)
df = pd.read_csv(fileobj, sep='\s+', index_col='Index')

non_sequence = pd.Series(df.index).diff() != 1
grouper = non_sequence.cumsum().values
dfs = [dfx for _ , dfx in df.groupby(grouper)]

print(dfs[0])

#       Column1  Column2
#Index                  
#4464      10.5     12.7
#4465      11.3     12.8
#4466      10.3     22.8

Another way of seeing it is that we look for non-sequence to groupby, might be more readable:

non_sequence = pd.Series(df.index).diff() != 1
grouper = non_sequence.cumsum().values
dfs = [dfx for _ , dfx in df.groupby(grouper)]
like image 25
Anton vBR Avatar answered Oct 20 '22 07:10

Anton vBR