I have a DataFrame 'work' with a non-consecutive index; here is an example:
Index Column1 Column2
4464 10.5 12.7
4465 11.3 12.8
4466 10.3 22.8
5123 11.3 21.8
5124 10.6 22.4
5323 18.6 23.5
I need to extract from this DataFrame new DataFrames containing only rows whose index values are consecutive; in this case my goal is to get
DF_1.index=[4464,4465,4466]
DF_2.index=[5123,5124]
DF_3.index=[5323]
maintaining all the columns.
Can anyone help me?
groupby
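Assuming the example data is already loaded into a DataFrame named df (a minimal setup sketch so the snippets below can be run as-is):

import numpy as np
import pandas as pd

# Rebuild the example DataFrame from the question
df = pd.DataFrame(
    {'Column1': [10.5, 11.3, 10.3, 11.3, 10.6, 18.6],
     'Column2': [12.7, 12.8, 22.8, 21.8, 22.4, 23.5]},
    index=pd.Index([4464, 4465, 4466, 5123, 5124, 5323], name='Index'))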
You can make a perfectly "consecutive" array with
np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
If we subtract this positional array from a monotonically increasing index, the members of each consecutive run all map to the same value, because the index and the positions increase in lockstep within a run. That constant value is a convenient key to group by.
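For the example index, this key evaluates to the following (a quick sketch; only the grouping values matter, not the exact repr):

key = df.index - np.arange(len(df))
# -> [4464, 4464, 4464, 5120, 5120, 5318]: constant within each consecutive run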
# Rows whose (index - position) value is equal land in the same group
list_of_df = [d for _, d in df.groupby(df.index - np.arange(len(df)))]
And print each one to prove it
print(*list_of_df, sep='\n\n')
Column1 Column2
Index
4464 10.5 12.7
4465 11.3 12.8
4466 10.3 22.8
Column1 Column2
Index
5123 11.3 21.8
5124 10.6 22.4
Column1 Column2
Index
5323 18.6 23.5
np.split
You can use np.flatnonzero to identify the positions where the index differences are not equal to 1, and avoid using cumsum and groupby entirely.
list_of_df = np.split(df, np.flatnonzero(np.diff(df.index) != 1) + 1)
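For the example index, the intermediate values behind that one-liner look like this (a sketch of the split positions):

np.diff(df.index)                           # array([  1,   1, 657,   1, 199])
np.flatnonzero(np.diff(df.index) != 1) + 1  # array([3, 5]): split before rows 3 and 5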
Proof
print(*list_of_df, sep='\n\n')
Column1 Column2
Index
4464 10.5 12.7
4465 11.3 12.8
4466 10.3 22.8
Column1 Column2
Index
5123 11.3 21.8
5124 10.6 22.4
Column1 Column2
Index
5323 18.6 23.5
Here is an alternative:
grouper = (~(pd.Series(df.index).diff() == 1)).cumsum().values
dfs = [dfx for _ , dfx in df.groupby(grouper)]
This relies on the fact that within a consecutive run the index difference is 1 (diff == 1); negating that and taking the cumulative sum assigns a distinct group number to each run.
Full example:
import pandas as pd
from io import StringIO
data = '''\
Index Column1 Column2
4464 10.5 12.7
4465 11.3 12.8
4466 10.3 22.8
5123 11.3 21.8
5124 10.6 22.4
5323 18.6 23.5
'''
fileobj = StringIO(data)
df = pd.read_csv(fileobj, sep=r'\s+', index_col='Index')
non_sequence = pd.Series(df.index).diff() != 1
grouper = non_sequence.cumsum().values
dfs = [dfx for _ , dfx in df.groupby(grouper)]
print(dfs[0])
# Column1 Column2
#Index
#4464 10.5 12.7
#4465 11.3 12.8
#4466 10.3 22.8
Another way of seeing it is that we look for the non-sequence points to group by, which might be more readable:
non_sequence = pd.Series(df.index).diff() != 1
grouper = non_sequence.cumsum().values
dfs = [dfx for _ , dfx in df.groupby(grouper)]
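For the example data, the intermediate values are (a sketch; the grouper simply numbers the consecutive runs):

pd.Series(df.index).diff()   # [NaN, 1.0, 1.0, 657.0, 1.0, 199.0]
non_sequence                 # [True, False, False, True, False, True]
grouper                      # array([1, 1, 1, 2, 2, 3])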