 

Pandas: split dataframe into multiple dataframes when a condition is true

I have a dataframe like `df` below. I want to create a new dataframe for every contiguous chunk of rows where the condition is true, so that it returns df_1, df_2, ..., df_n.

df:

| Value | Condition |
|-------|-----------|
| 2     | True      |
| 5     | True      |
| 4     | True      |
| 4     | False     |
| 2     | False     |
| 0     | True      |
| 5     | True      |
| 7     | False     |
| 8     | False     |
| 9     | False     |

df_1:

| Value |
|-------|
| 2     |
| 5     |
| 4     |

df_2:

| Value |
|-------|
| 0     |
| 5     |

My only idea is to loop through the dataframe, collecting the start and end index of every chunk of True values, then looping over those (start, end) pairs and creating a new dataframe for each:

newdf = df.iloc[start:end]

But doing that seems inefficient.
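For reference, the loop-based approach described above might look like the sketch below (the frame construction is my reconstruction of the sample `df` from the tables):

```python
import pandas as pd

df = pd.DataFrame({
    "Value": [2, 5, 4, 4, 2, 0, 5, 7, 8, 9],
    "Condition": [True, True, True, False, False, True, True, False, False, False],
})

# Collect (start, end) positional pairs for each run of True values.
runs = []
start = None
for i, cond in enumerate(df["Condition"]):
    if cond and start is None:
        start = i                    # a run of True begins here
    elif not cond and start is not None:
        runs.append((start, i))      # the run ended at the previous row
        start = None
if start is not None:                # handle a run reaching the last row
    runs.append((start, len(df)))

dfs = [df.iloc[s:e] for s, e in runs]
```

This works, but as the question notes it is an explicit Python loop; the answers below replace it with vectorized grouping.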

asked Feb 13 '18 by S.B.G

3 Answers

This is an alternative solution. Note that `consecutive_groups` is a recipe from the `more_itertools` library; a standalone version is reproduced here.

from itertools import groupby
from operator import itemgetter

def consecutive_groups(iterable, ordering=lambda x: x):
    # Consecutive values keep a constant (position - value) difference,
    # so groupby splits the iterable at every gap.
    for k, g in groupby(enumerate(iterable), key=lambda x: x[0] - ordering(x[1])):
        yield map(itemgetter(1), g)

# Runs of consecutive index labels where Condition is True
grps = consecutive_groups(df[df.Condition].index)

# loc, not iloc: the groups hold index labels, not positions
dfs = {i: df.loc[list(j)] for i, j in enumerate(grps, 1)}

# {1:    Value Condition
# 0      2      True
# 1      5      True
# 2      4      True,
# 2:    Value Condition
# 5      0      True
# 6      5      True}
answered Oct 04 '22 by jpp


Create a dictionary of DataFrames by grouping on a Series built from the cumsum of the inverted boolean column, with `where` inserting NaN for rows that belong to no group (groupby drops NaN keys):

g = (~df['Condition']).cumsum().where(df['Condition'])
print (g)
0    0.0
1    0.0
2    0.0
3    NaN
4    NaN
5    2.0
6    2.0
7    NaN
8    NaN
9    NaN
Name: Condition, dtype: float64

# enumerate to number the groups 1, 2, ..., N
dfs = {i+1:v for i, (k, v) in enumerate(df[['Value']].groupby(g))}
print (dfs)
{1:    Value
0      2
1      5
2      4, 2:    Value
5      0
6      5}

print (dfs[1])
   Value
0      2
1      5
2      4

print (dfs[2])
   Value
5      0
6      5
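A closely related idiom, for readers who prefer it, labels every run with the standard shift/ne/cumsum trick and then splits only the True rows (a sketch on the same sample frame, reconstructed from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "Value": [2, 5, 4, 4, 2, 0, 5, 7, 8, 9],
    "Condition": [True, True, True, False, False, True, True, False, False, False],
})

# Every time Condition changes value, start a new integer run id.
run_id = df["Condition"].ne(df["Condition"].shift()).cumsum()

# Keep only the True rows, then split by run id (aligned on the index).
dfs = {i + 1: v[["Value"]]
       for i, (k, v) in enumerate(df[df["Condition"]].groupby(run_id))}
```

This produces the same `dfs[1]` and `dfs[2]` as above.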
answered Oct 04 '22 by jezrael


I decided to provide an answer that places each chunk of 'Value' into its own column.

m = df.Condition.values          # boolean mask of the True rows
g = (~m).cumsum()                # run label: constant within each True run
d = df.loc[m, 'Value']           # values from the True rows
g = g[m]                         # labels aligned to those rows
c = d.groupby(g).cumcount()      # position of each row within its run
# build the (position, run-label) MultiIndex explicitly, then pivot runs into columns
# (set_axis no longer accepts inplace= on current pandas)
d.set_axis(pd.MultiIndex.from_arrays([c, g])).unstack()

     0    2
0  2.0  0.0
1  5.0  5.0
2  4.0  NaN
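A self-contained version of these steps, runnable on current pandas (the sample frame is reconstructed from the question; `pd.MultiIndex.from_arrays` stands in for the older list-of-arrays `set_axis(..., inplace=False)` call):

```python
import pandas as pd

df = pd.DataFrame({
    "Value": [2, 5, 4, 4, 2, 0, 5, 7, 8, 9],
    "Condition": [True, True, True, False, False, True, True, False, False, False],
})

m = df.Condition.values          # boolean mask of the True rows
g = (~m).cumsum()                # run label: constant within each True run
d = df.loc[m, "Value"]           # values from the True rows
g = g[m]                         # labels aligned to those rows
c = d.groupby(g).cumcount()      # position of each row within its run

# Index by (position, run label) and pivot the run labels into columns.
out = d.set_axis(pd.MultiIndex.from_arrays([c, g])).unstack()
```

Shorter runs are padded with NaN, which is why the result dtype becomes float.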
answered Oct 04 '22 by piRSquared