
Group pandas dataframe in unusual way

Problem

I have the following Pandas dataframe:

    import pandas

    data = {
        'ID':  [100, 100, 100, 100, 200, 200, 200, 200, 200, 300, 300, 300, 300, 300],
        'value': [False, False, True, False, False, True, True, True, False, False, False, True, True, False],
    }
    df = pandas.DataFrame(data, columns=['ID', 'value'])

I want to get the following groups:

  • Group 1: for each ID, all False rows until the first True row of that ID
  • Group 2: for each ID, all False rows after the last True row of that ID
  • Group 3: all True rows

[image: the three expected groups, shown row by row]

Can this be done with pandas?
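Stated procedurally, the rule per ID is: Falses before the first True go to group 1, Falses after the last True go to group 2, and Trues go to group 3. A plain-Python sketch of the intended labels (not a pandas solution; `label` is a hypothetical helper, and it assumes every ID contains at least one True, as in the sample data):

```python
def label(values):
    # Position of the first and last True in this ID's values.
    first = values.index(True)
    last = len(values) - 1 - values[::-1].index(True)
    # Trues -> 3; Falses before the first True -> 1; the rest -> 2.
    return [3 if v else (1 if i < first else 2)
            for i, v in enumerate(values)]

print(label([False, False, True, False]))  # [1, 1, 3, 2]
```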

What I've tried

I've tried

group = df.groupby((df['value'].shift() != df['value']).cumsum())

but this returns an incorrect result.
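The key built that way restarts on every change of `value` across the whole frame, ignoring `ID`, so it produces one group per run of equal values rather than the three requested groups. A quick check, reusing the question's data:

```python
import pandas as pd

data = {
    'ID':  [100, 100, 100, 100, 200, 200, 200, 200, 200, 300, 300, 300, 300, 300],
    'value': [False, False, True, False, False, True, True, True, False, False, False, True, True, False],
}
df = pd.DataFrame(data)

# Global run-length key: increments on every change of 'value',
# regardless of ID boundaries.
key = (df['value'].shift() != df['value']).cumsum()
print(key.nunique())  # 7 run-length groups, not the 3 wanted
```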

Ford1892 asked Sep 29 '20 14:09



3 Answers

Let us try shift + cumsum to create the groupby key. (BTW, I really like the way you display your expected output.)

s = df.groupby('ID')['value'].apply(lambda x : x.ne(x.shift()).cumsum())
d = {x : y for x ,y in df.groupby(s)}
d[2]
     ID  value
2   100   True
5   200   True
6   200   True
7   200   True
11  300   True
12  300   True
d[1]
     ID  value
0   100  False
1   100  False
4   200  False
9   300  False
10  300  False
d[3]
     ID  value
3   100  False
8   200  False
13  300  False
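One caveat worth noting: the per-ID keys 1/2/3 line up with "leading Falses / Trues / trailing Falses" only because every ID in the sample starts with False and has a single contiguous block of Trues. A minimal sketch of what happens otherwise:

```python
import pandas as pd

# An ID whose values start with True: its True block gets key 1,
# not key 2, so the groups would no longer align across IDs.
edge = pd.Series([True, True, False], name='value')
print(edge.ne(edge.shift()).cumsum().tolist())  # [1, 1, 2]
```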
BENY answered Oct 11 '22 11:10


Let's try following your logic:

# 1. all False up to first True
group1 = df.loc[df.groupby('ID')['value'].cumsum() == 0]

# 2. all False after last True
group2 = df.loc[df.iloc[::-1].groupby('ID')['value'].cumsum()==0]

# 3. all True
group3 = df[df['value']]

Output:

    ID      value
0   100     False
1   100     False
4   200     False
9   300     False
10  300     False

    ID      value
3   100     False
8   200     False
13  300     False

    ID      value
2   100     True
5   200     True
6   200     True
7   200     True
11  300     True
12  300     True
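The trick behind groups 1 and 2 can be seen on a single ID's values: a cumulative sum of booleans stays at 0 until the first True appears, and reversing the series applies the same test from the tail end. A small sketch:

```python
import pandas as pd

s = pd.Series([False, False, True, False])  # one ID's values

# cumsum == 0 marks exactly the Falses before the first True.
print(s.cumsum().tolist())        # [0, 0, 1, 1]

# Reversed, the same test marks the Falses after the last True.
print(s[::-1].cumsum().tolist())  # [0, 1, 1, 1]
```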
Quang Hoang answered Oct 11 '22 12:10


This works for your example data:

df['groups'] = df.groupby('ID')['value'].apply(lambda x: x.diff().ne(False).cumsum()).astype('int')
for _, df_groups in df.groupby('groups'):
    print(df_groups)
    print('-' * 20)

Out:

     ID  value  groups
0   100  False       1
1   100  False       1
4   200  False       1
9   300  False       1
10  300  False       1
--------------------
     ID  value  groups
2   100   True       2
5   200   True       2
6   200   True       2
7   200   True       2
11  300   True       2
12  300   True       2
--------------------
     ID  value  groups
3   100  False       3
8   200  False       3
13  300  False       3
--------------------
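A note on the key: on a boolean series, `diff()` acts as an XOR with the previous element (NaN for the first), so `.diff().ne(False)` flags every value change, equivalent to the `x.ne(x.shift())` used in the other answer. A quick check:

```python
import pandas as pd

x = pd.Series([False, True, True, False])

# Both expressions mark the positions where the value changes
# (the first position always counts as a change).
print(x.diff().ne(False).tolist())  # [True, True, False, True]
print(x.ne(x.shift()).tolist())     # [True, True, False, True]
```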
Michael Szczesny answered Oct 11 '22 11:10