Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas Dataframe split in to sessions

This is an extension to my question.

To make it simpler Lets suppose I have a pandas dataframe as following.

df = pd.DataFrame([[1.1, 1.1, 2.5, 2.6, 2.5, 3.4,2.6,2.6,3.4], list('AAABBBBAB'), [1.1, 1.7, 2.5, 2.6, 3.3, 3.8,4.0,4.2,4.3]]).T
df.columns = ['col1', 'col2','col3']

dataframe :

  col1 col2 col3
0  1.1    A  1.1
1  1.1    A  1.7
2  2.5    A  2.5
3  2.6    B  2.6
4  2.5    B  3.3
5  3.4    B  3.8
6  2.6    B    4
7  2.6    A  4.2
8  3.4    B  4.3

I want to group this based on some conditions. The logic is based on col1 col2 values and the cumulative difference of col3:

  1. Go to col1 and find other occurrences of the same value.
  2. In my case first value of col1 is '1.1' and again their is the same value at row2.
  3. Then check for col2 value, If they are similar, then get the cumulative difference of col 3.
  4. If the cumulative difference is greater than 0.5 then mark this as a new session.
  5. If col1 values are same but col2 values are different then mark them as new session

expected output:

   col1 col2 col3 session
0  1.1    A  1.1  0
1  1.1    A  1.7  1
2  2.5    A  2.5  2
3  2.6    B  2.6  4
4  2.5    B  3.3  3
5  3.4    B  3.8  7
6  2.6    B    4  5
7  2.6    A  4.2  6
8  3.4    B  4.3  7
like image 448
Nilani Algiriyage Avatar asked Jul 10 '13 10:07

Nilani Algiriyage


1 Answers

As in the excellent answer you linked to ;) first create the session number:

In [11]: g = df.groupby(['col1', 'col2'])

In [12]: df['session_number'] = g['col3'].apply(lambda s: (s - s.shift(1) > 0.5).fillna(0).cumsum(skipna=False))

Then I think you want to set_index of these columns, this may be enough for many usecases (though it might be worth doing a sort):

In [13]: df1 = df.set_index(['col1', 'col2', 'session_number'])

In [14]: df1
Out[14]:
                         col3
col1 col2 session_number
1.1  A    0               1.1
          1               1.7
2.5  A    0               2.5
2.6  B    0               2.6
2.5  B    0               3.3
3.4  B    0               3.8
2.6  B    1                 4
     A    0               4.2
3.4  B    0               4.3

If you really want you can grab out the session number :

In [15]: g1 = df.groupby(['col1', 'col2', 'session_number'])  # I think there is a slightly neater way, but I forget..

In [16]: df1['session'] = g1.apply(lambda x: 1).cumsum()  # could -1 here if it matters

In [17]: df1
Out[17]:
                         col3  session
col1 col2 session_number
1.1  A    0               1.1        1
          1               1.7        2
2.5  A    0               2.5        3
2.6  B    0               2.6        6
2.5  B    0               3.3        4
3.4  B    0               3.8        8
2.6  B    1                 4        7
     A    0               4.2        5
3.4  B    0               4.3        8

If you want this in columns (as in your question) the reset_index and you could delete the session column:

In [18]: df1.reset_index()
Out[18]:
   col1 col2  session_number col3  session
0   1.1    A               0  1.1        1
1   1.1    A               1  1.7        2
2   2.5    A               0  2.5        3
3   2.6    B               0  2.6        6
4   2.5    B               0  3.3        4
5   3.4    B               0  3.8        8
6   2.6    B               1    4        7
7   2.6    A               0  4.2        5
8   3.4    B               0  4.3        8
like image 190
Andy Hayden Avatar answered Sep 22 '22 10:09

Andy Hayden