Fast looping through Python dataframe with previous row reference

Question

Suppose I have a pandas dataframe with two columns: ID and Days. DataFrame is sorted by ascending order in both variables. For example:

# Initial dataset
data = pd.DataFrame({'id': np.repeat([1, 2 ,3], 4),
                 'day': [1, 2, 10, 11, 3, 4, 12, 15, 1, 20, 21, 24]})

    id  day
0   1   1
1   1   2
2   1   10
3   1   11
4   2   3
5   2   4
6   2   12
7   2   15
8   3   1
9   3   20
10  3   21
11  3   24

I want to add a third column, which would give a "session" number for every ID*day. By "session" i mean a sequence of days with difference less than 2 days between days of one session. For example, sequence 5,6,7 will be considered as one session, while 5,6,9 will be considered as two sessions and should be marked as 0, 0, 1, i.e. day 5 and 6 were refered toi session #0, while day 9 was refered to session #1. Session numbers shuould start from 0 for each new ID.

In other words, what i want to get is:

    id  day session
0   1   1   0
1   1   2   0
2   1   10  1
3   1   11  1
4   2   3   0
5   2   4   0
6   2   12  1
7   2   15  2  
8   3   1   0
9   3   20  1
10  3   21  1
11  3   24  2

To solve this task I use basic for loop. In this loop i iteratevely run through all unique IDs, then subset a block of data from initial dataset and assign session numbers for every day of a specific ID. The problem i have - since the initial dataset is millions of lines - loop takes a looot of time! For example, for 1 mln of lines my loop spends about a minute, which is too much.

How to improve the speed? Any method is good! If you know how to achieve desired result, for instance, with some numpy matrix manipulations which will reduce time - also good ...

My code for the loop:

# Get sessions for every id
sessions = []
for i in data.id.unique():
    id_data = data['day'][data['id']==i].reset_index(drop=True)
    for ind in id_data.index:
        if ind == 0:
            temp = [0]
        elif ((id_data[ind] - id_data[ind - 1]) < 2):
            temp.append(temp[ind - 1])
        else:
            temp.append(temp[ind - 1] + 1)
    sessions.extend(temp)

# Add sessions to the table
data['session'] = sessions

BENY · Accepted Answer

You can sum Boolean

data.groupby('id').day.apply(lambda x : x.diff().gt(1).cumsum())
Out[614]: 
0     0
1     0
2     1
3     1
4     0
5     0
6     1
7     2
8     0
9     1
10    1
11    2
Name: day, dtype: int32

Fast looping through Python dataframe with previous row reference

Tags:

python

loops

pandas

I am Nik

1 Answers

BENY

Recent Activity

Donate For Us

Fast looping through Python dataframe with previous row reference

Tags:

python

loops

pandas

I am Nik

1 Answers

BENY

Related questions

Recent Activity

Donate For Us