Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast looping through Python dataframe with previous row reference

Suppose I have a pandas dataframe with two columns: ID and Days. DataFrame is sorted by ascending order in both variables. For example:

# Initial dataset
data = pd.DataFrame({'id': np.repeat([1, 2 ,3], 4),
                 'day': [1, 2, 10, 11, 3, 4, 12, 15, 1, 20, 21, 24]})

    id  day
0   1   1
1   1   2
2   1   10
3   1   11
4   2   3
5   2   4
6   2   12
7   2   15
8   3   1
9   3   20
10  3   21
11  3   24

I want to add a third column, which would give a "session" number for every ID*day. By "session" i mean a sequence of days with difference less than 2 days between days of one session. For example, sequence 5,6,7 will be considered as one session, while 5,6,9 will be considered as two sessions and should be marked as 0, 0, 1, i.e. day 5 and 6 were refered toi session #0, while day 9 was refered to session #1. Session numbers shuould start from 0 for each new ID.

In other words, what i want to get is:

    id  day session
0   1   1   0
1   1   2   0
2   1   10  1
3   1   11  1
4   2   3   0
5   2   4   0
6   2   12  1
7   2   15  2  
8   3   1   0
9   3   20  1
10  3   21  1
11  3   24  2

To solve this task I use basic for loop. In this loop i iteratevely run through all unique IDs, then subset a block of data from initial dataset and assign session numbers for every day of a specific ID. The problem i have - since the initial dataset is millions of lines - loop takes a looot of time! For example, for 1 mln of lines my loop spends about a minute, which is too much.

How to improve the speed? Any method is good! If you know how to achieve desired result, for instance, with some numpy matrix manipulations which will reduce time - also good ...

My code for the loop:

# Get sessions for every id
sessions = []
for i in data.id.unique():
    id_data = data['day'][data['id']==i].reset_index(drop=True)
    for ind in id_data.index:
        if ind == 0:
            temp = [0]
        elif ((id_data[ind] - id_data[ind - 1]) < 2):
            temp.append(temp[ind - 1])
        else:
            temp.append(temp[ind - 1] + 1)
    sessions.extend(temp)

# Add sessions to the table
data['session'] = sessions 
like image 272
I am Nik Avatar asked Oct 01 '18 21:10

I am Nik


1 Answers

You can sum Boolean

data.groupby('id').day.apply(lambda x : x.diff().gt(1).cumsum())
Out[614]: 
0     0
1     0
2     1
3     1
4     0
5     0
6     1
7     2
8     0
9     1
10    1
11    2
Name: day, dtype: int32
like image 69
BENY Avatar answered Oct 17 '22 05:10

BENY