Suppose I have a pandas dataframe with two columns: ID and Days. DataFrame is sorted by ascending order in both variables. For example:
# Initial dataset
data = pd.DataFrame({'id': np.repeat([1, 2 ,3], 4),
'day': [1, 2, 10, 11, 3, 4, 12, 15, 1, 20, 21, 24]})
id day
0 1 1
1 1 2
2 1 10
3 1 11
4 2 3
5 2 4
6 2 12
7 2 15
8 3 1
9 3 20
10 3 21
11 3 24
I want to add a third column, which would give a "session" number for every ID*day. By "session" i mean a sequence of days with difference less than 2 days between days of one session. For example, sequence 5,6,7
will be considered as one session, while 5,6,9
will be considered as two sessions and should be marked as 0, 0, 1
, i.e. day 5 and 6 were refered toi session #0, while day 9 was refered to session #1.
Session numbers shuould start from 0
for each new ID.
In other words, what i want to get is:
id day session
0 1 1 0
1 1 2 0
2 1 10 1
3 1 11 1
4 2 3 0
5 2 4 0
6 2 12 1
7 2 15 2
8 3 1 0
9 3 20 1
10 3 21 1
11 3 24 2
To solve this task I use basic for loop. In this loop i iteratevely run through all unique IDs, then subset a block of data from initial dataset and assign session numbers for every day of a specific ID. The problem i have - since the initial dataset is millions of lines - loop takes a looot of time! For example, for 1 mln of lines my loop spends about a minute, which is too much.
How to improve the speed? Any method is good! If you know how to achieve desired result, for instance, with some numpy matrix manipulations which will reduce time - also good ...
My code for the loop:
# Get sessions for every id
sessions = []
for i in data.id.unique():
id_data = data['day'][data['id']==i].reset_index(drop=True)
for ind in id_data.index:
if ind == 0:
temp = [0]
elif ((id_data[ind] - id_data[ind - 1]) < 2):
temp.append(temp[ind - 1])
else:
temp.append(temp[ind - 1] + 1)
sessions.extend(temp)
# Add sessions to the table
data['session'] = sessions
You can sum Boolean
data.groupby('id').day.apply(lambda x : x.diff().gt(1).cumsum())
Out[614]:
0 0
1 0
2 1
3 1
4 0
5 0
6 1
7 2
8 0
9 1
10 1
11 2
Name: day, dtype: int32
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With