I'm looking to remove rows from a large Pandas DataFrame that contains analytics data based on actions/events users have done on a website. All streams of user actions begin with the start
event, and finish with the end
event. I want to find all users who have done a particular event (such as signed up
- index 13 in the example dataframe) and remove all events after that event until (and including) the end
event. So in this example, viewed blog post
, page view
, visited site
, ad campaign hit
, viewed blog post
, visited site
, page view
, and end
would have to be removed from the dataframe.
In [26]: data
Out[26]:
event user
0 start user1
1 visited blog user1
2 page view user1
3 visited blog user1
4 viewed blog post user1
5 ad campaign hit user1
6 page view user1
7 visited site user1
8 visited blog user1
9 viewed blog post user1
10 visited site user1
11 page view user1
12 signed up user1
13 viewed blog post user1
14 page view user1
15 visited site user1
16 ad campaign hit user1
17 viewed blog post user1
18 visited site user1
19 page view user1
20 end user1
I've tried to do this in a number of ways - using np.where()
to identify the correct rows or
removal_starts_at = data[(data.user == 'user1') & (data.event == 'signed up')]
removal_ends_at = data[(data.user == 'user1') & (data.event == 'end')]
data[data.user == 'user1'].drop(data.index[removal_start_at+1:removal_ends_at+1], inplace=True)
However, this is really slow! It's takes ~20 seconds per user. I have 1000s of users, so this is not efficient. I'd like something that does it in a much quicker fashion if possible.
Another issue I discovered as I was writing up this question:
If I don't include [data.user == 'user1']
to subset the dataframe, it goes crazy and takes up all the memory on the computer. If I do include it, it doesn't actually do the subsetting and gives me a warning about SettingWithCopy
.
I'm relatively new to Pandas, so it's quite likely that there's a simpler way to do it and that I'm just doing it entirely incorrectly. Ideas I've thought about are using MultiIndex
to find the combination of user & event directly or perhaps subsetting in a more efficient way?
If I'm understanding correctly, the idea is you have lots of users in one dataframe. So I've expanded it to have 2 users. If that's right, then something like this ought to be pretty fast:
df['keep'] = np.where( df['event'] == 'start', 1, np.nan )
df['keep'] = np.where( df['event'].shift() == 'signed up', 0, df['keep'] )
df['keep'] = df['keep'].ffill()
event user keep
0 start user1 1
1 visited blog user1 1
2 page view user1 1
3 signed up user1 1
4 viewed blog post user1 0
5 page view user1 0
6 end user1 0
7 start user2 1
8 visited blog user2 1
9 signed up user2 1
10 viewed blog post user2 0
11 end user2 0
df[df['keep']==1]
event user keep
0 start user1 1
1 visited blog user1 1
2 page view user1 1
3 signed up user1 1
7 start user2 1
8 visited blog user2 1
9 signed up user2 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With