Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast, efficient way to remove rows from large Pandas DataFrame

Tags:

python

pandas

I'm looking to remove rows from a large Pandas DataFrame that contains analytics data based on actions/events users have done on a website. All streams of user actions begin with the start event, and finish with the end event. I want to find all users who have done a particular event (such as signed up - index 13 in the example dataframe) and remove all events after that event until (and including) the end event. So in this example, viewed blog post, page view, visited site, ad campaign hit, viewed blog post, visited site, page view, and end would have to be removed from the dataframe.

In [26]: data
Out[26]: 
    event            user
0   start            user1
1   visited blog     user1
2   page view        user1
3   visited blog     user1
4   viewed blog post user1
5   ad campaign hit  user1
6   page view        user1
7   visited site     user1
8   visited blog     user1
9   viewed blog post user1
10  visited site     user1
11  page view        user1
12  signed up        user1
13  viewed blog post user1
14  page view        user1
15  visited site     user1
16  ad campaign hit  user1
17  viewed blog post user1
18  visited site     user1
19  page view        user1
20  end              user1

I've tried to do this in a number of ways - using np.where() to identify the correct rows or

removal_starts_at = data[(data.user == 'user1') & (data.event == 'signed up')]
removal_ends_at = data[(data.user == 'user1') & (data.event == 'end')]
data[data.user == 'user1'].drop(data.index[removal_start_at+1:removal_ends_at+1], inplace=True)

However, this is really slow! It's takes ~20 seconds per user. I have 1000s of users, so this is not efficient. I'd like something that does it in a much quicker fashion if possible.

Another issue I discovered as I was writing up this question: If I don't include [data.user == 'user1'] to subset the dataframe, it goes crazy and takes up all the memory on the computer. If I do include it, it doesn't actually do the subsetting and gives me a warning about SettingWithCopy.

I'm relatively new to Pandas, so it's quite likely that there's a simpler way to do it and that I'm just doing it entirely incorrectly. Ideas I've thought about are using MultiIndex to find the combination of user & event directly or perhaps subsetting in a more efficient way?

like image 331
Navaneethan Santhanam Avatar asked Feb 10 '23 03:02

Navaneethan Santhanam


1 Answers

If I'm understanding correctly, the idea is you have lots of users in one dataframe. So I've expanded it to have 2 users. If that's right, then something like this ought to be pretty fast:

df['keep'] = np.where( df['event'] == 'start', 1, np.nan )
df['keep'] = np.where( df['event'].shift() == 'signed up', 0, df['keep'] )
df['keep'] = df['keep'].ffill()

               event   user  keep
0              start  user1     1
1       visited blog  user1     1
2          page view  user1     1
3          signed up  user1     1
4   viewed blog post  user1     0
5          page view  user1     0
6                end  user1     0
7              start  user2     1
8       visited blog  user2     1
9          signed up  user2     1
10  viewed blog post  user2     0
11               end  user2     0

df[df['keep']==1]

          event   user  keep
0         start  user1     1
1  visited blog  user1     1
2     page view  user1     1
3     signed up  user1     1
7         start  user2     1
8  visited blog  user2     1
9     signed up  user2     1
like image 162
JohnE Avatar answered Feb 12 '23 16:02

JohnE