
Fastest way to drop rows / get subset with difference from large DataFrame in Pandas

Question

I'm looking for the fastest way to drop a set of rows whose indices I have, or equivalently to select the subset given by the difference of those indices (both yield the same result), from a large Pandas DataFrame.

So far I have two solutions, which seem relatively slow to me:

  1. df.loc[df.index.difference(indices)]

    which takes ~115 sec on my dataset

  2. df.drop(indices)

    which takes ~215 sec on my dataset

Is there a faster way to do this? Preferably in Pandas.

Performance of proposed solutions

  • ~41 sec: df[~df.index.isin(indices)] by @jezrael
asked Nov 20 '18 by Ichixgo

2 Answers

I believe you can create a boolean mask, invert it with ~, and filter by boolean indexing:

df1 = df[~df.index.isin(indices)]

As @user3471881 mentioned, to avoid chained indexing when you plan to manipulate the filtered df later, it is necessary to add copy:

df1 = df[~df.index.isin(indices)].copy()

The performance of this filtering depends on the number of matched indices and on the length of the DataFrame.

So another possible solution is to create an array/list of indices to keep; then inverting is not necessary:

df1 = df[df.index.isin(need_indices)]
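A minimal runnable sketch of both variants above (the toy data and the `indices` values are made up for illustration; `need_indices` is built here with `Index.difference`):

```python
import numpy as np
import pandas as pd

# toy data: 10 rows, drop every third index label
df = pd.DataFrame({"a": np.arange(10)})
indices = pd.Index([0, 3, 6, 9])

# variant 1: invert a membership mask; .copy() avoids chained-indexing warnings
df1 = df[~df.index.isin(indices)].copy()

# variant 2: build the keep-list up front, so no inversion is needed
need_indices = df.index.difference(indices)
df2 = df[df.index.isin(need_indices)]
```

Both variants select the same rows, so `df1.equals(df2)` holds.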
answered Oct 01 '22 by jezrael

Using iloc (or loc, see below) and Index.drop:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 1000000, 1))
indices = np.arange(0, 1000000, 3)

%timeit -n 100 df[~df.index.isin(indices)]
%timeit -n 100 df.iloc[df.index.drop(indices)]

41.3 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
32.7 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

As @jezrael points out, you can only use iloc if the index is a RangeIndex; otherwise you will have to use loc. But even loc is still faster than df[~df.index.isin(indices)] (see why below).
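A small sketch of why that restriction exists (toy data, made up for illustration): `Index.drop` returns the remaining *labels*, and `iloc` treats its argument as *positions*, so the two only coincide when the index is a default RangeIndex.

```python
import numpy as np
import pandas as pd

# default RangeIndex: labels coincide with positions, so iloc on
# the dropped *labels* happens to select the right rows
df = pd.DataFrame({"a": np.arange(6)})
kept = df.index.drop([0, 3])                 # labels [1, 2, 4, 5]
same = df.iloc[kept].equals(df.loc[kept])    # identical selections here

# non-default index: labels no longer match positions, so only loc is correct
df2 = pd.DataFrame({"a": np.arange(6)}, index=[10, 11, 12, 13, 14, 15])
kept2 = df2.index.drop([10, 13])             # labels [11, 12, 14, 15]
out = df2.loc[kept2]                         # df2.iloc[kept2] would raise IndexError
```

With the shifted index, `df2.iloc[kept2]` would look for positions 11-15 in a 6-row frame and fail, while `loc` resolves the labels correctly.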

All three options on 10 million rows:

df = pd.DataFrame(np.arange(0, 10000000, 1))
indices = np.arange(0, 10000000, 3)

%timeit -n 10 df[~df.index.isin(indices)]
%timeit -n 10 df.iloc[df.index.drop(indices)]
%timeit -n 10 df.loc[df.index.drop(indices)]

4.98 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
752 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.65 s ± 69.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Why does the "super slow" loc outperform boolean indexing?

Well, the short answer is that it doesn't: df.index.drop(indices) is just a lot faster than ~df.index.isin(indices) (given the above data with 10 million rows):

%timeit -n 10 ~df.index.isin(indices)
%timeit -n 10 df.index.drop(indices)

4.55 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
388 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

We can see this by comparing the performance of boolean indexing vs. iloc vs. loc once the mask and the dropped index are precomputed:

boolean_mask = ~df.index.isin(indices)
dropped_index = df.index.drop(indices)

%timeit -n 10 df[boolean_mask]
%timeit -n 10 df.iloc[dropped_index]
%timeit -n 10 df.loc[dropped_index]


489 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
371 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.38 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
answered Oct 01 '22 by user3471881