I want to compare the performance of two different methods for filtering pandas DataFrames. So I created a test set with n points in the plane and I filter out all points which are not in the unit square. I am surprised that one method is so much faster than the other, and the larger n becomes, the bigger the difference gets. What would be the explanation for that?
This is my script:
import numpy as np
import time
import pandas as pd
# Test set with points
n = 100000
test_x_points = np.random.uniform(-10, 10, size=n)
test_y_points = np.random.uniform(-10, 10, size=n)
test_points = list(zip(test_x_points, test_y_points))  # list() so it can be reused below
df = pd.DataFrame(test_points, columns=['x', 'y'])
# Method a
start_time = time.time()
result_a = df[(df['x'] < 1) & (df['x'] > -1) & (df['y'] < 1) & (df['y'] > -1)]
end_time = time.time()
elapsed_time_a = 1000 * abs(end_time - start_time)
# Method b
start_time = time.time()
result_b = df[df.apply(lambda row: -1 < row['x'] < 1 and -1 < row['y'] < 1, axis=1)]
end_time = time.time()
elapsed_time_b = 1000 * abs(end_time - start_time)
# print results
print('For {0} points.'.format(n))
print('Method a took {0} ms and leaves us with {1} elements.'.format(elapsed_time_a, len(result_a)))
print('Method b took {0} ms and leaves us with {1} elements.'.format(elapsed_time_b, len(result_b)))
print('Method a is {0} X faster than method b.'.format(elapsed_time_b / elapsed_time_a))
Results for different values of n:
For 10 points.
Method a took 1.52087211609 ms and leaves us with 0 elements.
Method b took 0.456809997559 ms and leaves us with 0 elements.
Method a is 0.300360558081 X faster than method b.
For 100 points.
Method a took 1.55997276306 ms and leaves us with 1 elements.
Method b took 1.384973526 ms and leaves us with 1 elements.
Method a is 0.887819043252 X faster than method b.
For 1000 points.
Method a took 1.61004066467 ms and leaves us with 5 elements.
Method b took 10.448217392 ms and leaves us with 5 elements.
Method a is 6.48941211313 X faster than method b.
For 10000 points.
Method a took 1.59096717834 ms and leaves us with 115 elements.
Method b took 98.8278388977 ms and leaves us with 115 elements.
Method a is 62.1180878166 X faster than method b.
For 100000 points.
Method a took 2.14099884033 ms and leaves us with 1052 elements.
Method b took 995.483875275 ms and leaves us with 1052 elements.
Method a is 464.962360802 X faster than method b.
For 1000000 points.
Method a took 7.07101821899 ms and leaves us with 10045 elements.
Method b took 9613.26599121 ms and leaves us with 10045 elements.
Method a is 1359.5306494 X faster than method b.
When I compare it to a native Python list comprehension, method a is still much faster:
result_c = [(x, y) for (x, y) in test_points if -1 < x < 1 and -1 < y < 1]
Why is that?
If you follow the Pandas source code for apply, you will see that in general it ends up doing a plain Python for __ in __ loop.
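To make that concrete, here is a rough sketch of what df.apply(func, axis=1) boils down to (apply_rows_sketch is a hypothetical helper, not Pandas' actual implementation):

import pandas as pd

def apply_rows_sketch(df, func):
    # Sketch of axis=1 apply: a Python-level loop that materializes a
    # Series for every row and makes one Python function call per row.
    results = []
    for _, row in df.iterrows():
        results.append(func(row))
    return pd.Series(results, index=df.index)

# Roughly equivalent to method b above:
# mask = apply_rows_sketch(df, lambda row: -1 < row['x'] < 1 and -1 < row['y'] < 1)
# result_b = df[mask]

Every one of those n iterations pays interpreter overhead: building the row object, dispatching a Python function call, and boxing each value as a Python object.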
However, Pandas DataFrames are made up of Pandas Series, which under the hood are backed by NumPy arrays. Masked filtering, as in method a, uses the fast, vectorized operations that NumPy arrays allow. For info on why this is faster than plain Python loops (as in .apply), see Why are NumPy arrays so fast?
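To show what "vectorized" means here, this is a minimal sketch of the NumPy operations that method a's mask amounts to (a sketch of the equivalent calls, not Pandas internals):

import numpy as np

x = df['x'].to_numpy()  # the NumPy array backing the 'x' column
y = df['y'].to_numpy()
# Each comparison is a single C-level pass over the whole array and
# yields a boolean array; & combines the masks element-wise, also in C.
mask = (x < 1) & (x > -1) & (y < 1) & (y > -1)
result = df[mask]

The loop still happens, but it runs in compiled C code over contiguous memory instead of in the Python interpreter.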
The top answer from that question:
Numpy arrays are densely packed arrays of homogeneous type. Python lists, by contrast, are arrays of pointers to objects, even when all of them are of the same type. So, you get the benefits of locality of reference.
Also, many Numpy operations are implemented in C, avoiding the general cost of loops in Python, pointer indirection and per-element dynamic type checking. The speed boost depends on which operations you're performing, but a few orders of magnitude isn't uncommon in number crunching programs.
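As an unscientific spot check of the "orders of magnitude" claim (timings will of course vary by machine), you can compare a vectorized comparison against the equivalent Python-level loop:

import timeit
import numpy as np

a = np.random.uniform(-10, 10, size=100000)

vectorized = timeit.timeit(lambda: (a > -1) & (a < 1), number=100)
python_loop = timeit.timeit(lambda: [(-1 < v < 1) for v in a], number=100)
print('vectorized: {0:.4f} s, Python loop: {1:.4f} s'.format(vectorized, python_loop))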