I want to compare the performance of two different methods for filtering pandas DataFrames. So I created a test set with n points in the plane and I filter out all points which are not in the unit square. I am surprised that one method is so much faster than the other, and the larger n becomes, the bigger the difference gets. What would be the explanation for that?
This is my script:
import numpy as np
import time
import pandas as pd
# Test set with points
n = 100000
test_x_points = np.random.uniform(-10, 10, size=n)
test_y_points = np.random.uniform(-10, 10, size=n)
test_points = list(zip(test_x_points, test_y_points))  # list() so it can be reused below
df = pd.DataFrame(test_points, columns=['x', 'y'])
# Method a
start_time = time.time()
result_a = df[(df['x'] < 1) & (df['x'] > -1) & (df['y'] < 1) & (df['y'] > -1)]
end_time = time.time()
elapsed_time_a = 1000 * abs(end_time - start_time)
# Method b
start_time = time.time()
result_b = df[df.apply(lambda row: -1 < row['x'] < 1 and -1 < row['y'] < 1, axis=1)]
end_time = time.time()
elapsed_time_b = 1000 * abs(end_time - start_time)
# print results
print('For {0} points.'.format(n))
print('Method a took {0} ms and leaves us with {1} elements.'.format(elapsed_time_a, len(result_a)))
print('Method b took {0} ms and leaves us with {1} elements.'.format(elapsed_time_b, len(result_b)))
print('Method a is {0} X faster than method b.'.format(elapsed_time_b / elapsed_time_a))
Results for different values of n:
For 10 points.
Method a took 1.52087211609 ms and leaves us with 0 elements.
Method b took 0.456809997559 ms and leaves us with 0 elements.
Method a is 0.300360558081 X faster than method b.
For 100 points.
Method a took 1.55997276306 ms and leaves us with 1 elements.
Method b took 1.384973526 ms and leaves us with 1 elements.
Method a is 0.887819043252 X faster than method b.
For 1000 points.
Method a took 1.61004066467 ms and leaves us with 5 elements.
Method b took 10.448217392 ms and leaves us with 5 elements.
Method a is 6.48941211313 X faster than method b.
For 10000 points.
Method a took 1.59096717834 ms and leaves us with 115 elements.
Method b took 98.8278388977 ms and leaves us with 115 elements.
Method a is 62.1180878166 X faster than method b.
For 100000 points.
Method a took 2.14099884033 ms and leaves us with 1052 elements.
Method b took 995.483875275 ms and leaves us with 1052 elements.
Method a is 464.962360802 X faster than method b.
For 1000000 points.
Method a took 7.07101821899 ms and leaves us with 10045 elements.
Method b took 9613.26599121 ms and leaves us with 10045 elements.
Method a is 1359.5306494 X faster than method b.
When I compare it to a native Python list comprehension, method a is still much faster:
result_c = [(x, y) for (x, y) in test_points if -1 < x < 1 and -1 < y < 1]
Why is that?
If you follow the Pandas source code for apply, you will see that in general it ends up doing a plain Python for __ in __ loop.
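To make that concrete, here is a rough sketch of what df.apply(func, axis=1) boils down to (apply_rows_sketch is a hypothetical helper, not Pandas' actual implementation):

import pandas as pd

def apply_rows_sketch(df, func):
    # Sketch of axis=1 apply: a Python-level loop that materializes a
    # Series for every row and makes one Python function call per row.
    results = []
    for _, row in df.iterrows():
        results.append(func(row))
    return pd.Series(results, index=df.index)

# Roughly equivalent to method b above:
# mask = apply_rows_sketch(df, lambda row: -1 < row['x'] < 1 and -1 < row['y'] < 1)
# result_b = df[mask]

Every one of those n iterations pays interpreter overhead: building the row object, dispatching a Python function call, and boxing each value as a Python object.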
However, Pandas DataFrames are made up of Pandas Series, which under the hood are backed by NumPy arrays. Masked filtering, as in method a, uses the fast, vectorized operations that NumPy arrays allow. For info on why this is faster than plain Python loops (as in .apply), see Why are NumPy arrays so fast?
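To show what "vectorized" means here, this is a minimal sketch of the NumPy operations that method a's mask amounts to (a sketch of the equivalent calls, not Pandas internals):

import numpy as np

x = df['x'].to_numpy()  # the NumPy array backing the 'x' column
y = df['y'].to_numpy()
# Each comparison is a single C-level pass over the whole array and
# yields a boolean array; & combines the masks element-wise, also in C.
mask = (x < 1) & (x > -1) & (y < 1) & (y > -1)
result = df[mask]

The loop still happens, but it runs in compiled C code over contiguous memory instead of in the Python interpreter.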
The top answer from that question:
Numpy arrays are densely packed arrays of homogeneous type. Python lists, by contrast, are arrays of pointers to objects, even when all of them are of the same type. So, you get the benefits of locality of reference.
Also, many Numpy operations are implemented in C, avoiding the general cost of loops in Python, pointer indirection and per-element dynamic type checking. The speed boost depends on which operations you're performing, but a few orders of magnitude isn't uncommon in number crunching programs.
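As an unscientific spot check of the "orders of magnitude" claim (timings will of course vary by machine), you can compare a vectorized comparison against the equivalent Python-level loop:

import timeit
import numpy as np

a = np.random.uniform(-10, 10, size=100000)

vectorized = timeit.timeit(lambda: (a > -1) & (a < 1), number=100)
python_loop = timeit.timeit(lambda: [(-1 < v < 1) for v in a], number=100)
print('vectorized: {0:.4f} s, Python loop: {1:.4f} s'.format(vectorized, python_loop))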