I was trying to improve the performance of the func function and I found that a simple change in how the aX list is generated improves the performance quite a bit:
import timeit
import numpy as np
def func(a, b):
return [_ for _ in a if _ not in b]
Na, Nb = 10000, 5000
b = list(np.random.randint(1000, size=Nb))
# Ordered list of Na integers
a1 = [_ for _ in range(Na)]
# Random list of Na integers
a2 = list(np.random.randint(Na, size=Na))
# Ordered list of Na integers generated with numpy
a3 = list(np.arange(Na))
start_time = timeit.default_timer()
ab1 = func(a1, b)
abt1 = timeit.default_timer() - start_time
print("Time ab1", abt1)
start_time = timeit.default_timer()
ab2 = func(a2, b)
abt2 = timeit.default_timer() - start_time
print("Time ab2", abt2)
start_time = timeit.default_timer()
ab3 = func(a3, b)
abt3 = timeit.default_timer() - start_time
print("Time ab3", abt3)
print("Ratio 1/2:", abt1 / abt2)
print("Ratio 1/3:", abt1 / abt3)
In Python 2.7.13 this results in:
('Time ab1', 5.296088933944702)
('Time ab2', 1.5520200729370117)
('Time ab3', 1.5581469535827637)
('Ratio 1/2:', 3.412384302428827)
('Ratio 1/3:', 3.3989662667998095)
In Python 3.5.2 the difference is even larger:
Time ab1 6.758207322000089
Time ab2 1.5693355060011527
Time ab3 1.5148192759988888
Ratio 1/2: 4.306413317073784
Ratio 1/3: 4.461395117608107
I need to process an ordered list integers (i.e: a1 or a3), so my question is:
Why is the random list processed so much faster than the ordered list not generated with numpy?
Your b, a2, and a3 lists are lists of NumPy scalars, while your a1 list is a list of ordinary Python ints. Comparing NumPy scalars to ordinary Python scalars requires a lot of extra type checking and coercion, so the func(a1, b) test, which needs to compare NumPy scalars to ordinary Python scalars, performs slowest.
If you make b a list of Python ints (by calling the tolist method instead of the list function), the time difference is reversed.
You may want to consider using Python sets or NumPy's set-like operations to perform your task.
As discussed here numpy arrays are much faster than python lists. This is why the numpy arrays seem faster as you are still using a numpy array when you call the list() function.
Using the numpy .tolist() function converts a NumPy array to regular Python objects all the way down (as user2357112 pointed out) and the performance differences disappear, see:
import timeit
import numpy as np
def func(a, b):
return [_ for _ in a if _ not in b]
Na, Nb = 10000, 5000
b = list(np.random.randint(Na, size=Nb)) # len: 5000, max: 9999
# Ordered list of Na integers
a1 = [_ for _ in range(Na)] # len: 10000, max: 9999
# Random list of Na integers
a2 = np.random.randint(Na, size=Na).tolist() # len: 10000, max: 9999
# Ordered list of Na integers generated with numpy
a3 = np.arange(Na).tolist()
start_time = timeit.default_timer()
ab1 = func(a1, b)
abt1 = timeit.default_timer() - start_time
print("Time ab1", abt1)
start_time = timeit.default_timer()
ab2 = func(a2, b)
abt2 = timeit.default_timer() - start_time
print("Time ab2", abt2)
start_time = timeit.default_timer()
ab3 = func(a3, b)
abt3 = timeit.default_timer() - start_time
print("Time ab3", abt3)
print("Ratio 1/2:", abt1 / abt2)
print("Ratio 1/3:", abt1 / abt3)
#Time ab1 4.622085004015502
#Time ab2 4.598610720638726
#Time ab3 4.63976530848255
#Ratio 1/2: 1.005104646773301
#Ratio 1/3: 0.9961893968139456
Hopefully this answers your first question!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With