I had a really simple problem and got some strange results when I tried to figure out which solution is faster.
The original Problem: Given two lists ListA
, ListB
and a constant k
, remove all entries where the two lists sum to k
.
I solved the problem in two ways: First I tried using a loop and then I used list comprehension and zip()
to zip and unzip the two lists.
The version using a loop.
def Remove_entries_simple(listA, listB, k):
""" removes entries that sum to k """
new_listA = []
new_listB = []
for index in range(len(listA)):
if listA[index] + listB[index] == k:
pass
else:
new_listA.append(listA[index])
new_listB.append(listB[index])
return(new_listA, new_listB)
The version using list comprehension and zip()
def Remove_entries_zip(listA, listB, k):
""" removes entries that sum to k using zip"""
zip_lists = [(a, b) for (a, b) in zip(listA, listB) if not (a+b) == k]
# unzip the lists
new_listA, new_listB = zip(*zip_lists)
return(list(new_listA), list(new_listB))
Then I tried to determine which approach is faster. But then I got what you see in figure below (x-axis: size of the lists, y-axis: average time to run it, 10**3 repetitions). For some reason the version using zip()
always makes the same jump at somewhat the same position -- I ran it multiple times and on different machines. Can someone explain what could cause such an odd behavior?
Update: The code I used to generate the plot. I used a function decorator to run each problem a 1000 times.
import statements:
import random
import time
import matplotlib.pyplot as plt
The function decorator:
def Repetition_Decorator(fun, Rep=10**2):
''' returns the average over Rep repetitions'''
def Return_function(*args, **kwargs):
Start_time = time.clock()
for _ in range(Rep):
fun(*args, **kwargs)
return (time.clock() - Start_time)/Rep
return Return_function
The code to create the plots:
Zippedizip = []
Loops = []
The_Number = 10
Size_list = list(range(10, 1000, 10))
Repeated_remove_loop = Repetition_Decorator(Remove_entries_simple, Rep=10**3)
Repeated_remove_zip = Repetition_Decorator(Remove_entries_zip, Rep=10**3)
for size in Size_list:
ListA = [random.choice(range(10)) for _ in range(size)]
ListB = [random.choice(range(10)) for _ in range(size)]
Loops.append(Repeated_remove_loop(ListA, ListB, The_Number))
Zippedizip.append(Repeated_remove_zip(ListA, ListB, The_Number))
plt.xlabel('Size of List')
plt.ylabel('Averaged time in seconds')
plt.plot(Size_list, Loops, label="Using Loop")
plt.plot(Size_list, Zippedizip, label="Zip")
plt.legend(loc='upper left', shadow=False, fontsize='x-large')
plt.show()
Update-Update: thanks to kaya3 for pointing out the timeit module.
To be as close as possible to my original code but also use the timeit module, I created a new function decorator that uses the timeit module for timing the code.
The new decorator:
def Repetition_Decorator_timeit(fun, Rep=10**2):
"""returns average over Rep repetitions with timeit"""
def Return_function(*args, **kwargs):
partial_fun = lambda: fun(*args, **kwargs)
return timeit.timeit(partial_fun, number=Rep) / Rep
return Return_function
When I use the new decorator the version using the for loop is not affected but the zip
version no longer makes the jump.
So far I feel quite certain that the jump is a result of how I measure the function rather than the function itself. But the jump is so distinct -- always at the same list size across different machines -- that it cannot be a fluke. Any ideas why exactly this jump happens?
Update-Update-Update:
It has something to do with the garbage collector, because if I disable the garbage collector with gc.disable()
, both ways of measuring give the same result.
What did I learn here: Do not just measure the time of execution yourself. Use the timeit
module for measuring performance of code snippets.
List comprehensions are faster than for loops to create lists. But, this is because we are creating a list by appending new elements to it at each iteration.
The for loop is a common way to iterate through a list. List comprehension, on the other hand, is a more efficient way to iterate through a list because it requires fewer lines of code. List comprehension requires less computation power than a for loop because it takes up less space and code.
The list comprehension is 50% faster.
Conclusions. List comprehensions are often not only more readable but also faster than using "for loops." They can simplify your code, but if you put too much logic inside, they will instead become harder to read and understand.
This seems to be an artifact of the way you have measured the running times. I do not know what causes your timing code to produce this effect, but the effect disappears when I use timeit
to measure the running times instead. I'm using Python 3.6.2.
I can consistently reproduce the effect using your timing code; I get the zip
version's running time jumping at around the same threshold, though it is still slightly faster than the other version on my machine:
However, when I measure the times using timeit
, the effect disappears completely:
Here's the code using timeit
; I changed as little as possible from your analysis code.
import timeit
Zippedizip = []
Loops = []
The_Number = 10
Size_list = list(range(10, 1000, 10))
Reps = 1000
for size in Size_list:
ListA = [random.choice(range(10)) for _ in range(size)]
ListB = [random.choice(range(10)) for _ in range(size)]
remove_loop = lambda: Remove_entries_simple(ListA, ListB, The_Number)
remove_zip = lambda: Remove_entries_zip(ListA, ListB, The_Number)
Loops.append(timeit.timeit(remove_loop, number=Reps) / Reps)
Zippedizip.append(timeit.timeit(remove_zip, number=Reps) / Reps)
# ...
So I think it's a spurious result. That said, I don't understand what's causing it in your timing code. I tried simplifying your timing code to not use a decorator or vargs, and I replaced time.clock()
with time.perf_counter()
which is more accurate, but that didn't change anything.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With