I'm trying to improve the efficiency of a script that takes lists and counts how many items aren't in another 'master' list (list_of_all_items
).
It feels like there might be a more efficient way of doing this, perhaps through combining the querying somehow?
purple_count, brown_count, blue_count = 0, 0, 0
for item in list_of_purple_items:
if item not in list_of_all_items:
purple_count += 1
for item in list_of_brown_items:
if item not in list_of_all_items:
brown_list += 1
for item in list_of_blue_items:
if item not in list_of_all_items:
blue_count += 1
EDIT:
Thanks for your help. I ran a quick test to see what the best way was using a big test case:
my original: 30.21s
sets: 00.02s
filter: 30.01s
sum generator: 31.08s
Amazing how much more efficient it is using sets.
Thanks again all.
Use sets so you don't have to keep looping:
set_of_all_items = set(list_of_all_items)
purple_count = len(set(list_of_purple_items).difference(list_of_all_items))
brown_count = len(set(list_of_brown_items).difference(list_of_all_items))
blue_count = len(set(list_of_blue_items).difference(list_of_all_items))
This is far more efficient, because set intersections only require a loop over one of the two sets involved; each item can then be tested against the other set in constant time. The looping is done in C code (when creating the set
objects and when calculating the difference).
Using a set for all items is not really required as set.difference()
takes any iterable, but it is slightly faster:
>>> import timeit
>>> import random
>>> all = range(10000)
>>> random.shuffle(all)
>>> all[:-1000] = []
>>> some = [random.randrange(10000) for _ in range(1000)]
>>> timeit.timeit('len(set(some).difference(all))', 'from __main__ import some, all', number=10000)
0.9517788887023926
>>> timeit.timeit('len(set(some).difference(all))', 'from __main__ import some, all; all = set(all)', number=10000)
0.90407395362854
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With