Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most efficient way to loop finding items not in a list in python

I'm trying to improve the efficiency of a script that takes lists and counts how many items aren't in another 'master' list (list_of_all_items).

It feels like there might be a more efficient way of doing this, perhaps through combining the querying somehow?

purple_count, brown_count, blue_count = 0, 0, 0

for item in list_of_purple_items:
    if item not in list_of_all_items:
        purple_count += 1

for item in list_of_brown_items:
    if item not in list_of_all_items:
        brown_list += 1

for item in list_of_blue_items:
    if item not in list_of_all_items:
        blue_count += 1

EDIT:

Thanks for your help. I ran a quick test to see what the best way was using a big test case:

    my original: 30.21s
           sets: 00.02s
         filter: 30.01s
  sum generator: 31.08s

Amazing how much more efficient it is using sets.

Thanks again all.

like image 865
seymourgoestohollywood Avatar asked Dec 06 '22 22:12

seymourgoestohollywood


1 Answers

Use sets so you don't have to keep looping:

set_of_all_items = set(list_of_all_items)
purple_count = len(set(list_of_purple_items).difference(list_of_all_items))
brown_count = len(set(list_of_brown_items).difference(list_of_all_items))
blue_count = len(set(list_of_blue_items).difference(list_of_all_items))

This is far more efficient, because set intersections only require a loop over one of the two sets involved; each item can then be tested against the other set in constant time. The looping is done in C code (when creating the set objects and when calculating the difference).

Using a set for all items is not really required as set.difference() takes any iterable, but it is slightly faster:

>>> import timeit
>>> import random
>>> all = range(10000)
>>> random.shuffle(all)
>>> all[:-1000] = []
>>> some = [random.randrange(10000) for _ in range(1000)]
>>> timeit.timeit('len(set(some).difference(all))', 'from __main__ import some, all', number=10000)
0.9517788887023926
>>> timeit.timeit('len(set(some).difference(all))', 'from __main__ import some, all; all = set(all)', number=10000)
0.90407395362854
like image 114
Martijn Pieters Avatar answered Apr 28 '23 04:04

Martijn Pieters