I have the following two lists:
lst1 = [('vr1', '635', '1'), ('vr1', '32', '1'), ('vr1', '784', '0.526'), ('vr1', '431', '1')]
lst2 = [('vr1', '635', '3'), ('vr1', '784', '2.526'), ('vr1', '431', '2')]
Currently I determine what is unique to lst1 with the following code, where the comparison is based on the contents of the first two columns of every entry.
uniq = set([i[0:2] for i in lst1]).difference([j[0:2] for j in lst2])
Giving:
set([('vr1', '32')])
I then check, for every entry in lst1, whether its first two columns are contained in uniq, to get the complete entry.
uniq_full = [i for i in lst1 if i[0:2] in uniq]
Which returns the entry the way I want it.
[('vr1', '32', '1')]
My question is: Is there a faster way of obtaining the full entry?
Currently you are looping through lst1 and lst2 once to generate uniq, then looping through lst1 again to generate uniq_full.
Instead, you could loop through lst2 once to generate the keys to be removed, then loop through lst1 once to filter out the unwanted elements:
lst1 = [('vr1', '635', '1'), ('vr1', '32', '1'), ('vr1', '784', '0.526'), ('vr1', '431', '1')]
lst2 = [('vr1', '635', '3'), ('vr1', '784', '2.526'), ('vr1', '431', '2')]
remove_keys = set([item[:2] for item in lst2])
unique = [item for item in lst1 if item[:2] not in remove_keys]
print(unique)
yields
[('vr1', '32', '1')]
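Both versions do the same constant amount of work per element; the speedup comes from building a single set of keys (from lst2) and skipping the difference() call, instead of building key collections for both lists.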
And here is a timeit test (using IPython) which shows it is faster:
def orig(lst1, lst2):
    uniq = set([i[0:2] for i in lst1]).difference([j[0:2] for j in lst2])
    uniq_full = [i for i in lst1 if i[0:2] in uniq]
    return uniq_full

def alt(lst1, lst2):
    remove_keys = set([item[:2] for item in lst2])
    unique = [item for item in lst1 if item[:2] not in remove_keys]
    return unique
In [4]: %timeit orig(lst1, lst2)
100000 loops, best of 3: 2.29 µs per loop
In [5]: %timeit alt(lst1, lst2)
1000000 loops, best of 3: 1.36 µs per loop
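As a side note (not timed in the original answer), on Python 2.7+ the intermediate list of keys can also be avoided with a set comprehension; alt3 below is a hypothetical variant for comparison:

def alt3(lst1, lst2):
    # set comprehension: same result as set([...]) without first
    # building a temporary list of keys
    remove_keys = {item[:2] for item in lst2}
    return [item for item in lst1 if item[:2] not in remove_keys]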
Regarding @otus's comment (below): let's see if using a generator expression when creating remove_keys improves speed:
def alt2(lst1, lst2):
    remove_keys = set(item[:2] for item in lst2)
    unique = [item for item in lst1 if item[:2] not in remove_keys]
    return unique
In [7]: %timeit alt2(lst1, lst2)
1000000 loops, best of 3: 1.54 µs per loop
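The small slowdown for a list this short is most likely the per-item overhead of resuming the generator: with only a handful of elements, materializing a temporary list and handing it to set() is cheaper.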
Here is a benchmark for lists with ~10**4 items:
In [8]: lst1 = [('vr1', '635', '1'), ('vr1', '32', '1'), ('vr1', '784', '0.526'), ('vr1', '431', '1')]*10000
In [9]: lst2 = [('vr1', '635', '3'), ('vr1', '784', '2.526'), ('vr1', '431', '2')]*10000
In [10]: %timeit alt(lst1, lst2)
100 loops, best of 3: 9.34 ms per loop
In [11]: %timeit alt2(lst1, lst2)
100 loops, best of 3: 9.49 ms per loop
In [12]: %timeit orig(lst1, lst2)
100 loops, best of 3: 13.5 ms per loop
And here is another benchmark for lists with ~10**6 items:
In [19]: %timeit alt(lst1, lst2)
1 loops, best of 3: 972 ms per loop
In [20]: %timeit alt2(lst1, lst2)
1 loops, best of 3: 957 ms per loop
So for small or medium-sized lists, the list comprehension is faster, but for larger lists, the generator expression is faster. Of course, what counts as large or small depends on your machine; you'll need to benchmark with timeit to know what is best for you.
Typically, it appears that when you have enough memory and you need to iterate through the entire collection, a list comprehension is faster than a generator; a generator is better when memory is tight or when you do not need to iterate over the entire collection.
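A minimal sketch (not from the original answer, using an illustrative rows list) that makes the memory difference visible is to compare the size of a materialized key list with the size of a generator object:

import sys

# illustrative data; any sizeable list of tuples behaves the same way
rows = [('vr1', str(i), '1') for i in range(10**5)]

keys_list = [row[:2] for row in rows]   # all keys built up front in memory
keys_gen = (row[:2] for row in rows)    # keys produced lazily, one at a time

print(sys.getsizeof(keys_list))  # grows with len(rows)
print(sys.getsizeof(keys_gen))   # small and constant, regardless of len(rows)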
It's also interesting to look at @thg435's solution. In some cases it is faster:
def using_dicts(lst1, lst2):
    d1 = {x[0:2]: x for x in lst1}
    d2 = {x[0:2]: x for x in lst2}
    return [d1[key] for key in set(d1) - set(d2)]
If your lists contain a lot of duplicate keys:
lst1 = [('vr1', '635', '1'), ('vr1', '32', '1'), ('vr1', '784', '0.526'), ('vr1', '431', '1')]*10000
lst2 = [('vr1', '635', '3'), ('vr1', '784', '2.526'), ('vr1', '431', '2')]*10000
Then using_dicts is faster than alt:
In [31]: %timeit alt(lst1, lst2)
100 loops, best of 3: 8.39 ms per loop
In [32]: %timeit using_dicts(lst1, lst2)
100 loops, best of 3: 7.98 ms per loop
I think this is because, in the example above, lst1 and lst2 contain so many duplicate x[0:2] keys that d1 and d2 are tiny.
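You can check this, and see an important behavioural difference, with a short sketch built on the duplicated lists above (using the alt and using_dicts functions defined earlier): the dict comprehension keeps only the last row per key, so using_dicts collapses duplicates, while alt keeps every matching row.

d1 = {x[0:2]: x for x in lst1}
print(len(lst1), len(d1))            # 40000 4  -- only four distinct keys
print(len(alt(lst1, lst2)))          # 10000    -- every duplicate row is kept
print(len(using_dicts(lst1, lst2)))  # 1        -- one row per unique key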
If your lst1 and lst2 contain lots of unique keys, such as when
lst1 = [(i,i,i) for i in range(10**4+100)]
lst2 = [(i,i,i) for i in range(10**4)]
then alt is faster than using_dicts:
In [34]: %timeit alt(lst1, lst2)
100 loops, best of 3: 3.12 ms per loop
In [35]: %timeit using_dicts(lst1, lst2)
100 loops, best of 3: 5.93 ms per loop
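Finally, if you are not in IPython, a minimal sketch of the same kind of measurement with the standard timeit module (using the functions defined above) would be:

import timeit

lst1 = [(i, i, i) for i in range(10**4 + 100)]
lst2 = [(i, i, i) for i in range(10**4)]

for fn in (orig, alt, alt2, using_dicts):
    # best of 3 repeats of 100 calls each, roughly what %timeit reports
    best = min(timeit.repeat(lambda: fn(lst1, lst2), repeat=3, number=100))
    print(fn.__name__, best / 100, 'seconds per call')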