Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Obtain unique entries in column specific list comparison, efficiently

I have the following two lists:

lst1 = [('vr1', '635', '1'), ('vr1', '32', '1'), ('vr1', '784', '0.526'), ('vr1', '431', '1')]

lst2 = [('vr1', '635', '3'), ('vr1', '784', '2.526'), ('vr1', '431', '2')]

Currently I determine what is unique for lst1 with the following code. Where the comparison is based on the contents of the first two columns for every entry.

uniq = set([i[0:2] for i in lst1]).difference([j[0:2] for j in lst2])

Giving:

set([('vr1', '32')])

I then search for every entry in lst1 if it's contained in uniq to get the complete entry.

uniq_full = [i for i in lst1 if i[0:2] in uniq]

Which returns the entry the way I want it.

[('vr1', '32', '1')]

My question is: Is there a faster way of obtaining the full entry?

like image 398
Mvsm Avatar asked Feb 14 '23 21:02

Mvsm


1 Answers

Currently you are looping through lst1 and lst2 once to generate uniq. Then you are looping through lst1 again to generate uniq_full.

Instead, you could loop through lst2 once to generate the keys to be removed, then loop through lst1 once to filter out the unwanted elements:

lst1 = [('vr1', '635', '1'), ('vr1', '32', '1'), ('vr1', '784', '0.526'), ('vr1', '431', '1')]    
lst2 = [('vr1', '635', '3'), ('vr1', '784', '2.526'), ('vr1', '431', '2')]

remove_keys = set([item[:2] for item in lst2])
unique = [item for item in lst1 if item[:2] not in remove_keys]
print(unique)

yields

[('vr1', '32', '1')]

And here is a timeit test (using IPython) which shows it is faster:

def orig(lst1, lst2):
    uniq = set([i[0:2] for i in lst1]).difference([j[0:2] for j in lst2])
    uniq_full = [i for i in lst1 if i[0:2] in uniq]
    return uniq_full

def alt(lst1, lst2):
    remove_keys = set([item[:2] for item in lst2])
    unique = [item for item in lst1 if item[:2] not in remove_keys]
    return unique

In [4]: %timeit orig(lst1, lst2)
100000 loops, best of 3: 2.29 µs per loop

In [5]: %timeit alt(lst1, lst2)
1000000 loops, best of 3: 1.36 µs per loop

Regarding's @otus's comment (below): Let's see if using a generator expression when creating remove_keys improves speed:

def alt2(lst1, lst2):
    remove_keys = set(item[:2] for item in lst2)
    unique = [item for item in lst1 if item[:2] not in remove_keys]
    return unique

In [7]: %timeit alt2(lst1, lst2)
1000000 loops, best of 3: 1.54 µs per loop

Here is a benchmark for lists with ~10**4 items:

In [8]: lst1 = [('vr1', '635', '1'), ('vr1', '32', '1'), ('vr1', '784', '0.526'), ('vr1', '431', '1')]*10000

In [9]: lst2 = [('vr1', '635', '3'), ('vr1', '784', '2.526'), ('vr1', '431', '2')]*10000

In [10]: %timeit alt(lst1, lst2)
100 loops, best of 3: 9.34 ms per loop

In [11]: %timeit alt2(lst1, lst2)
100 loops, best of 3: 9.49 ms per loop

In [12]: %timeit orig(lst1, lst2)
100 loops, best of 3: 13.5 ms per loop

And here is another benchmark for lists with ~10**6 items:

In [19]: %timeit alt(lst1, lst2)
1 loops, best of 3: 972 ms per loop

In [20]: %timeit alt2(lst1, lst2)
1 loops, best of 3: 957 ms per loop

So for small or medium sized lists, a list comprehension is faster, but for larger lists, using generator expression is faster. Of course, what is considered large or small depends on your machine. You'll need to benchmark with timeit to know what is best for you.

Typically, it appears that when you have enough memory, if you need to iterate through the entire collection, using a list comprehension is faster than a generator. A generator is better when you do not have enough memory or do not need to iterate over the entire collection.


It's also interesting to look at @thg435's solution. In some cases it is faster:

def using_dicts(lst1, lst2):
    d1 = {x[0:2]:x for x in lst1}
    d2 = {x[0:2]:x for x in lst2}
    return [d1[key] for key in set(d1) - set(d2)]

If your lists contain a lot of duplicate keys:

lst1 = [('vr1', '635', '1'), ('vr1', '32', '1'), ('vr1', '784', '0.526'), ('vr1', '431', '1')]*10000

lst2 = [('vr1', '635', '3'), ('vr1', '784', '2.526'), ('vr1', '431', '2')]*10000

Then using_dicts is faster than alt:

In [31]: %timeit alt(lst1, lst2)
100 loops, best of 3: 8.39 ms per loop

In [32]: %timeit using_dicts(lst1, lst2)
100 loops, best of 3: 7.98 ms per loop

I think this is because in the example above, lst1 and lst2 contain so many duplicate x[0:2]s that d1 and d2 are tiny.

If your lst1 and lst2 contain lots of unique keys, such as when

lst1 = [(i,i,i) for i in range(10**4+100)]
lst2 = [(i,i,i) for i in range(10**4)]

then alt is faster than using_dicts:

In [34]: %timeit alt(lst1, lst2)
100 loops, best of 3: 3.12 ms per loop

In [35]: %timeit using_dicts(lst1, lst2)
100 loops, best of 3: 5.93 ms per loop
like image 124
unutbu Avatar answered Apr 26 '23 23:04

unutbu