Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

compare two lists to get non matching elements

Tags:

python

list

I checked this comparing lists, Only one answer is relative to what I am trying to do. I have to lists with some similiar elements, I want to get the non -matching elements.

len(h) = 1973182  #h[0] = 'B00006J8F4F2', y[0] = 'B0075Y2X2GO6'
len(y) = 656890

I am doing

new_list = [i for i in h if i not in y],however this takes about 13 minutes to do, Is there a faster way of doing this?

In refer to "duplicate" question, Finding elements not in a list, I use the same code, What I am looking for is a faster way of doing it.

like image 295
programmerwiz32 Avatar asked Sep 13 '25 05:09

programmerwiz32


2 Answers

You can use sets to more efficiently find the difference between both lists. If you need to keep the order in the original list you can use sorted with a key.

We want to sort the elements in the set according to their appearance in the original list, so one way is to build a lookup dictionary. We can use enumerate for that. Then we only need to lookup on the dictionary as a key function:

d = {j:i for i,j in enumerate(h)}
new_list  = sorted(list((set(h) - set(y))), key = lambda x: d[x])

Let's try with a simple example:

y = range(5)
h = range(7)
d = {j:i for i,j in enumerate(h)}
sorted(list((set(h) - set(y))), key = lambda x: d[x])
# [5, 6]

Timings -

import random
y = random.sample(range(1, 10001), 10000)
h = random.sample(range(1, 20001), 10000)

%timeit [i for i in h if i not in y]
# 1.28 s ± 37.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

def using_sets(a,b):
    d = {j:i for i,j in enumerate(a)}
    sorted(list((set(a) - set(b))), key = lambda x: d[x])

%timeit using_sets(h,y)
# 6.16 ms ± 373 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So there's a clear improvement, with the proposed approach performing up to 200 times faster.

like image 83
yatu Avatar answered Sep 15 '25 18:09

yatu


The answer you linked to suggests using sets, because they use hashes to look thing up quickly. With lists, and in, like

new_list = [i for i in h if i not in y]

the whole of list y needs checking each time for each i in h.

You could use sets, but as has been pointed out need to be careful with duplicates getting lost.

You could use a Counter:

from collections import Counter

the with two lists, say

l1 = [1,1,2,3,4]
l2 = [3,3,4,5,6]

for examples' sake, can use fed into a Counter each

>>> Counter(l1)
Counter({1: 2, 2: 1, 3: 1, 4: 1})
>>> Counter(l2)
Counter({3: 2, 4: 1, 5: 1, 6: 1})

This just walks each list once. Subtracting them gives what's in the first but not the second:

>>> Counter(l1)-Counter(l2)
Counter({1: 2, 2: 1})

The elements tell you what you want

>>> diff = Counter(l1)-Counter(l2)
>>> list(diff.elements())
[1, 1, 2]
like image 45
doctorlove Avatar answered Sep 15 '25 19:09

doctorlove