I checked this comparing lists, Only one answer is relative to what I am trying to do. I have to lists with some similiar elements, I want to get the non -matching elements.
len(h) = 1973182 #h[0] = 'B00006J8F4F2', y[0] = 'B0075Y2X2GO6'
len(y) = 656890
I am doing
new_list = [i for i in h if i not in y]
,however this takes about 13 minutes to do, Is there a faster way of doing this?
In refer to "duplicate" question, Finding elements not in a list, I use the same code, What I am looking for is a faster way of doing it.
You can use sets
to more efficiently find the difference between both lists. If you need to keep the order in the original list you can use sorted
with a key
.
We want to sort the elements in the set according to their appearance in the original list, so one way is to build a lookup dictionary. We can use enumerate
for that. Then we only need to lookup on the dictionary as a key
function:
d = {j:i for i,j in enumerate(h)}
new_list = sorted(list((set(h) - set(y))), key = lambda x: d[x])
Let's try with a simple example:
y = range(5)
h = range(7)
d = {j:i for i,j in enumerate(h)}
sorted(list((set(h) - set(y))), key = lambda x: d[x])
# [5, 6]
Timings -
import random
y = random.sample(range(1, 10001), 10000)
h = random.sample(range(1, 20001), 10000)
%timeit [i for i in h if i not in y]
# 1.28 s ± 37.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
def using_sets(a,b):
d = {j:i for i,j in enumerate(a)}
sorted(list((set(a) - set(b))), key = lambda x: d[x])
%timeit using_sets(h,y)
# 6.16 ms ± 373 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So there's a clear improvement, with the proposed approach performing up to 200 times faster.
The answer you linked to suggests using sets, because they use hashes to look thing up quickly.
With lists, and in
, like
new_list = [i for i in h if i not in y]
the whole of list y
needs checking each time for each i
in h
.
You could use sets, but as has been pointed out need to be careful with duplicates getting lost.
You could use a Counter
:
from collections import Counter
the with two lists, say
l1 = [1,1,2,3,4]
l2 = [3,3,4,5,6]
for examples' sake, can use fed into a Counter each
>>> Counter(l1)
Counter({1: 2, 2: 1, 3: 1, 4: 1})
>>> Counter(l2)
Counter({3: 2, 4: 1, 5: 1, 6: 1})
This just walks each list once. Subtracting them gives what's in the first but not the second:
>>> Counter(l1)-Counter(l2)
Counter({1: 2, 2: 1})
The elements
tell you what you want
>>> diff = Counter(l1)-Counter(l2)
>>> list(diff.elements())
[1, 1, 2]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With