Sorry for the poorly worded title but I asked a question earlier about getting a unique list of items from two lists. People told me to make the list -> sets and then union.
So now I'm wondering if it's faster to:
I should probably just read up on sets in hindsight...
In Python, by the way - sorry for not clarifying.
Sets use hashing to perform look ups which makes them way faster than lists in this regard. (In the practical example the code using lists took about 45 seconds to run, whereas the code with sets took less than a tenth of a second!)
For example, sets do membership tests a lot more efficiently than lists. In case you are from a computer science background, this is because the average case time complexity of membership tests in sets are O(1) vs O(n) for lists. The code below shows a membership test using a list.
Lists are slightly faster than sets when you just want to iterate over the values. Sets, however, are significantly faster than lists if you want to check if an item is contained within it. They can only contain unique items though.
Generally the lists are faster than sets. But in the case of searching for an element in a collection, sets are faster because sets have been implemented using hash tables. So basically Python does not have to search the full set, which means that the time complexity in average is O(1).
as you can see extending one list by another end then remove duplicates by making set is the fastest way(at least in python;))
>>> def foo():
... """
... extending one list by another end then remove duplicates by making set
... """
... l1 = range(200)
... l2 = range(150, 250)
... l1.extend(l2)
... set(l1)
...
>>> def bar():
... """
... checking if element is on one list end adding it only if not
... """
... l1 = range(200)
... l2 = range(150, 250)
... for elem in l2:
... if elem not in l1:
... l1.append(elem)
...
>>> def baz():
... """
... making sets from both lists and then union from them
... """
... l1 = range(200)
... l2 = range(150, 250)
... set(l1) | set(l2)
...
>>> from timeit import Timer
>>> Timer(foo).timeit(10000)
0.265153169631958
>>> Timer(bar).timeit(10000)
7.921358108520508
>>> Timer(baz).timeit(10000)
0.3845551013946533
>>>
I really like the approach virhilo did, but it's a pretty specific set of data he was testing. In all this don't just test the functions, but test them how you'll be doing it. I put together a much more exhaustive test set. It runs each function you specify (with just a little decorator) through a list of comparisons, and figures out how long each function takes and therefore how much slower it is. The result is that it's not always clear which function you should be doing without knowing more about the size, overlap and type of your data.
Here's my test program, below will be the output.
from timeit import Timer
from copy import copy
import random
import sys
funcs = []
class timeMe(object):
def __init__(self, f):
funcs.append(f)
self.f = f
def __call__(self, *args, **kwargs):
return self.f(*args, **kwargs)
@timeMe
def extend_list_then_set(input1, input2):
"""
extending one list by another end then remove duplicates by making set
"""
l1 = copy(input1)
l2 = copy(input2)
l1.extend(l2)
set(l1)
@timeMe
def per_element_append_to_list(input1, input2):
"""
checking if element is on one list end adding it only if not
"""
l1 = copy(input1)
l2 = copy(input2)
for elem in l2:
if elem not in l1:
l1.append(elem)
@timeMe
def union_sets(input1, input2):
"""
making sets from both lists and then union from them
"""
l1 = copy(input1)
l2 = copy(input2)
set(l1) | set(l2)
@timeMe
def set_from_one_add_from_two(input1, input2):
"""
make set from list 1, then add elements for set 2
"""
l1 = copy(input1)
l2 = copy(input2)
l1 = set(l1)
for element in l2:
l1.add(element)
@timeMe
def set_from_one_union_two(input1, input2):
"""
make set from list 1, then union list 2
"""
l1 = copy(input1)
l2 = copy(input2)
x = set(l1).union(l2)
@timeMe
def chain_then_set(input1, input2):
"""
chain l1 & l2, then make a set out of that
"""
l1 = copy(input1)
l2 = copy(input2)
set(itertools.chain(l1, l2))
def run_results(l1, l2, times):
for f in funcs:
t = Timer('%s(l1, l2)' % f.__name__,
'from __main__ import %s; l1 = %s; l2 = %s' % (f.__name__, l1, l2))
yield (f.__name__, t.timeit(times))
test_datasets = [
('original, small, some overlap', range(200), range(150, 250), 10000),
('no overlap: l1 = [1], l2 = [2..100]', [1], range(2, 100), 10000),
('lots of overlap: l1 = [1], l2 = [1]*100', [1], [1]*100, 10000),
('50 random ints below 2000 in each', [random.randint(0, 2000) for x in range(50)], [random.randint(0, 2000) for x in range(50)], 10000),
('50 elements in each, no overlap', range(50), range(51, 100), 10000),
('50 elements in each, total overlap', range(50), range(50), 10000),
('500 random ints below 500 in each', [random.randint(0, 500) for x in range(500)], [random.randint(0, 500) for x in range(500)], 1000),
('500 random ints below 2000 in each', [random.randint(0, 2000) for x in range(500)], [random.randint(0, 2000) for x in range(500)], 1000),
('500 random ints below 200000 in each', [random.randint(0, 200000) for x in range(500)], [random.randint(0, 200000) for x in range(500)], 1000),
('500 elements in each, no overlap', range(500), range(501, 1000), 10000),
('500 elements in each, total overlap', range(500), range(500), 10000),
('10000 random ints below 200000 in each', [random.randint(0, 200000) for x in range(10000)], [random.randint(0, 200000) for x in range(10000)], 50),
('10000 elements in each, no overlap', range(10000), range(10001, 20000), 10),
('10000 elements in each, total overlap', range(10000), range(10000), 10),
('original lists 100 times', range(200)*100, range(150, 250)*100, 10),
]
fullresults = []
for description, l1, l2, times in test_datasets:
print "Now running %s times: %s" % (times, description)
results = list(run_results(l1, l2, times))
speedresults = [x for x in sorted(results, key=lambda x: x[1])]
for name, speed in results:
finish = speedresults.index((name, speed)) + 1
timesslower = speed / speedresults[0][1]
fullresults.append((description, name, speed, finish, timesslower))
print '\t', finish, ('%.2fx' % timesslower).ljust(10), name.ljust(40), speed
print
import csv
out = csv.writer(sys.stdout)
out.writerow(('Test', 'Function', 'Speed', 'Place', 'timesslower'))
out.writerows(fullresults)
My point here is to encourage you to test with your data, so I don't want to harp on specifics. However... The first extend method is the fastest average method, but set_from_one_union_two (x = set(l1).union(l2)
) wins a few of times. You can get more details if you run the script yourself.
The numbers I'm reporting are the number of times slower this function is than the fatest function on that test. If it was the fastest, it will be 1.
Functions
Tests extend_list_then_set per_element_append_to_list set_from_one_add_from_two set_from_one_union_two union_sets chain_then_set
original, small, some overlap 1 25.04 1.53 1.18 1.39 1.08
no overlap: l1 = [1], l2 = [2..100] 1.08 13.31 2.10 1 1.27 1.07
lots of overlap: l1 = [1], l2 = [1]*100 1.10 1.30 2.43 1 1.25 1.05
50 random ints below 2000 in each 1 7.76 1.35 1.20 1.31 1
50 elements in each, no overlap 1 9.00 1.48 1.13 1.18 1.10
50 elements in each, total overlap 1.08 4.07 1.64 1.04 1.41 1
500 random ints below 500 in each 1.16 68.24 1.75 1 1.28 1.03
500 random ints below 2000 in each 1 102.42 1.64 1.43 1.81 1.20
500 random ints below 200000 in each 1.14 118.96 1.99 1.52 1.98 1
500 elements in each, no overlap 1.01 145.84 1.86 1.25 1.53 1
500 elements in each, total overlap 1 53.10 1.95 1.16 1.57 1.05
10000 random ints below 200000 in each 1 2588.99 1.73 1.35 1.88 1.12
10000 elements in each, no overlap 1 3164.01 1.91 1.26 1.65 1.02
10000 elements in each, total overlap 1 1068.67 1.89 1.26 1.70 1.05
original lists 100 times 1.11 2068.06 2.03 1 1.04 1.17
Average 1.04 629.25 1.82 1.19 1.48 1.06
Standard Deviation 0.05 1040.76 0.26 0.15 0.26 0.05
Max 1.16 3164.01 2.43 1.52 1.98 1.20
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With