Say I wish to compute the difference of two lists, C = A - B:
A = [1,2,3,4,5,6,7,8,9]
B = [1,3,5,8,9]
C = [2,4,6,7] #Result
A and B are both sorted and contain unique integers (not sure if there is a way to tell Python about this property of the lists). I need to preserve the order of the elements. AFAIK there are two possible ways of doing it:
Method 1: Convert B into a set and use list comprehension to generate C:
s = set(B)
C = [x for x in A if x not in s]
Method 2: Directly use list comprehension:
C = [x for x in A if x not in B]
Why is #1 more efficient than #2? Isn't there an overhead in converting to a set? What am I missing here?
Some performance benchmarks are given in this answer.
UPDATE: I'm aware that a set's average O(1) lookup time beats a list's O(n), but if the original list A contains about a million or so integers, wouldn't the set creation actually take longer?
Generally, lists are faster than sets for most operations. But when you are searching for an element in a collection, sets are faster because they are implemented with hash tables: Python does not have to scan the whole set, so the average time complexity of a membership test is O(1).
Because sets cannot contain multiple occurrences of the same element, they are also highly useful for efficiently removing duplicate values from a list or tuple and for performing common math operations like unions and intersections.
As soon as you start testing membership of elements that are in the middle or at the end of the collection, sets perform 40%–1800% better than lists or tuples. You now have a fair idea why you should think of using sets for large collections.
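To make that concrete, here is a minimal sketch (not a rigorous benchmark; the exact numbers depend on your machine and Python version) that times a single membership test against a million-element list and against a set built from the same data:

import timeit

data = list(range(1_000_000))
data_set = set(data)

# Worst case for the list: the element we look up is the very last one.
needle = 999_999

list_time = timeit.timeit(lambda: needle in data, number=100)
set_time = timeit.timeit(lambda: needle in data_set, number=100)

print(f"list lookup: {list_time:.4f}s for 100 lookups")
print(f"set lookup:  {set_time:.4f}s for 100 lookups")

The list time grows with the size of the list, while the set time stays essentially flat.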
There is overhead in converting a list to a set, but a set is substantially faster than a list for those "in" membership tests.
You can instantly see whether item x is in set y because there's a hash table being used underneath. No matter how large your set is, the lookup time stays (roughly) the same - this is known in Big-O notation as O(1). For a list, you have to check the elements one by one to see whether item x is in list z. As your list grows, the check takes longer - this is O(n), meaning the length of the operation is directly tied to how long the list is.
That increased speed can offset the set creation overhead, which is how your set check ends up being faster.
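As a rough sketch of both effects together (the one-off conversion cost plus the per-element lookup cost), something like the following can be used; the sizes are arbitrary and the exact figures will vary by machine:

import timeit

A = list(range(10_000))
B = list(range(0, 10_000, 2))   # every other element of A

def method1():
    s = set(B)                            # one-off O(len(B)) conversion
    return [x for x in A if x not in s]   # O(1) average lookup per element

def method2():
    return [x for x in A if x not in B]   # O(len(B)) scan per element

print("method 1 (set):", timeit.timeit(method1, number=100))
print("method 2 (list):", timeit.timeit(method2, number=100))

Even with the set built inside the timed function, method 1 typically comes out well ahead at these sizes.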
EDIT: to answer that other question, Python has no way of determining that your list is sorted - not if you're using a standard list object, anyway. So it can't achieve O(log n) performance with a list comprehension. If you wanted to write your own binary search method which assumes the list is sorted, you can certainly do so, but O(1) beats O(log n) any day.
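For completeness, here is a minimal sketch of such a binary-search membership test built on the standard-library bisect module (the in_sorted helper is just an illustrative name, not something the standard library provides):

from bisect import bisect_left

def in_sorted(sorted_list, x):
    # Binary search: O(log n) membership test on a sorted list.
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x

A = [1, 2, 3, 4, 5, 6, 7, 8, 9]
B = [1, 3, 5, 8, 9]
C = [x for x in A if not in_sorted(B, x)]
print(C)   # [2, 4, 6, 7]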
EDIT 2:
I'm aware that a set's average O(1) lookup time beats that of a list's O(n) but if the original list A contains about a million or so integers, wouldn't the set creation actually take longer?
No, not at all. Creating a set out of a list is an O(n) operation, since inserting an item into a set is O(1) and you're doing that n times. If you have a list with a million integers in it, the set-based approach involves two O(n) steps (building the set, then scanning A with O(1) lookups), while repeatedly scanning the list is n O(n) steps. In practice, the set-based approach is going to be roughly 250,000 times faster for lists with a million integers, and the speed difference will grow larger and larger the more items you have in your list.
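To put the "wouldn't the set creation take longer?" worry in perspective, a quick sketch like this (numbers are machine-dependent) shows that building the set costs the same order of magnitude as a single full scan of the list - and Method 2 pays roughly one such scan for every element of A:

import timeit

B = list(range(1_000_000))

build_set = timeit.timeit(lambda: set(B), number=10)
one_scan = timeit.timeit(lambda: -1 in B, number=10)   # -1 is absent, so the scan visits every element

print(f"building set(B): {build_set / 10:.4f}s per call")
print(f"one full list scan: {one_scan / 10:.4f}s per call")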
Average time complexity for lookup (x in S) in a set is O(1) while the same for a list is O(n).
You can check the details at https://wiki.python.org/moin/TimeComplexity