I have a list of elements that is unique by construction, i.e. no element appears more than once in the list.
I want to efficiently test whether an item is present in that list, and to do this for many items.
The test is more efficient if I convert the list into a set.
Now my questions are about how to efficiently build the set.
I guess that when I do my_set = set(my_list), Python somehow has to test the membership of each item as it progressively constructs the set.
Is this sub-optimal, given that I already know the list contains no duplicates?
Is it possible to do better?
Does the answer to the question above change if instead of a list, I have an iterator (about which I still know that the items it will yield will be unique)?
Python doesn't do a separate membership test when constructing a set. It doesn't need to: sets are backed by a hash table, so each member is stored at a slot derived from its hash value. In constructing a set, all Python does is hash each element in turn and insert it at the corresponding slot; an equality comparison only happens when two elements land on the same slot with equal hashes.
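You can see this for yourself with a small sketch: the `Tracked` wrapper class below (a hypothetical name for illustration) counts calls to `__hash__` and `__eq__` while `set()` consumes a duplicate-free list.

```python
# Hypothetical demo class: counts how often set() construction calls
# __hash__ and __eq__ on its elements.
class Tracked:
    hash_calls = 0
    eq_calls = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        Tracked.hash_calls += 1
        return hash(self.value)

    def __eq__(self, other):
        Tracked.eq_calls += 1
        return self.value == other.value

items = [Tracked(i) for i in range(1000)]
s = set(items)

print(Tracked.hash_calls)  # one hash per element: 1000
print(Tracked.eq_calls)    # equality checks only on hash collisions
```

With small distinct integers the hashes are all distinct, so in CPython the equality count stays at zero: every element is hashed exactly once, and no membership-style comparison pass occurs.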
The Python docs on time complexity don't explicitly list set construction, but they do say that most set operations match their dict counterparts, and insertion into a dict is O(1) on average, from which we can conclude that set construction is O(n).
Since set() uses a hash table (see How is set() implemented?), the time goes into hashing rather than comparing, and the hashing is unavoidable: knowing the input is duplicate-free doesn't let you skip it.
I assume your dataset is very large if you're this concerned about performance. The only way you're going to get better performance is by building the set() directly from the iterator in the first place, avoiding the intermediate list() and its memory usage. This also answers your last question: an iterator works just as well as a list here, and saves memory.
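As a sketch (with a made-up generator standing in for your real data source), feeding a generator straight into set() builds the set incrementally without ever materializing the full list:

```python
# Hypothetical data source: stands in for values read from a file, DB, etc.
def read_values(n):
    for i in range(n):
        yield i * 2

# Builds the set incrementally; peak memory holds only the set itself.
unique = set(read_values(100000))

# The list-based alternative holds the list and the set at once.
as_list = list(read_values(100000))
unique_via_list = set(as_list)

assert unique == unique_via_list
print(len(unique))  # 100000
```

Both forms produce the same set; the difference is only the transient memory spent on the intermediate list, which is what the timings below measure.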
$ python3 -m timeit 'set(list(range(100000)))'
100 loops, best of 3: 8.69 msec per loop
$ python3 -m timeit 'set(range(100000))'
100 loops, best of 3: 7.67 msec per loop
$ python3 -m timeit 'frozenset(range(100000))'
100 loops, best of 3: 7.68 msec per loop