Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently build a python set from a list of unique elements

I have a list of elements which should be unique by construction. I mean, no element will be present more than once in the list.

I want to efficiently test if an item is present in that list, and this for many items.

The test is more efficient if I convert the list into a set.

Now my questions are about how to efficiently build the set.

I guess that when I do my_set = set(my_list), python somehow has to test membership of items in the list as it progressively constructs the set.

  1. Is this sub-optimal given the knowledge that I know that the list does not contain duplicates?

  2. Is it possible to do better?

  3. Does the answer to the question above change if instead of a list, I have an iterator (about which I still know that the items it will yield will be unique)?

like image 280
bli Avatar asked Apr 28 '26 08:04

bli


2 Answers

Python doesn't do explicit membership tests when constructing a set. It doesn't need to; sets are unique by their nature, which is that the members are indexed by their hash values. So in constructing a set, all Python does is hash each element in turn and then insert it in the appropriate place.

The Python docs on time complexity don't explicitly list set construction, but they do say that most operations are the same as for a dict, and insertion into a dict is O(1), by which we can assume that set construction is O(n).

like image 143
Daniel Roseman Avatar answered Apr 30 '26 21:04

Daniel Roseman


Since set() uses a hashtable (see How is set() implemented?), more time would be spent hashing than comparing, and this is unavoidable.

I assume your dataset is very large if you're this concerned about performance. The only way you're going to get better performance is by creating the set() in the first place and avoiding the intermediate memory usage of the list().

$ python3 -m timeit 'set(list(range(100000)))'
100 loops, best of 3: 8.69 msec per loop

$ python3 -m timeit 'set(range(100000))'
100 loops, best of 3: 7.67 msec per loop

$ python3 -m timeit 'frozenset(range(100000))'
100 loops, best of 3: 7.68 msec per loop
like image 38
rezca Avatar answered Apr 30 '26 21:04

rezca