I have a list of lists (which can contain up to 90k elements)
[[1,2,3], [1,2,4], [1,2,3], [1,2,4], [1,2,5]]
I would like to assign an id to each elements, where the id is unique, except when the item is duplicated. So for the list above, I'd need this back:
[0,1,0,1,2]
What is the most effective way to do this?
creating a unique random id To create a random id, you call the uuid4 () method and it will automatically generate a unique id for you just as shown in the example below; Example of usage.
any(string. count(x)>=2 for x in lst) Will be true if at least 1 item in lst appears twice or more in string.
The simplest way to generate identifiers is by a serial number. A steadily increasing number that is assigned to whatever you need to identify next. This is the approached used in most internal databases as well as some commonly encountered public identifiers.
Keep a map of already seen elements with the associated id.
from itertools import count
from collections import defaultdict
mapping = defaultdict(count().__next__)
result = []
for element in my_list:
result.append(mapping[tuple(element)])
you could also use a list-comprehension:
result = [mapping[tuple(element)] for element in my_list]
Unfortunately list
s aren't hashable so you have to convert them to a tuple
when storing them as keys of the mapping.
Note the trick of using defaultdict
, and count().__next__
to provide unique increasing ids. On python2 you have to replace .__next__
with .next
.
The defaultdict
will assign a default value when it cannot find a key. The default value is obtained by calling the function provided in the constructor.
In this case the __next__
method of the count()
generator yields increasing numbers.
As a more portable alternative you could do:
from functools import partial
mapping = defaultdict(partial(next, count()))
An alternative solution, as proposed in the comments, is to just use the index as unique id:
result = [my_list.index(el) for el in my_list]
This is imple however:
For comparison of the two solutions see:
In [1]: from itertools import count
...: from collections import defaultdict
In [2]: def hashing(seq):
...: mapping = defaultdict(count().__next__)
...: return [mapping[tuple(el)] for el in seq]
...:
In [3]: def indexing(seq):
...: return [seq.index(i) for i in seq]
...:
In [4]: from random import randint
In [5]: seq = [[randint(1, 20), randint(1, 20), randint(1, 20)] for _ in range(90000)]
In [6]: %timeit hashing(seq)
10 loops, best of 3: 37.7 ms per loop
In [7]: %timeit indexing(seq)
1 loop, best of 3: 26 s per loop
Note how for a 90k element list the mapping solution takes less 40 milliseconds while the indexing solution takes 26 seconds.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With