I have a list, say l = [10, 10, 20, 15, 10, 20]. I want to assign each unique value a certain "index", to get [1, 1, 2, 3, 1, 2].
This is my code:

a = list(set(l))
res = [a.index(x) for x in l]
This turns out to be very slow. l has 1M elements and 100K unique elements. I have also tried map with lambda and sorting, which did not help. What is the ideal way to do this?
You can do this in O(N) time using a defaultdict and a list comprehension:
>>> from itertools import count
>>> from collections import defaultdict
>>> lst = [10, 10, 20, 15, 10, 20]
>>> d = defaultdict(count(1).next)
>>> [d[k] for k in lst]
[1, 1, 2, 3, 1, 2]
In Python 3, use __next__ instead of next.
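For reference, the equivalent session in Python 3 would look like this:

>>> from itertools import count
>>> from collections import defaultdict
>>> lst = [10, 10, 20, 15, 10, 20]
>>> d = defaultdict(count(1).__next__)
>>> [d[k] for k in lst]
[1, 1, 2, 3, 1, 2]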
If you're wondering how this works: the default_factory (i.e. count(1).next in this case) passed to defaultdict is called only when Python encounters a missing key. So for the first 10 the factory is called and the value is 1; the next 10 is no longer a missing key, so the previously computed 1 is reused. Then 20 is again a missing key, so Python calls the default_factory again to get its value, and so on.
d at the end will look like this:

>>> d
defaultdict(<method-wrapper 'next' of itertools.count object at 0x1057c83b0>, {10: 1, 20: 2, 15: 3})
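To see exactly when the factory runs, here is a small illustrative sketch; the wrapper function and its print calls are my own addition, not part of the original answer:

from itertools import count
from collections import defaultdict

counter = count(1)

def factory():
    # Only invoked by defaultdict when a key is missing.
    value = next(counter)
    print("factory called, returning", value)
    return value

d = defaultdict(factory)
d[10]  # prints: factory called, returning 1
d[10]  # no print; 10 is already present
d[20]  # prints: factory called, returning 2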
The slowness of your code arises because a.index(x) performs a linear search, and you perform that linear search for each of the elements in l. So for each of the 1M items you perform up to 100K comparisons.
The fastest way to transform one value to another is to look it up in a map. You'll need to create the map and fill in the relationship between the original values and the values you want, then retrieve the mapped value each time you encounter the same original value in your list.
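To make the difference concrete, here is a rough timing sketch; the sizes are scaled down from the question's 1M/100K so the slow version finishes quickly, and exact numbers will vary by machine:

import random
import time

# Scaled-down synthetic data: 100K elements, ~1K unique values.
l = [random.randrange(1_000) for _ in range(100_000)]

# Original approach: list.index is a linear scan, so this is O(N * K).
a = list(set(l))
t0 = time.perf_counter()
res_slow = [a.index(x) for x in l]
print("list.index:", time.perf_counter() - t0, "s")

# Map-based approach: average O(1) dict lookups, so this is O(N).
lookup = {v: i for i, v in enumerate(a)}
t0 = time.perf_counter()
res_fast = [lookup[x] for x in l]
print("dict lookup:", time.perf_counter() - t0, "s")

assert res_slow == res_fast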
Here is an example that makes a single pass through l. There may be room for further optimization to eliminate the need to repeatedly reallocate res when appending to it (see the sketch after the code).
res = []
conversion = {}
i = 1  # start at 1 to match the desired output [1, 1, 2, 3, 1, 2]
for x in l:
    if x not in conversion:
        # First time we see x: assign it the next index.
        value = conversion[x] = i
        i += 1
    else:
        # Seen before: reuse its index.
        value = conversion[x]
    res.append(value)
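One possible refinement along those lines, offered as a sketch rather than part of the original answer: preallocate res to its final length so append() never has to grow the list.

res = [0] * len(l)  # preallocated; never grows during the loop
conversion = {}
i = 1
for pos, x in enumerate(l):
    value = conversion.get(x)
    if value is None:
        # Safe sentinel: assigned indices start at 1, so None never collides.
        value = conversion[x] = i
        i += 1
    res[pos] = value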