
Indexing a list with a unique index

I have a list say l = [10,10,20,15,10,20]. I want to assign each unique value a certain "index" to get [1,1,2,3,1,2].

This is my code:

    a = list(set(l))
    res = [a.index(x) for x in l]

Which turns out to be very slow.

l has 1M elements, and 100K unique elements. I have also tried map with lambda and sorting, which did not help. What is the ideal way to do this?

asked Dec 16 '15 13:12 by Yfiua



2 Answers

You can do this in O(N) time using a defaultdict and a list comprehension:

    >>> from itertools import count
    >>> from collections import defaultdict
    >>> lst = [10, 10, 20, 15, 10, 20]
    >>> d = defaultdict(count(1).next)
    >>> [d[k] for k in lst]
    [1, 1, 2, 3, 1, 2]

The code above is Python 2. In Python 3, use count(1).__next__ instead of count(1).next.
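For reference, a minimal Python 3 sketch of the same defaultdict approach. The function name index_by_first_seen and its start parameter are illustrative assumptions, not part of the answer above.

```python
from collections import defaultdict
from itertools import count


def index_by_first_seen(values, start=1):
    """Map each distinct value to an integer, numbered by first appearance."""
    # count(start).__next__ returns start, start + 1, ... on successive calls.
    # defaultdict invokes it only for keys it has not stored yet.
    ids = defaultdict(count(start).__next__)
    return [ids[v] for v in values]


lst = [10, 10, 20, 15, 10, 20]
print(index_by_first_seen(lst))  # [1, 1, 2, 3, 1, 2]
```

Each element costs one dictionary lookup on average, so the whole pass is linear in the length of the list.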


How it works:

The default_factory (here, count(1).next) passed to defaultdict is called only when Python encounters a missing key. The first 10 is a missing key, so it gets the value 1. The next 10 is no longer missing, so the previously stored 1 is reused. 20 is a missing key again, so Python calls default_factory once more to get its value, and so on.

d at the end will look like this:

    >>> d
    defaultdict(<method-wrapper 'next' of itertools.count object at 0x1057c83b0>,
                {10: 1, 20: 2, 15: 3})
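To see that the factory runs only for missing keys, here is a small Python 3 sketch that records each call. The next_id function and the calls list are hypothetical helpers introduced only for this illustration.

```python
from collections import defaultdict

calls = []  # one entry is appended each time the factory runs


def next_id():
    # Hypothetical factory: hands out 1, 2, 3, ... and logs each call.
    calls.append(len(calls) + 1)
    return calls[-1]


d = defaultdict(next_id)
result = [d[k] for k in [10, 10, 20, 15, 10, 20]]

print(result)      # [1, 1, 2, 3, 1, 2]
print(len(calls))  # 3, one call per distinct key
print(dict(d))     # {10: 1, 20: 2, 15: 3}
```

Six lookups trigger only three factory calls, one for each distinct value.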
answered Sep 22 '22 10:09 by Ashwini Chaudhary


The slowness of your code arises because a.index(x) performs a linear search and you perform that linear search for each of the elements in l. So for each of the 1M items you perform (up to) 100K comparisons.

The fastest way to transform one value to another is looking it up in a map. You'll need to create the map and fill in the relationship between the original values and the values you want. Then retrieve the value from the map when you encounter another of the same value in your list.
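As one possible sketch of that idea, the code from the question can keep its list a and swap a.index(x) for a dictionary lookup. The name position is an assumption made for this example. Note that list(set(l)) has no guaranteed order, so the numbers are consistent per value but need not follow first appearance.

```python
l = [10, 10, 20, 15, 10, 20]

a = list(set(l))                            # distinct values, arbitrary order
position = {v: i for i, v in enumerate(a)}  # value -> its position in a
res = [position[x] for x in l]              # average O(1) lookup per element

# Same result as the slow version, without a linear search per element.
print(res == [a.index(x) for x in l])  # True
```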

Here is an example that makes a single pass through l. There may be room for further optimization to eliminate the need to repeatedly reallocate res when appending to it.

    res = []
    conversion = {}
    i = 0
    for x in l:
        if x not in conversion:
            value = conversion[x] = i
            i += 1
        else:
            value = conversion[x]
        res.append(value)
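A more compact variant of the same single pass, offered as a sketch rather than as the answerer's code: dict.setdefault stores the next number only when the key is missing, and a list comprehension builds res without explicit append calls. Adding 1 gives the 1-based numbering from the question; drop it for 0-based output like the loop above.

```python
l = [10, 10, 20, 15, 10, 20]

conversion = {}
# len(conversion) is evaluated before setdefault runs, so a new key gets
# the count of keys seen so far, plus 1 to start numbering at 1.
res = [conversion.setdefault(x, len(conversion) + 1) for x in l]

print(res)         # [1, 1, 2, 3, 1, 2]
print(conversion)  # {10: 1, 20: 2, 15: 3}
```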
answered Sep 23 '22 10:09 by dsh