Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

map array of numbers to rank efficiently in Python

Hi I'm trying to map an array of numbers to their ranks. So for example [2,5,3] would become [0,2,1].

I'm currently using np.where to lookup the rank in an array, but this is proving to take a very long time as I have to do this for a very large array (over 2 million datapoints).

If anyone has any suggestions on how I could achieve this, I'd greatly appreciate it!

[EDIT] This is what the code to change a specific row currently looks like:

def change_nodes(row): 
  a = row
  new_a = node_map[node_map[:,1] == a][0][0]
  return new_a

[EDIT 2] Duplicated numbers should additionally have the same rank

[EDIT 3] Additionally, unique numbers should only count once towards the ranking. So for example, the rankings for this list [2,3,3,4,5,7,7,7,7,8,1], would be:

{1:0, 2:1, 3:2, 4:3, 5:4, 7:5, 8:6 }

like image 530
chris Avatar asked Feb 15 '26 23:02

chris


2 Answers

What you want to use is numpy.argsort:

>>> import numpy as np
>>> x = np.array([2, 5, 3])
>>> x.argsort()
array([0, 2, 1])

See this question and its answers for thoughts on adjusting how ties are handled.

like image 146
dbliss Avatar answered Feb 17 '26 13:02

dbliss


Here is an efficient solution and a comparison with the solution using index (the index solution is also not correct with the added (edit 3) restriction to the question)

import numpy as np

def rank1(x):
    # Sort values i = 0, 1, 2, .. using x[i] as key
    y = sorted(range(len(x)), key = lambda i: x[i])
    # Map each value of x to a rank. If a value is already associated with a
    # rank, the rank is updated. Iterate in reversed order so we get the
    # smallest rank for each value.
    rank = { x[y[i]]: i for i in xrange(len(y) -1, -1 , -1) }
    # Remove gaps in the ranks
    kv = sorted(rank.iteritems(), key = lambda p: p[1])
    for i in range(len(kv)):
        kv[i] = (kv[i][0], i)
    rank = { p[0]: p[1] for p in kv }
    # Pre allocate a array to fill with ranks
    r = np.zeros((len(x),), dtype=np.int)
    for i, v in enumerate(x):
        r[i] = rank[v]
    return r

def rank2(x):
    x_sorted = sorted(x)
    # creates a new list to preserve x
    rank = list(x)
    for v in x_sorted:
        rank[rank.index(v)] = x_sorted.index(v)
    return rank

Comparison results

>>> d = np.arange(1000)
>>> random.shuffle(d)
>>> %timeit rank1(d)
100 loops, best of 3: 1.97 ms per loop
>>> %timeit rank2(d)
1 loops, best of 3: 226 ms per loop

>>> d = np.arange(10000)
>>> random.shuffle(d)
>>> %timeit rank1(d)
10 loops, best of 3: 32 ms per loop
>>> %timeit rank2(d)
1 loops, best of 3: 24.4 s per loop

>>> d = np.arange(100000)
>>> random.shuffle(d)
>>> %timeit rank1(d)
1 loops, best of 3: 433 ms per loop

>>> d = np.arange(2000000)
>>> random.shuffle(d)
>>> %timeit rank1(d)
1 loops, best of 3: 11.2 s per loop

The problem with the index solution is that the time complexity is O(n^2). The time complexity of my solution is O(n lg n), that is, the sort time.

like image 33
malbarbo Avatar answered Feb 17 '26 12:02

malbarbo