
Return majority weighted vote from array based on columns

I have a matrix x with dimensions 3 x 3 and a vector w of length 3:

x = np.array([[1, 2, 1],
              [3, 2 ,1],
              [1, 2, 2]])

w = np.array([0.3, 0.4, 0.3])

I need to generate another vector y that is a majority vote for each row of x. Each column of x is weighted by the corresponding value in w. Something like this:

For y[0], it should look at x[0] => [1, 2, 1]:

  • columns with value 1 = first and third [0,2]
  • columns with value 2 = second [1]
  • columns with value 3 = none

Sum the weights (in w) of the columns, grouped by their value in x:

  • sum of weights of columns with value 1: 0.3 + 0.3 = 0.6
  • sum of weights of columns with value 2: 0.4
  • sum of weights of columns with value 3: 0

Since the sum of weights of columns with value 1 is the highest, y[0] = 1. And so on.
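
A plain Python loop that does what I describe, just to make the expected output concrete (the weighted_vote_loop helper below is only for illustration), would be something like:

import numpy as np

def weighted_vote_loop(x, w):
    # for each row, add up the weights of the columns that voted for each value,
    # then keep the value with the largest total weight
    y = np.empty(x.shape[0], dtype=x.dtype)
    for r, row in enumerate(x):
        totals = {}
        for value, weight in zip(row, w):
            totals[value] = totals.get(value, 0.0) + weight
        y[r] = max(totals, key=totals.get)
    return y

print(weighted_vote_loop(x, w))  # [1 2 2]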

asked Sep 11 '20 by heresthebuzz

2 Answers

You can do it with numpy if you understand broadcasting. The downside is that because the code is vectorized, you do more computations than you need. This would matter if the size of the w vector is very large.

Perhaps someone comes up with an easier way to write it, but this is how I would do it without thinking too much.

The answer first:

i = np.arange(3) + 1
m = (x.reshape((1,4,3)) == i.reshape((3,1,1)))
np.argmax(np.sum(m*w, axis=2).T, axis=1) + 1

Now the step-by-step explanation... Note that it is usually better to start counting from zero, but I followed your convention.

  1. I added one row so the array is not square (easier to check shapes):

     In [1]: x = np.array([[1, 2, 1],
        ...:               [3, 2 ,1],
        ...:               [1, 2, 2],
        ...:               [3, 1, 3]])
        ...:
        ...: w = np.array([0.3, 0.4, 0.3])
    
  2. The first step is to have the array of indices i. Your convention starts at one.

     In [2]: i = np.arange(3) + 1
    
  3. The tricky step: create an array with shape (3,4,3), where the i-th entry of the array is a (4,3) array with all entries 0 or 1. It is 1 if and only if x == i. This is done by adding dimensions to x and i so they can be broadcast. The operation basically compares all combinations of x and i, because every dimension of x lines up with a size-1 dimension of i and vice versa:

     In [3]: m = (x.reshape((1,4,3)) == i.reshape((3,1,1)))*1
    
     In [4]: m
     Out[4]:
     array([[[1, 0, 1],
             [0, 0, 1],
             [1, 0, 0],
             [0, 1, 0]],
    
            [[0, 1, 0],
             [0, 1, 0],
             [0, 1, 1],
             [0, 0, 0]],
    
            [[0, 0, 0],
             [1, 0, 0],
             [0, 0, 0],
             [1, 0, 1]]])
    
  4. now you can sum along the rows (which is axis=2) to see how many times each value appears in each row of x; this is the unweighted vote (note that the result is transposed when you compare it to x):

     In [5]: np.sum(m, axis=2)
     Out[5]:
     array([[2, 1, 1, 1],
            [1, 1, 2, 0],
            [0, 1, 0, 2]])
    
  5. I hope you can already see where this is going. You can read it directly: in the first row of x, 1 appears twice and 2 appears once; in the second row of x each value appears once; in the third row of x, 1 appears once and 2 appears twice, etc.

  6. To weight the votes by column, multiply m by w before summing along axis=2 (w broadcasts along the last axis, i.e. the columns), then transpose so that each row corresponds to a row of x:

     In [7]: np.sum(m*w, axis=2).T
     Out[7]: 
     array([[0.6, 0.4, 0. ],
            [0.3, 0.4, 0.3],
            [0.3, 0.7, 0. ],
            [0.4, 0. , 0.6]])
    
  7. get the maximum along the rows (adding one to conform to your convention):

     In [8]: np.argmax(np.sum(m*w, axis=2).T, axis=1) + 1
     Out[8]: array([1, 2, 2, 3])
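
As a variation (a sketch of the same broadcasting idea packed into a function; it takes the candidate values from np.unique, so the labels do not have to be 1..n), using the x and w from above:

def weighted_vote_bcast(x, w):
    values = np.unique(x)                                          # candidate values, shape (k,)
    m = x[np.newaxis, :, :] == values[:, np.newaxis, np.newaxis]   # shape (k, rows, cols)
    totals = np.sum(m * w, axis=2)                                 # weight collected per (value, row)
    return values[np.argmax(totals, axis=0)]                       # winning value per row

weighted_vote_bcast(x, w)  # array([1, 2, 2, 3])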
    

Special Case: a Tie

The following case was brought up in the comments:

x = np.array([[2, 2, 4, 1]])
w = np.array([0.1, 0.2, 0.3, 0.4])

the per-value sums of the column weights are:

[0.4, 0.3, 0., 0.3]

Value 1 wins here, but values 2 and 4 collect exactly the same total weight; with different weights, two values could tie for the maximum and there would be no winner at all. It isn't clear from the question what one would do in that case. One could take all, take none... One can look for these cases at the end:

final_w = np.sum(m*w, axis=2).T
result = np.argmax(final_w, axis=1) + 1
special_cases = np.argwhere(np.sum(final_w == np.max(final_w, axis=1, keepdims=True), axis=1) > 1)
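
For example, one possible policy (a sketch of the 'take none' option; the sentinel value 0 is an arbitrary choice) is to mark the rows that have no unique winner:

result_no_ties = result.copy()
result_no_ties[special_cases.ravel()] = 0  # 0 marks rows where two or more values tie for the maximum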

Note: I used the reshape method for readability, but I often use np.expand_dims or np.newaxis. Something like this:

i = np.arange(3) + 1
m = (x[np.newaxis] == i[:, np.newaxis, np.newaxis])
np.argmax(np.sum(m*w, axis=2).T, axis=1) + 1
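
If the temporary (values, rows, cols) array is a concern (the "more computations than you need" point above), here is a sketch of a variation that accumulates the column weights directly with np.add.at (again assuming labels 1..3):

rows_idx = np.arange(x.shape[0])[:, np.newaxis]
totals = np.zeros((x.shape[0], 3))                                  # one slot per value 1..3
np.add.at(totals, (rows_idx, x - 1), np.broadcast_to(w, x.shape))   # scatter-add each column weight
np.argmax(totals, axis=1) + 1                                       # array([1, 2, 2, 3])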

An alternative: you could also use some kind of compiled code. For example, numba is pretty easy to use in this case.
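
A sketch of what that could look like (my own, assuming integer labels 1..n_values; not benchmarked here):

import numpy as np
from numba import njit

@njit
def weighted_vote_numba(x, w, n_values):
    rows, cols = x.shape
    out = np.empty(rows, dtype=np.int64)
    for r in range(rows):
        # accumulate the total weight each candidate value receives in this row
        totals = np.zeros(n_values)
        for c in range(cols):
            totals[x[r, c] - 1] += w[c]
        out[r] = np.argmax(totals) + 1
    return out

weighted_vote_numba(x, w, 3)  # array([1, 2, 2, 3])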

answered Nov 13 '22 by Javier Gonzalez


Here's a really crazy way to do it, which involves sorting and indexing rather than adding a new dimension. This is sort of like the sort-based method used by np.unique.

First find the sorted indices in each row:

rows = np.repeat(np.arange(x.shape[0]), x.shape[1])  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
cols = np.argsort(x, axis=1).ravel()                 # [0, 2, 1, 2, 1, 0, 0, 1, 2, 1, 0, 2]

Now you can create an array of sorted elements per-column, both unweighted and weighted. The former will be used to get the indices for summing, the latter will actually be summed.

u = x[rows, cols]                            # [1, 1, 2, 1, 2, 3, 1, 2, 2, 1, 3, 3]
v = np.broadcast_to(w, x.shape)[rows, cols]  # [0.3, 0.3, 0.4, 0.3, 0.4, 0.3, 0.3, 0.4, 0.3, 0.4, 0.3, 0.3]

You can find the breakpoints at which to apply np.add.reduceat:

row_breaks = np.diff(rows).astype(bool)            # [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
col_breaks = np.diff(u).astype(bool)               # [0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0]
break_mask = row_breaks | col_breaks               # [0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0]
breaks = np.r_[0, np.flatnonzero(break_mask) + 1]  # [ 0,  2,  3,  4,  5,  6,  7,  9, 10]

Now you have the sums of the weights for identical numbers in each row:

sums = np.add.reduceat(v, breaks)  # [0.6, 0.4, 0.3, 0.4, 0.3, 0.3, 0.7, 0.4, 0.6]

But you need to break them up into segments corresponding to the number of unique elements per row:

unique_counts = np.add.reduceat(break_mask, np.arange(0, x.size, x.shape[1]))
unique_counts[-1] += 1  # the last segment's break is missing from the mask -> [2, 3, 2, 2]

unique_rows = np.repeat(np.arange(x.shape[0]), unique_counts)  # [0, 0, 1, 1, 1, 2, 2, 3, 3]

You can now sort each segment to find the maximum value:

indices = np.lexsort(np.stack((sums, unique_rows), axis=0))  # [1, 0, 2, 4, 3, 5, 6, 7, 8]

The index at the end of each run is given by:

max_inds = np.cumsum(unique_counts) - 1  # [1, 4, 6, 8]

So the maximum sums are:

sums[indices[max_inds]]  # [0.6, 0.4, 0.7, 0.6]

And you can unravel the indices-within-indices to get the correct element from each row. Notice that max_inds, and everything that depends on it, has one entry per row of x (size x.shape[0]), as expected:

result = u[breaks[indices[max_inds]]]  # [1, 2, 2, 3]

This method does not look very pretty, but it is likely more space efficient than using an extra dimension on the array. Additionally, it works regardless of the numbers in x. Notice that I never subtracted anything or adjusted x in any way. In fact, all the rows are treated independently, and the coincidence of a maximum element in one row being identical to the minimum of the next row is broken by row_breaks when constructing breaks.

TL;DR

Enjoy:

def weighted_vote(x, w):
    rows = np.repeat(np.arange(x.shape[0]), x.shape[1])
    cols = np.argsort(x, axis=1).ravel()
    u = x[rows, cols]
    v = np.broadcast_to(w, x.shape)[rows, cols]
    row_breaks = np.diff(rows).astype(bool)
    col_breaks = np.diff(u).astype(bool)
    break_mask = row_breaks | col_breaks
    breaks = np.r_[0, np.flatnonzero(break_mask) + 1]
    sums = np.add.reduceat(v, breaks)
    unique_counts = np.add.reduceat(break_mask, np.arange(0, x.size, x.shape[1]))
    unique_counts[-1] += 1
    unique_rows = np.repeat(np.arange(x.shape[0]), unique_counts)
    indices = np.lexsort(np.stack((sums, unique_rows), axis=0))
    max_inds = np.cumsum(unique_counts) - 1
    return u[breaks[indices[max_inds]]]
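
A quick sanity check (mine, not part of the original answer): the four-row example from the other answer, plus one with arbitrary labels to illustrate that the values in x do not have to be 1..n:

x = np.array([[1, 2, 1],
              [3, 2, 1],
              [1, 2, 2],
              [3, 1, 3]])
w = np.array([0.3, 0.4, 0.3])
print(weighted_vote(x, w))    # [1 2 2 3]

x2 = np.array([[10, 42, 10],
               [ 7, 42,  7]])
w2 = np.array([0.2, 0.5, 0.4])
print(weighted_vote(x2, w2))  # [10  7]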

Benchmarks

Benchmarks are run in the following format:

rows = ...
cols = ...
x = np.random.randint(cols, size=(rows, cols)) + 1
w = np.random.rand(cols)
%timeit weighted_vote_MP(x, w)
%timeit weighted_vote_JG(x, w)
assert (weighted_vote_MP(x, w) == weighted_vote_JG(x, w)).all()

I used the following generalization for weighted_vote_JG, with appropriate corrections:

def weighted_vote_JG(x, w):
    i = np.arange(w.size) + 1
    m = (x[None, ...] == i.reshape(-1, 1, 1))
    return np.argmax(np.sum(m * w, axis=2), axis=0) + 1

Rows: 100, Cols: 10

  MP: 440 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
* JG: 153 µs ± 796 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Rows: 1000, Cols: 10

  MP: 2.53 ms ± 43.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
* JG: 1.03 ms ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Rows: 10000, Cols: 10

  MP: 23.5 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
* JG: 16.6 ms ± 67.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Rows: 100000, Cols: 10

  MP: 322 ms ± 3.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
* JG: 188 ms ± 858 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Rows: 100, Cols: 100

* MP: 3.31 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  JG: 12.6 ms ± 244 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Rows: 1000, Cols: 100

* MP: 31 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  JG: 134 ms ± 581 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Rows: 10000, Cols: 100

* MP: 417 ms ± 7.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  JG: 1.42 s ± 126 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Rows: 100000, Cols: 100

* MP: 4.94 s ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  JG: MemoryError: Unable to allocate 7.45 GiB for an array with shape (100, 100000, 100) and data type float64

Moral of the story: for a small number of columns and weights, the expanded (broadcasting) solution is faster. For a larger number of columns, use my version instead.

answered Nov 13 '22 by Mad Physicist