
Is there a faster way to do this pseudocode efficiently in Python/NumPy?

Tags: python, numpy

I have three NumPy arrays called RowIndex, ColIndex and Entry. Together they hold a subset of the entries of a matrix: the row index, column index and value of each known entry, respectively. I also have two NumPy 2D arrays (matrices) U and M, and two given constants alpha and beta. I need to iterate through this subset of entries, which I can do by stepping through RowIndex, ColIndex and Entry in lockstep. Say,

i=RowIndex[0], j=ColIndex[0], value = Entry[0] 

then I need to update the i'th row of U and the j'th column of M according to some equation. Then I set

i=RowIndex[1], j=ColIndex[1], value = Entry[1]

and so on. The detail is below.

for it in range(len(RowIndex)):
    i = RowIndex[it]
    j = ColIndex[it]
    value = Entry[it]
    e = value - np.dot(U[i, :], M[:, j])
    OldUi = U[i, :].copy()  # copy, so the M update below sees the old row, not the new one
    OldMj = M[:, j].copy()
    U[i, :] = OldUi + beta * (e * OldMj - alpha * OldUi)
    M[:, j] = OldMj + beta * (e * OldUi - alpha * OldMj)

The problem is that this code is extremely slow. Is there any part of it that I can speed up?

PS: For the curious, this is a variant of the prize-winning solution to the famous Netflix Prize problem. RowIndex corresponds to users, ColIndex corresponds to movies and Entry to their ratings. Most of the ratings are missing; the known ratings are stacked up in RowIndex, ColIndex and Entry. You then try to find matrices U and M such that the rating of the i'th user for the j'th movie is given by np.dot(U[i,:], M[:,j]). Based on the available ratings, you estimate U and M (row by row and column by column) using an update equation like the one in the code above.
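For readers unfamiliar with this setup, here is a minimal sketch of what the inputs could look like; the sizes, ratings and constants below are made up purely for illustration:

import numpy as np

# Hypothetical sizes: 4 users, 5 movies, latent dimension k = 3
n_users, n_movies, k = 4, 5, 3
U = np.random.rand(n_users, k)    # one row of factors per user
M = np.random.rand(k, n_movies)   # one column of factors per movie

# Known ratings in coordinate form (made-up numbers)
RowIndex = np.array([0, 1, 3])        # which user
ColIndex = np.array([2, 4, 0])        # which movie
Entry = np.array([4.0, 3.5, 5.0])     # the rating itself

alpha, beta = 0.01, 0.05              # made-up regularization and learning-rate constants

# Predicted rating of user RowIndex[0] for movie ColIndex[0]
pred = np.dot(U[RowIndex[0], :], M[:, ColIndex[0]])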

asked Sep 28 '22 by dineshdileep



1 Answer

If I haven't misunderstood, your code can be vectorized as follows:

import numpy as np

U = ...         # 2D matrix
M = ...         # 2D matrix
rows_idx = ...  # list of row indexes
cols_idx = ...  # list of column indexes
values = ...    # np.array() of values

e = values - np.dot(U[rows_idx, :], M[:, cols_idx]).diagonal()
Uo = U.copy()
Mo = M.copy()
U[rows_idx, :] += beta * ((e * Mo[:, cols_idx]).T - alpha * Uo[rows_idx, :])
M[:, cols_idx] += beta * ((e * Uo[rows_idx, :].T) - alpha * Mo[:, cols_idx])

Here,

e = values - np.dot(U[rows_idx, :], M[:, cols_idx]).diagonal()

computes your

e = value - np.dot(U[i,:],M[:,j])

Note that the result you want resides in the diagonal of the dot product between matrices.
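As a side note, np.dot here builds the full len(rows_idx) x len(cols_idx) product only to keep its diagonal, which can get expensive for long index lists. An equivalent way to compute just those diagonal terms (assuming the same shapes as above) is np.einsum:

e = values - np.einsum('ij,ji->i', U[rows_idx, :], M[:, cols_idx])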

This won't handle sequential updates (there is no vectorization available for those), but it will let you perform a batch of independent updates in a vectorized and faster way.


As stated above, the code I proposed can't handle sequential updates because, by definition, a sequential updating scheme can't be vectorized. Anything of the form

A(t) = A(t-1) + something   (or A(t) = A(t-1) * something)

where t denotes time, can't be updated in parallel.

So what I proposed is a vectorized update for independent updates.
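As a related NumPy detail: even setting aside the sequential dependence, fancy-indexed in-place updates are buffered, so duplicate indexes inside a single vectorized update would not accumulate anyway. A tiny illustration:

import numpy as np

a = np.zeros(3)
a[[0, 0, 1]] += 1           # duplicate index 0: the two +1's do NOT accumulate
print(a)                    # [1. 1. 0.]
np.add.at(a, [0, 0, 1], 1)  # the unbuffered version does accumulate
print(a)                    # [3. 2. 0.]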

Imagine U and M are both 10x10 matrices, and you have the following row and column indexes:

rows_idx = [1, 1, 3, 4, 5, 0]
cols_idx = [7, 1, 7, 5, 6, 5]

From there you can identify two independent sets (keeping each row/column pair together):

rows_idx = [1, 4, 5], [1, 3, 0]
cols_idx = [7, 5, 6], [1, 7, 5]

Note that within an independent set, the row indexes are all distinct and the column indexes are all distinct. With that definition, you can reduce the number of loop iterations from 6 (in this case) to 2:

for i in range(len(rows_idx)):
    ridx = rows_idx[i]
    cidx = cols_idx[i]
    vals = values[i]  # the entries, grouped into the same independent sets
    # Use the vectorized scheme proposed above the edit
    e = vals - np.dot(U[ridx, :], M[:, cidx]).diagonal()
    Uo = U.copy()
    Mo = M.copy()
    U[ridx, :] += beta * ((e * Mo[:, cidx]).T - alpha * Uo[ridx, :])
    M[:, cidx] += beta * ((e * Uo[ridx, :].T) - alpha * Mo[:, cidx])

So, whether you have a way of manually (or easily) extracting the independent sets, or you compute them with a search algorithm, the above code will vectorize the independent updates.
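For concreteness, with the example indexes above (positions 0, 3, 4 form one independent set and positions 1, 2, 5 the other), the grouped inputs that the loop indexes into would look something like this; all_values here is a hypothetical np.array holding the six original ratings:

rows_idx = [[1, 4, 5], [1, 3, 0]]
cols_idx = [[7, 5, 6], [1, 7, 5]]
values = [all_values[[0, 3, 4]], all_values[[1, 2, 5]]]  # ratings grouped the same way as the indexes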


For clarification just in case, in the above example:

rows_idx = [1, 1, 3, 4, 5, 0]
cols_idx = [7, 1, 7, 5, 6, 5]

The 2nd pair can't join the first set because row 1 has appeared before, and the 3rd and last pairs can't join it for the same reason (columns 7 and 5 have appeared before). So, as both rows and columns need to be unique within a set, we end up with 2 sets of tuples:

rows_idx = [1, 4, 5], [1, 3, 0]
cols_idx = [7, 5, 6], [1, 7, 5]

From here, the way to go depends on your data. The problem of finding independent sets can be very expensive, especially if most of the updates depend on previous ones.

If you have a way to extract independent sets from your data (say, your data is recorded over time), then the batch update will help you. On the other hand, if your data comes all together (which is common), it depends on one factor:

If you can assure that the length N of the independent sets is much larger than the number M of independent sets (roughly: you end up with only a few independent sets, M = 2, 3 or 4, for N = 100000 row/col indexes, so N >> M), then it might be worth looking for independent sets.

In other words, if you are going to update 30 users and 30 movies in 10000 different combinations, then your updates will likely depend on previous ones; however, if you are going to update 100000 users and 100000 movies in only 30 combinations, then your updates are likely to be independent.

Some pseudocode to find the independent sets, if you don't have another way of extracting them from your data, would be something like this:

independent_sets = []  # list of (row_set, col_set, positions) tuples

for pos, (row, col) in enumerate(zip(rows_idx, cols_idx)):
    for rset, cset, positions in independent_sets:
        if row not in rset and col not in cset:
            rset.add(row)
            cset.add(col)
            positions.append(pos)
            break
    else:
        # no existing set could take this pair: start a new independent set with it
        independent_sets.append(({row}, {col}, [pos]))

As you can see, in order to find independent sets you already need to iterate over the whole list of row/column indexes. The code above is not the most efficient, and I'm pretty sure there are specific algorithms for this. But the cost of finding the independent sets might be higher than just doing all your sequential updates, if your updates are likely to depend on previous ones.

To finish: after the whole post, it entirely depends on your data.

  • If the way you obtain the rows/columns you want to update lets you extract independent sets beforehand, then you can easily update them in a vectorized way.

  • If you can ensure that a good portion of your updates will be independent (say, 990 out of 10000 will be), it might be worth trying to find that set of 990. One way to approximate the set is by using np.unique:

    # Just get the index of the unique rows and columns
    # (np.unique returns the position of the first occurrence of each value)
    _, idx_rows = np.unique(rows_idx, return_index=True)
    _, idx_cols = np.unique(cols_idx, return_index=True)

    # Get the positions where both the row and the column are unique
    idx = np.intersect1d(idx_rows, idx_cols)
    

    Now idx contains the positions in rows_idx and cols_idx where both the row and the column are unique; hopefully this reduces your computational cost a lot. You can use my batch update to quickly update the rows and columns at those positions, and then fall back to your initial sequential approach for the hopefully few remaining entries that are repeated (see the sketch after this list for one way to split them).

  • If you have multiple updates for the same users or movies, then keep your sequential update scheme, as finding independent sets will be harder than just doing the iterative update.
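A rough sketch of that two-phase split, assuming the flat rows_idx, cols_idx and values arrays (one entry per rating), the U, M, alpha, beta from the question, and the idx computed with np.unique above; this is an illustration, not part of the original answer:

import numpy as np

rows_idx = np.asarray(rows_idx)
cols_idx = np.asarray(cols_idx)
values = np.asarray(values)

# Positions that are safe to update in one vectorized batch
mask = np.zeros(len(rows_idx), dtype=bool)
mask[idx] = True

# Vectorized batch update (the scheme from the top of the answer)
e = values[mask] - np.dot(U[rows_idx[mask], :], M[:, cols_idx[mask]]).diagonal()
Uo, Mo = U.copy(), M.copy()
U[rows_idx[mask], :] += beta * ((e * Mo[:, cols_idx[mask]]).T - alpha * Uo[rows_idx[mask], :])
M[:, cols_idx[mask]] += beta * ((e * Uo[rows_idx[mask], :].T) - alpha * Mo[:, cols_idx[mask]])

# Sequential fallback for the remaining (repeated) entries
for t in np.where(~mask)[0]:
    i, j, value = rows_idx[t], cols_idx[t], values[t]
    e = value - np.dot(U[i, :], M[:, j])
    OldUi, OldMj = U[i, :].copy(), M[:, j].copy()
    U[i, :] = OldUi + beta * (e * OldMj - alpha * OldUi)
    M[:, j] = OldMj + beta * (e * OldUi - alpha * OldMj)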

answered Oct 26 '22 by Imanol Luengo