 

Machine Learning Algorithm for Completing Sparse Matrix Data

I've seen some machine learning questions on here so I figured I would post a related question:

Suppose I have a dataset where athletes participate at running competitions of 10 km and 20 km with hilly courses i.e. every competition has its own difficulty.

The finishing times from users are almost inverse normally distributed for every competition.

One can write this problem as a matrix:

           Comp1  Comp2  Comp3
    User1  20min   ??    10min
    User2  25min  20min  12min
    User3  30min  25min   ??
    User4  30min   ??     ??

I would like to complete the matrix above; my real matrix is 1000x20 with a sparseness of 8 % (!).

There should be a fairly easy way to complete this matrix, since I can calculate parameters for every user (ability) and parameters for every competition (mu, lambda of the distributions). Moreover, the correlations between the competitions are very high.

I can also take advantage of the rankings User1 < User2 < User3 and Comp3 << Comp2 < Comp1.

Could you maybe give me a hint which methods I could use?

asked Nov 21 '12 by user1141785

1 Answer

Your astute observation that this is a matrix completion problem gets you most of the way to the solution. I'll codify your intuition that the combination of a user's ability and a course's difficulty yields the time of a race, then present two algorithms.

Model

Let the vector u denote the speed of the users so that u_i is user i's speed. Let the vector v denote the difficulty of the courses so that v_j is course j's difficulty. Also when available, let t_ij be user i's time on course j, and define y_ij = 1/t_ij, user i's speed on course j.

Since you say the times are inverse Gaussian distributed, a sensible model for the observations is

y_ij = u_i * v_j + e_ij,

where e_ij is a zero-mean Gaussian random variable.
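For concreteness, here is a tiny numpy simulation of this model using the example matrix from the question; the course factors, noise level, and seed are illustrative, not fitted values:

```python
import numpy as np

rng = np.random.default_rng(1)
u = np.array([1/20, 1/25, 1/30, 1/30])   # user speeds in 1/min, read off column Comp1
v = np.array([1.0, 1.25, 2.0])           # course "easiness" factors (hypothetical)
e = 0.001 * rng.standard_normal((4, 3))  # zero-mean Gaussian noise e_ij
y = np.outer(u, v) + e                   # speeds: y_ij = u_i * v_j + e_ij
t = 1.0 / y                              # finishing times in minutes
```

With these numbers, `t` reproduces the question's matrix up to noise (e.g. `t[0, 0]` is about 20 minutes, `t[0, 2]` about 10).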

To fit this model, we search for vectors u and v that minimize the prediction error over the observed speeds:

f(u,v) = sum_ij (u_i * v_j - y_ij)^2

Algorithm 1: missing value Singular Value Decomposition

This is the classical Hebbian algorithm. It minimizes the above cost function by gradient descent. The gradients of f with respect to u and v (with sums over the observed entries only) are

df/du_i = 2 sum_j (u_i * v_j - y_ij) v_j
df/dv_j = 2 sum_i (u_i * v_j - y_ij) u_i

Plug these gradients into a Conjugate Gradient solver or BFGS optimizer, like MATLAB's fminunc or scipy's optimize.fmin_ncg or optimize.fmin_bfgs. Don't roll your own gradient descent unless you're willing to implement a very good line search algorithm.
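To make that concrete, here is a minimal sketch of Algorithm 1 with scipy's `minimize` (L-BFGS-B) on synthetic data; the matrix sizes, noise level, mask density, and starting point are all illustrative:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_users, n_comps = 30, 5
u_true = rng.uniform(0.8, 1.2, n_users)      # hypothetical user speeds
v_true = rng.uniform(0.5, 1.5, n_comps)      # hypothetical course factors
Y = np.outer(u_true, v_true) + 0.01 * rng.standard_normal((n_users, n_comps))
M = rng.random((n_users, n_comps)) < 0.7     # True where y_ij is observed

def cost_grad(x):
    """f(u,v) over observed entries, plus its gradient, packed as one vector."""
    u, v = x[:n_users], x[n_users:]
    R = (np.outer(u, v) - Y) * M             # residuals, zeroed on missing cells
    f = np.sum(R**2)
    du = 2 * R @ v                           # df/du_i = 2 sum_j r_ij v_j
    dv = 2 * R.T @ u                         # df/dv_j = 2 sum_i r_ij u_i
    return f, np.concatenate([du, dv])

x0 = np.ones(n_users + n_comps)              # crude but adequate starting point
res = minimize(cost_grad, x0, jac=True, method='L-BFGS-B')
u_hat, v_hat = res.x[:n_users], res.x[n_users:]
Y_hat = np.outer(u_hat, v_hat)               # predictions for every cell, missing or not
```

Note that u and v are only identified up to a shared scale (u*c, v/c give the same fit), but the product Y_hat, which is what you need to fill in the matrix, is unaffected.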

Algorithm 2: matrix factorization with a trace norm penalty

Recently, simple convex relaxations of this problem have been proposed. The resulting algorithms are just as simple to code up and seem to work very well. Check out, for example, "Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm". These methods minimize f(m) = sum_ij (m_ij - y_ij)^2 + lambda * ||m||_*, where ||.||_* is the so-called nuclear norm of the matrix m and lambda controls the strength of the penalty. Implementations again end up computing gradients with respect to u and v and relying on a nonlinear optimizer.

answered Nov 11 '22 by moos