
Handling Incomplete Data (Data Sparsity) in kNN

I am trying to create a simple recommender system using kNN.

Let's say I have a table:

User | Book1 | Book2 | Book3 | Book4 | Book5 | Book6 | Book7 |
1    | 5     | ?     | 3     | ?     | 4     | 3     | 2     |
2    | 3     | 4     | ?     | 2     | 3     | 4     | 2     |
3    | 4     | 2     | 1     | ?     | ?     | 3     | 3     |
4    | 2     | 5     | 3     | ?     | 4     | 1     | 1     |
5    | 1     | 1     | 4     | 3     | 1     | ?     | 1     |
6    | 5     | 2     | 5     | 4     | 4     | 2     | ?     |

So, to find the possible scores for User 1, I was thinking I could just take the absolute difference between the books User 1 has rated and the corresponding ratings from the other users, then use that difference to work out which user is "closest" to User 1. But in a real-world situation there would be many more ?/unknown scores. So how do I deal with those unknown scores when using kNN?

I don't have any code, as I've yet to really understand how to implement this.

Any help is appreciated!

asked May 06 '12 by iCodeLikeImDrunk

2 Answers

You don't have "unknown features"; you have incomplete data points.

This is actually a well-known problem in kNN, and there is a thoroughly validated pattern for dealing with it.

Although the problem is actually an "incomplete data" problem, in the kNN context it's often (usually?) referred to as the sparsity problem.

In practice, the sparsity problem in building knn models is, with the possible exception of efficient storage/retrieval of the data that comprise the model, the crux of kNN.

For instance, consider Amazon.com's recommendation engine, in which product ratings (as user features) comprise the columns and users comprise the rows. For this matrix to be 100% complete, every Amazon customer would have to have purchased and reviewed every single product Amazon sells. The actual sparsity of this matrix must be > 95%.

The most common technique (and still state of the art as far as I know) is known as NNMA, or non-negative matrix approximation. This technique is also often referred to, incorrectly, as NNMF, in which F stands for factorization. (NNMA is based on a factorization technique, but the result is not a set of factors of the original data matrix.) I mention this because the alternate term, though incorrect, is widely used, so I would include it in your search engine queries.

In essence, this technique can be used to remove sparsity from a matrix, or, put another way, to populate the missing cells (i.e., the cell where the customer at row R has not reviewed the product in column C).

You can find a complete implementation of NNMA, including an accompanying tutorial (in Python + NumPy), on Albert Au Yeung Ching-man's blog.

Alternatively, there are several Python packages (available via PyPI) that contain packaged code for NNMA. I have only used one of these, PyMF, which you can find at Google Code.

So that you can see how NNMA works its magic, here is my simple but complete implementation of NNMA in Python + NumPy:

import numpy as NP

def cf(q, v):
    """ the cost function: squared Frobenius distance between q and v """
    qv = (q - v)**2
    return NP.sum(NP.sum(qv, axis=0))


def nnma(d, max_iter=100):
    """ approximate d (with 0s in the missing cells) as a product of two
        non-negative matrices w and h, and return that dense product """
    x, y = d.shape
    z = y                              # latent dimension (here simply set to y)
    w = NP.random.rand(x, y)           # random non-negative initialization
    h = NP.random.rand(y, z)
    for i in range(max_iter):
        wh = NP.dot(w, h)              # current approximation of d
        cost = cf(d, wh)
        if cost == 0:
            break
        # multiplicative update rules; they keep w and h non-negative
        hn = NP.dot(w.T, d)
        hd = NP.dot(NP.dot(w.T, w), h)
        h *= hn/hd
        wn = NP.dot(d, h.T)
        wd = NP.dot(NP.dot(w, h), h.T)
        w *= wn/wd
    return NP.dot(w, h)

To use this NNMA function, just pass in a 2D array (matrix) with a "0" for each missing cell (in other words, your data matrix, with a "0" inserted for each missing value):

>>> d    # the original (sparse) data matrix with missing cells denoted by "0"s

  array([[ 7.,  0.,  4.,  7.,  0.,  1.],
         [ 3.,  9.,  7.,  3.,  1.,  7.],
         [ 4.,  4.,  3.,  7.,  3.,  9.],
         [ 4.,  8.,  0.,  9.,  2.,  1.],
         [ 6.,  3.,  9.,  5.,  9.,  3.],
         [ 6.,  1.,  4.,  4.,  1.,  0.],
         [ 0.,  4.,  8.,  6.,  0.,  5.],
         [ 9.,  0.,  6.,  0.,  5.,  2.],
         [ 6.,  8.,  4.,  6.,  3.,  7.],
         [ 3.,  6.,  3.,  8.,  7.,  2.]])

>>> d1 = nnma(d)     # call nnma, passing in the original data matrix

>>> d1    # the approximated data matrix with all missing values populated

   array([[ 6.998,  0.29 ,  3.987,  7.008,  0.292,  0.796],
          [ 2.989,  8.92 ,  6.994,  3.02 ,  1.277,  7.053],
          [ 4.007,  4.496,  2.999,  7.01 ,  3.107,  8.695],
          [ 4.005,  8.019,  0.254,  9.002,  1.917,  0.89 ],
          [ 5.998,  3.014,  9.001,  4.991,  8.983,  3.052],
          [ 5.992,  1.077,  4.007,  3.976,  0.753,  0.464],
          [ 0.346,  3.436,  7.993,  5.988,  0.194,  5.355],
          [ 9.001,  0.124,  5.997,  0.375,  5.02 ,  1.867],
          [ 6.   ,  7.994,  3.998,  6.   ,  2.999,  7.009],
          [ 2.995,  6.022,  3.001,  7.987,  6.939,  2.185]])

So as you can see, the results aren't too bad, particularly for a very simple implementation. All of the missing items are populated, and the remainder of the values are pretty close to the corresponding value from the original data matrix, e.g., column 0, row 0 is 7.0 in the original data matrix, and 6.998 in the approximated one.
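
Once the matrix has been densified, the kNN step from the question becomes straightforward: each row of the approximated matrix is a complete user vector, so you can rank the other rows by distance and average the k closest. A minimal sketch of that step (the function name and the use of Euclidean distance are my own choices, not part of the answer above):

import numpy as NP

def knn_rows(dense, user_idx, k=3):
    """ return the mean of the k rows of `dense` closest (Euclidean
        distance) to row `user_idx`, as a simple prediction vector """
    target = dense[user_idx]
    dists = [(NP.linalg.norm(dense[i] - target), i)
             for i in range(dense.shape[0]) if i != user_idx]
    nearest = [i for _, i in sorted(dists)[:k]]
    return dense[nearest].mean(axis=0)

# e.g., predicted scores for the user in row 0 of the approximated matrix d1:
# knn_rows(d1, 0, k=3)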

answered Oct 29 '22 by doug


The piece you're missing is the method for measuring distances. The Pearson correlation is one of the most widely used methods. The Cosine distance is another one. The L1 distance (sum of absolute differences) usually doesn't give good results.

If you google around you will find the recommended way of dealing with missing values for whichever similarity measure you use. For example, with Pearson only the books rated in common by two users are used to measure the correlation, so the missing values are simply ignored. This makes sense: if only a small proportion of the books read by two users are in common, that most likely implies they have different tastes. With the Cosine distance, the missing values can be assumed to be zero.
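
To make the Pearson variant concrete, here is a minimal sketch (my own naming; np.nan stands in for the "?" cells from the question's table) that computes the correlation over only the books both users have rated:

import numpy as np

def pearson_sim(u, v):
    """ Pearson correlation computed only over the books both users rated
        (np.nan marks a missing rating); missing values are simply ignored """
    mask = ~np.isnan(u) & ~np.isnan(v)           # books rated by both users
    if mask.sum() < 2:                           # not enough overlap to correlate
        return 0.0
    uc, vc = u[mask] - u[mask].mean(), v[mask] - v[mask].mean()
    denom = np.sqrt((uc**2).sum() * (vc**2).sum())
    return float((uc * vc).sum() / denom) if denom else 0.0

# the table from the question, with np.nan for each "?"
R = np.array([[5, np.nan, 3, np.nan, 4, 3, 2],
              [3, 4, np.nan, 2, 3, 4, 2],
              [4, 2, 1, np.nan, np.nan, 3, 3],
              [2, 5, 3, np.nan, 4, 1, 1],
              [1, 1, 4, 3, 1, np.nan, 1],
              [5, 2, 5, 4, 4, 2, np.nan]])

# similarity of User 1 (row 0) to every other user
sims = [pearson_sim(R[0], R[i]) for i in range(1, len(R))]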

The other commonly used approach is to impute the missing values. You could, for example, first use Pearson to find the similarity between books and then, for each person, predict the missing ratings.
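
A sketch of that imputation idea, reusing the pearson_sim helper from the sketch above (the column-wise similarity loop and the clipping to positive similarities are my own simplifications, not a prescription from this answer):

import numpy as np

def impute_item_based(R):
    """ fill each missing rating with a similarity-weighted average of the
        user's other ratings, weighted by item-item Pearson similarity """
    n_items = R.shape[1]
    # item-item similarities over the users who rated both books
    S = np.array([[pearson_sim(R[:, i], R[:, j]) for j in range(n_items)]
                  for i in range(n_items)])
    filled = R.copy()
    for u in range(R.shape[0]):
        rated = ~np.isnan(R[u])
        for i in np.where(~rated)[0]:
            w = np.clip(S[i, rated], 0, None)    # negative similarities contribute nothing
            if w.sum() > 0:
                filled[u, i] = (w * R[u, rated]).sum() / w.sum()
    return filled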

answered Oct 29 '22 by fireant