
Handling Incomplete Data (Data Sparsity) in kNN

I am trying to create a simple recommender system using kNN.

Let's say I have a table:

User | Book1 | Book2 | Book3 | Book4 | Book5 | Book6 | Book7 |
1    | 5     | ?     | 3     | ?     | 4     | 3     | 2     |
2    | 3     | 4     | ?     | 2     | 3     | 4     | 2     |
3    | 4     | 2     | 1     | ?     | ?     | 3     | 3     |
4    | 2     | 5     | 3     | ?     | 4     | 1     | 1     |
5    | 1     | 1     | 4     | 3     | 1     | ?     | 1     |
6    | 5     | 2     | 5     | 4     | 4     | 2     | ?     |

So, to find the possible scores for User 1, I was thinking I could just take the absolute difference between the books User 1 has rated and the corresponding ratings from the other users, then use that difference to work out which user is "closest" to User 1. But in a real-world situation there would be many more ?/unknown scores. So how do I deal with those unknown scores when using kNN?

I don't have any code, as I've yet to really understand how to implement this.

Any help is appreciated!

asked May 06 '12 by iCodeLikeImDrunk

2 Answers

You don't have "unknown features"; you have incomplete data points.

This is actually a well-known problem in kNN, and there is a thoroughly validated pattern for dealing with it.

Although the problem is actually an "incomplete data" problem, in the kNN context it's often (usually?) referred to as the sparsity problem.

In practice, the sparsity problem in building knn models is, with the possible exception of efficient storage/retrieval of the data that comprise the model, the crux of kNN.

For instance, consider Amazon.com's recommendation engine, in which product ratings (as user features) comprise the columns and users comprise the rows. For this matrix to be 100% complete, every Amazon customer would have to have purchased and reviewed every single product Amazon sells. The actual sparsity of this matrix must be > 95%.

The most common technique (and still state of the art as far as I know) is known as NNMA, or non-negative matrix approximation. This technique is also often referred to, incorrectly, as NNMF, in which F stands for factorization. (NNMA is based on a factorization technique, but the result is not a set of factors of the original data matrix.) I mention this because the alternate term, though incorrect, is widely used, so I would include it in your search engine queries.

In essence, this technique can be used to remove sparsity from a matrix, or, put another way, to populate the missing cells (i.e., the cell where the customer at row R has not reviewed the product in column C).

You can find a complete implementation of NNMA, including an accompanying tutorial (in Python + NumPy), on Albert Au Yeung Ching-man's blog.

Alternatively, there are several Python packages (available via PyPI) that contain packaged code for NNMA. I have only used one of these, PyMF, which you can find at Google Code.

So that you can see how NNMA works its magic, here is my simple but complete implementation of NNMA in Python + NumPy:

import numpy as NP

def cf(q, v):
    """ the cost function: squared Frobenius distance between q and v """
    qv = (q - v)**2
    return NP.sum(NP.sum(qv, axis=0))


def nnma(d, max_iter=100):
    """ approximate d (with 0s in the missing cells) as a product of two
        non-negative matrices w and h, and return that dense product """
    x, y = d.shape
    z = y                              # latent dimension (here simply set to y)
    w = NP.random.rand(x, y)           # random non-negative initialization
    h = NP.random.rand(y, z)
    for i in range(max_iter):
        wh = NP.dot(w, h)              # current approximation of d
        cost = cf(d, wh)
        if cost == 0:
            break
        # multiplicative update rules; they keep w and h non-negative
        hn = NP.dot(w.T, d)
        hd = NP.dot(NP.dot(w.T, w), h)
        h *= hn/hd
        wn = NP.dot(d, h.T)
        wd = NP.dot(NP.dot(w, h), h.T)
        w *= wn/wd
    return NP.dot(w, h)

To use this NNMA function, just pass in a 2D array (matrix) with a "0" for each missing cell (in other words, your data matrix, with a "0" inserted for each missing value):

>>> d    # the original (sparse) data matrix with missing cells denoted by "0"s

  array([[ 7.,  0.,  4.,  7.,  0.,  1.],
         [ 3.,  9.,  7.,  3.,  1.,  7.],
         [ 4.,  4.,  3.,  7.,  3.,  9.],
         [ 4.,  8.,  0.,  9.,  2.,  1.],
         [ 6.,  3.,  9.,  5.,  9.,  3.],
         [ 6.,  1.,  4.,  4.,  1.,  0.],
         [ 0.,  4.,  8.,  6.,  0.,  5.],
         [ 9.,  0.,  6.,  0.,  5.,  2.],
         [ 6.,  8.,  4.,  6.,  3.,  7.],
         [ 3.,  6.,  3.,  8.,  7.,  2.]])

>>> d1 = nnma(d)     # call nnma, passing in the original data matrix

>>> d1    # the approximated data matrix with all missing values populated

   array([[ 6.998,  0.29 ,  3.987,  7.008,  0.292,  0.796],
          [ 2.989,  8.92 ,  6.994,  3.02 ,  1.277,  7.053],
          [ 4.007,  4.496,  2.999,  7.01 ,  3.107,  8.695],
          [ 4.005,  8.019,  0.254,  9.002,  1.917,  0.89 ],
          [ 5.998,  3.014,  9.001,  4.991,  8.983,  3.052],
          [ 5.992,  1.077,  4.007,  3.976,  0.753,  0.464],
          [ 0.346,  3.436,  7.993,  5.988,  0.194,  5.355],
          [ 9.001,  0.124,  5.997,  0.375,  5.02 ,  1.867],
          [ 6.   ,  7.994,  3.998,  6.   ,  2.999,  7.009],
          [ 2.995,  6.022,  3.001,  7.987,  6.939,  2.185]])

So as you can see, the results aren't too bad, particularly for a very simple implementation. All of the missing items are populated, and the remainder of the values are pretty close to the corresponding value from the original data matrix, e.g., column 0, row 0 is 7.0 in the original data matrix, and 6.998 in the approximated one.
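
Once the matrix has been densified, the kNN step from the question becomes straightforward: each row of the approximated matrix is a complete user vector, so you can rank the other rows by distance and average the k closest. A minimal sketch of that step (the function name and the use of Euclidean distance are my own choices, not part of the answer above):

import numpy as NP

def knn_rows(dense, user_idx, k=3):
    """ return the mean of the k rows of `dense` closest (Euclidean
        distance) to row `user_idx`, as a simple prediction vector """
    target = dense[user_idx]
    dists = [(NP.linalg.norm(dense[i] - target), i)
             for i in range(dense.shape[0]) if i != user_idx]
    nearest = [i for _, i in sorted(dists)[:k]]
    return dense[nearest].mean(axis=0)

# e.g., predicted scores for the user in row 0 of the approximated matrix d1:
# knn_rows(d1, 0, k=3)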

answered Oct 29 '22 by doug


The piece you're missing is the method for measuring distances. The Pearson correlation is one of the most widely used methods. The Cosine distance is another one. The L1 distance (sum of absolute differences) usually doesn't give good results.

If you google around you will find the recommended way of dealing with missing values for whichever similarity measure you use. For example, with Pearson only the books rated in common by two users are used to measure the correlation, so the missing values are simply ignored. This makes sense: if only a small proportion of the books read by two users are in common, that most likely implies they have different tastes. With the Cosine distance, the missing values can be assumed to be zero.
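
To make the Pearson variant concrete, here is a minimal sketch (my own naming; np.nan stands in for the "?" cells from the question's table) that computes the correlation over only the books both users have rated:

import numpy as np

def pearson_sim(u, v):
    """ Pearson correlation computed only over the books both users rated
        (np.nan marks a missing rating); missing values are simply ignored """
    mask = ~np.isnan(u) & ~np.isnan(v)           # books rated by both users
    if mask.sum() < 2:                           # not enough overlap to correlate
        return 0.0
    uc, vc = u[mask] - u[mask].mean(), v[mask] - v[mask].mean()
    denom = np.sqrt((uc**2).sum() * (vc**2).sum())
    return float((uc * vc).sum() / denom) if denom else 0.0

# the table from the question, with np.nan for each "?"
R = np.array([[5, np.nan, 3, np.nan, 4, 3, 2],
              [3, 4, np.nan, 2, 3, 4, 2],
              [4, 2, 1, np.nan, np.nan, 3, 3],
              [2, 5, 3, np.nan, 4, 1, 1],
              [1, 1, 4, 3, 1, np.nan, 1],
              [5, 2, 5, 4, 4, 2, np.nan]])

# similarity of User 1 (row 0) to every other user
sims = [pearson_sim(R[0], R[i]) for i in range(1, len(R))]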

The other commonly used approach is to impute the missing values. You could, for example, first use Pearson to find the similarity between books and then, for each person, predict the missing ratings.
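
A sketch of that imputation idea, reusing the pearson_sim helper from the sketch above (the column-wise similarity loop and the clipping to positive similarities are my own simplifications, not a prescription from this answer):

import numpy as np

def impute_item_based(R):
    """ fill each missing rating with a similarity-weighted average of the
        user's other ratings, weighted by item-item Pearson similarity """
    n_items = R.shape[1]
    # item-item similarities over the users who rated both books
    S = np.array([[pearson_sim(R[:, i], R[:, j]) for j in range(n_items)]
                  for i in range(n_items)])
    filled = R.copy()
    for u in range(R.shape[0]):
        rated = ~np.isnan(R[u])
        for i in np.where(~rated)[0]:
            w = np.clip(S[i, rated], 0, None)    # negative similarities contribute nothing
            if w.sum() > 0:
                filled[u, i] = (w * R[u, rated]).sum() / w.sum()
    return filled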

answered Oct 29 '22 by fireant