
How to find out weights of attributes in K-nearest neighbors algorithm?

I have such code in python with dataset of house prices:

from sklearn.datasets import load_boston
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import scale

boston = load_boston()
y = boston.target
X = scale(boston.data)
knn = KNeighborsRegressor(n_neighbors=5, weights='distance', metric='minkowski', p=1)
knn.fit(X, y)

And now I can predict the target attribute, in this case the price:

knn.predict([[-0.41771335,  0.28482986, -1.2879095 , ..., -1.45900038,
     0.44105193, -1.0755623 ]])

As I understand it, this algorithm should find weights for each attribute to build a distance function. Where can I find the computed weight of each attribute? I want to know which attribute has the strongest correlation with house price.

Timrael asked Feb 08 '23

1 Answer

You would specify feature weights via the metric argument - the weights argument controls how neighbors are weighted, not features.

First off, your question rests on a slight misconception. The algorithm doesn't find a distance function - you supply it with a metric in which to compute distances, and a function to compute neighbor weights from those distances. With metric='minkowski' and p=1, as in your code, you are using the Manhattan (L1) distance; the default, p=2, would be the good old Euclidean distance (see the docs).

Neighbor weights are computed as the inverse of distance (also described in the docs), so you can find the k neighbors of a given point with the built-in kneighbors method and compute their weights manually:

import numpy as np

test = [[np.random.uniform(-1, 1) for _ in range(len(X[0]))]]

# kneighbors returns the distances first, then the neighbor indices
distances, indices = knn.kneighbors(test)
for d in distances[0]:
    weight = 1.0 / d
    print(weight)
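To sanity-check that inverse distance is exactly what weights='distance' uses, you can reproduce a prediction by hand. A minimal, self-contained sketch - it uses a synthetic dataset in place of the Boston data, since load_boston has been removed from recent scikit-learn releases:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import scale

# Synthetic stand-in for the Boston data
X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)
X = scale(X)

knn = KNeighborsRegressor(n_neighbors=5, weights='distance', metric='minkowski', p=1)
knn.fit(X, y)

test = X[:1] + 0.01  # a query point close to (but not on) a training sample

# kneighbors returns the distances first, then the neighbor indices
distances, indices = knn.kneighbors(test)
weights = 1.0 / distances[0]

# weights='distance' means: inverse-distance-weighted average of the neighbors' targets
manual = np.sum(weights * y[indices[0]]) / np.sum(weights)
print(manual, knn.predict(test)[0])  # the two values agree
```

The small offset on the query point matters: at zero distance the inverse-distance weight blows up, and scikit-learn handles that case specially.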

The problem is that all features enter into the calculation of d with equal weight. With p=1 (the Manhattan distance) d is

1*|x1_neighbor - x1_test| + 1*|x2_neighbor - x2_test| + ...

(with p=2, the Euclidean distance, it would be the square root of the sum of the squared differences instead).

This is because the plain Minkowski metric gives every feature a coefficient of 1. If you want different feature weights, you can specify an alternate metric.

However, if you just want a quick and dirty way of telling how important the various features are, a typical approach is to randomly permute all values of feature i and see how much that hurts the performance of the regressor. You can read more about that here.
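The permutation trick described above is built into scikit-learn as sklearn.inspection.permutation_importance (available since version 0.22). A minimal sketch, again with a synthetic dataset standing in for the Boston data:

```python
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import scale

# Synthetic data: only 3 of the 6 features actually drive the target
X, y = make_regression(n_samples=300, n_features=6, n_informative=3,
                       noise=1.0, random_state=0)
X = scale(X)

knn = KNeighborsRegressor(n_neighbors=5, weights='distance', metric='minkowski', p=1)
knn.fit(X, y)

# Shuffle each feature in turn and measure the resulting drop in R^2 score
result = permutation_importance(knn, X, y, n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print("feature %d: importance %.3f" % (i, imp))
```

Features whose permutation barely changes the score contribute little to the distance-based predictions; the informative ones show a clearly larger drop.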

bjarkemoensted answered Feb 14 '23