Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What data do I need to implement k nearest neighbor?

I currently have a reddit-clone type website. I'm trying to recommend posts based on the posts that my users have previously liked.

It seems like K nearest neighbor or k means are the best way to do this.

I can't seem to understand how to actually implement this. I've seen some mathematical formulas (such as the one on the k means wikipedia page), but they don't really make sense to me.

Could someone maybe recommend some pseudo code, or places to look so I can get a better feel on how to do this?

like image 614
icco Avatar asked Jun 01 '11 18:06

icco


People also ask

Which is the first step in K nearest neighbor learning algorithm?

Step 1 − For implementing any algorithm, we need dataset. So during the first step of KNN, we must load the training as well as test data. Step 2 − Next, we need to choose the value of K i.e. the nearest data points. K can be any integer.


2 Answers

K-Nearest Neighbor (aka KNN) is a classification algorithm.

Basically, you take a training group of N items and classify them. How you classify them is completely dependent on your data, and what you think the important classification characteristics of that data are. In your example, this may be category of posts, who posted the item, who upvoted the item, etc.

Once this 'training' data has been classified, you can then evaluate an 'unknown' data point. You determine the 'class' of the unknown by locating the nearest neighbors to it in the classification system. If you determine the classification by the 3 nearest neighbors, it could then be called a 3-nearest neighboring algorithm.

How you determine the 'nearest neighbor' depends heavily on how you classify your data. It is very common to plot the data into N-dimensional space where N represents the number of different classification characteristics you are examining.

A trivial example:

Let's say you have the longitude/latitude coordinates of a location that can be on any landmass anywhere in the world. Let us also assume that you do not have a map, but you do have a very large data set that gives you the longitude/latitude of many different cities in the world, and you also know which country those cities are in.

If I asked you which country the a random longitude latitude point is in, would you be able to figure it out? What would you do to figure it out?

Longitude/latitude data falls naturally into an X,Y graph. So, if you plotted out all the cities onto this graph, and then the unknown point, how would you figure out the country of the unknown? You might start drawing circles around that point, growing increasingly larger until the circle encompasses the 10 nearest cities on the plot. Now, you can look at the countries of those 10 cities. If all 10 are in the USA, then you can say with a fair degree of certainty that your unknown point is also in the USA. But if only 6 cities are in the USA, and the other 4 are in Canada, can you say where your unknown point is? You may still guess USA, but with less certainty.

The toughest part of KNN is figuring out how to classify your data in a way that you can determine 'neighbors' of similar quality, and the distance to those neighbors.

like image 136
Matt Avatar answered Sep 18 '22 01:09

Matt


What you described sounds like a recommender system engine, not a clustering algorithm like k-means which in essence is an unsupervised approach. I cannot make myself a clear idea of what reddit uses actually, but I found some interesting post by googling around "recommender + reddit", e.g. Reddit, Stumbleupon, Del.icio.us and Hacker News Algorithms Exposed! Anyway, the k-NN algorithm (described in the top ten data mining algorithm, with pseudo-code on Wikipedia) might be used, or other techniques like Collaborative filtering (used by Amazon, for example), described in this good tutorial.

like image 39
chl Avatar answered Sep 17 '22 01:09

chl