Keyword based nearest neighbour algorithm or library

I want to find a library, or an algorithm that I can implement myself, for identifying the k nearest neighbours of a webpage, where a webpage is defined as a set of keywords. I have already done the part where I extract the keywords.

It doesn't have to be very good, just good enough.

Can anyone suggest a solution, or where to start? I have looked through lectures by Yury Lifshits in the past, but I am hoping to get something ready-made if possible.

Java libraries preferred.

Ankur asked Nov 04 '22 22:11


1 Answer

As you said, you already have the keywords extracted from each page. I am assuming that you represent each document/page as a vector of word counts, so that the whole collection forms something like a document-term frequency matrix.

The nearest neighbour of a page is ideally a page with similar contents, so you want to find documents in which the relative frequency of each word is similar to that of the page you are searching for. First, normalize each row of the document-term matrix: replace each occurrence count with the word's share of all word occurrences in that document.
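As a rough illustration of that normalization step, here is a minimal Java sketch. It assumes each page's keywords are held as a `Map<String, Integer>` of raw counts; the class name `TermFrequency` and the method `normalize` are made up for this example and do not come from any library.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: turns one row of the document-term matrix
// (raw keyword counts for a single page) into relative frequencies.
final class TermFrequency {
    private TermFrequency() {}

    static Map<String, Double> normalize(Map<String, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<String, Double> frequencies = new HashMap<>();
        if (total == 0) {
            return frequencies; // a page with no keywords maps to an empty vector
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Each value becomes the keyword's share of all occurrences on the page.
            frequencies.put(e.getKey(), e.getValue() / (double) total);
        }
        return frequencies;
    }
}
```

After this step the values for each page sum to 1, so a long page and a short page with the same keyword mix end up with the same vector.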

Next, you have to define a distance between two documents represented by these vectors. You can use the ordinary Euclidean distance or the Manhattan distance. However, for text documents the measure that usually works best is cosine similarity. Use whichever distance or similarity function suits your problem (remember that for nearest neighbours you want to minimize a distance but maximize a similarity).
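To make the cosine similarity concrete, here is a small Java sketch for sparse keyword vectors. It assumes each page is a `Map<String, Double>` from keyword to weight, with missing keywords treated as zero; the class and method names are invented for illustration. Cosine similarity only looks at the angle between the vectors, so it gives the same result whether the weights are raw counts or relative frequencies.

```java
import java.util.Map;

// Hypothetical helper: cosine similarity between two sparse keyword vectors.
final class CosineSimilarity {
    private CosineSimilarity() {}

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        // Only keywords present in both vectors contribute to the dot product,
        // so iterating over the smaller map is enough.
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // convention chosen here: an empty vector is similar to nothing
        }
        return dot / (normA * normB);
    }

    // Euclidean length of a sparse vector.
    private static double norm(Map<String, Double> v) {
        double sumOfSquares = 0.0;
        for (double x : v.values()) {
            sumOfSquares += x * x;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

With non-negative weights the result lies between 0 (no shared keywords) and 1 (same keyword proportions). If you need a distance rather than a similarity, `1 - cosine(a, b)` is a common choice.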

Once you have the vectors and your distance function in place, run the nearest neighbour or k-nearest neighbour algorithm.
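The simplest way to run k-nearest neighbours is a brute-force scan: score every page against the query and keep the k best. The sketch below is one possible version, not a tuned implementation. It assumes Java 9 or later (for `Map.entry`), that pages are stored in a map from a page id to whatever representation you chose, and that you pass in a similarity function where higher means closer (for example, a cosine similarity). The name `BruteForceKnn` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper: brute-force k-nearest-neighbour search.
final class BruteForceKnn {
    private BruteForceKnn() {}

    /** Returns the ids of the k pages most similar to the query, most similar first. */
    static <V> List<String> nearest(V query, Map<String, V> pages, int k,
                                    ToDoubleBiFunction<V, V> similarity) {
        // Score every page against the query.
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, V> page : pages.entrySet()) {
            double score = similarity.applyAsDouble(query, page.getValue());
            scored.add(Map.entry(page.getKey(), score));
        }
        // Sort by similarity, highest first. For a distance function,
        // drop reversed() so that the smallest values come first.
        scored.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, scored.size()); i++) {
            result.add(scored.get(i).getKey());
        }
        return result;
    }
}
```

This does one similarity computation per page and then sorts, which should be fine for "good enough" on a few thousand pages. If the query page is itself in the collection, it will come back as its own best match, so either exclude it or ask for k + 1 results.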

BiGYaN answered Nov 12 '22 17:11