I want to find a library or an algorithm (so I write the code myself) for identifying the nearest k neighbours of a webpage, where the webpage is defined as being a set of keywords. I have already done the part where I extract the keywords.
It doesn't have to be very good, just good enough.
Can anyone suggest a solution, or where to start. I have looked through lectures by Yury Lifshits in the past, but I am hoping to get something ready-made if possible.
Java libraries preferred.
As you said, you already have the keywords extracted from a page. I am assuming that you represent each document/page by a vector of words. Something like a document term-frequency matrix.
I guess the nearest neighbour of a page is ideally a page with similar contents. So you'd like to find documents where the relative frequency of each word is similar to the one you are searching for. So first normalize the doc-term matrix WRT each row; i.e. replace the occurrence count by %tage occurrence.
Next you have to assign some distance between 2 documents represented by these vectors. You can use the normal Euclidean distance or Manhattan Distance. However for text document the similarity measure that usually works best is Cosine Similarity. Use whatever distance or similarity function suits your problem (remember for nearest neighbour you want to minimize the distance; but maximize similarity).
Once you have the vectors and your distance function in place, run the Nearest neighbour or the K-Nearest neighbour algorithm.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With