I'm looking for a library, or an algorithm I can implement myself, for finding the k nearest neighbours of a webpage, where each webpage is represented as a set of keywords. I have already written the part that extracts the keywords.

It doesn't have to be very good, just good enough.

Can anyone suggest a solution, or a place to start? I have looked through lectures by Yury Lifshits in the past, but I am hoping to find something ready-made if possible.

Java libraries preferred.

asked Nov 05 '22 05:11
#### Ankur

As you said, you have already extracted the keywords from each page. I am assuming that you represent each document/page as a vector of word counts, so that the whole collection forms something like a document-term frequency matrix.

The nearest neighbour of a page should ideally be a page with similar content, so you want to find documents in which the relative frequency of each word is close to that of the page you are searching for. First, normalize the document-term matrix row by row, i.e. replace each occurrence count with the word's share of all word occurrences in that document.
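A minimal sketch of that row normalization in Java, assuming each page is stored as a map from keyword to raw count. The class name `TermFrequency` and the method `normalize` are hypothetical names chosen for illustration, not part of any library.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: converts raw keyword counts into relative frequencies. */
class TermFrequency {

    /** Returns a new map in which every count is divided by the page's total count. */
    static Map<String, Double> normalize(Map<String, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<String, Double> relative = new HashMap<>();
        if (total == 0) {
            return relative; // empty page: nothing to normalize
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            relative.put(e.getKey(), e.getValue() / (double) total);
        }
        return relative;
    }
}
```

For a page with counts `{java=3, knn=1}` this gives `{java=0.75, knn=0.25}`, so a long page and a short page with the same word proportions end up with the same vector.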

Next, you have to define a distance between two documents represented by these vectors. You can use the ordinary Euclidean distance or the Manhattan distance. For text documents, however, the measure that usually works best is cosine similarity. Use whichever distance or similarity function suits your problem (remember that for nearest neighbours you want to minimize a distance but maximize a similarity).
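As an illustration, cosine similarity between two sparse keyword vectors could look roughly like this. It is a sketch under the assumption that each page is a `Map<String, Double>` of keyword weights; `CosineSimilarity` and `cosine` are made-up names for this example.

```java
import java.util.Map;

/** Hypothetical helper: cosine similarity between sparse keyword-weight vectors. */
class CosineSimilarity {

    /** Returns dot(a, b) / (|a| * |b|), or 0.0 if either vector is all zeros. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        // Only keywords present in both pages contribute to the dot product.
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // avoid division by zero for empty pages
        }
        return dot / (normA * normB);
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }
}
```

With non-negative weights the result lies between 0 (no shared keywords) and 1 (same direction). Because cosine similarity ignores vector length, it gives the same answer on raw counts and on relative frequencies.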

Once you have the vectors and your distance function in place, run the nearest-neighbour or k-nearest-neighbour algorithm.
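For a modest number of pages, a brute-force search may be good enough: score every candidate against the query and keep the top k. The sketch below assumes a similarity function is passed in (higher means closer); `BruteForceKnn` and `nearest` are hypothetical names, and a large collection would probably need an index instead of a full scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical helper: exhaustive k-nearest-neighbour search. */
class BruteForceKnn {

    /**
     * Returns up to k candidates, ordered from most to least similar to the query.
     * The similarity function is an assumption of this sketch: larger values mean closer.
     */
    static <T> List<T> nearest(T query, List<T> candidates, int k,
                               ToDoubleBiFunction<T, T> similarity) {
        int n = candidates.size();
        double[] scores = new double[n];
        Integer[] order = new Integer[n];
        // Score each candidate once so sorting does not recompute similarities.
        for (int i = 0; i < n; i++) {
            scores[i] = similarity.applyAsDouble(query, candidates.get(i));
            order[i] = i;
        }
        // Sort candidate indices by descending score.
        Arrays.sort(order, (i, j) -> Double.compare(scores[j], scores[i]));

        int limit = Math.min(Math.max(k, 0), n);
        List<T> result = new ArrayList<>(limit);
        for (int i = 0; i < limit; i++) {
            result.add(candidates.get(order[i]));
        }
        return result;
    }
}
```

To use a distance such as Euclidean or Manhattan instead, pass its negation as the similarity so that the smallest distance ranks first. If the query page is itself in the candidate list, it will come back as its own nearest neighbour and should be filtered out.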

answered Nov 13 '22 00:11
#### BiGYaN
