Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to implement a "related" degree measure algorithm?

I was going to Ask a Question earlier today when I was presented to a surprising functionality in Stackoverflow. When I wrote my question title stackoverflow suggested me several related questions and I found out that there was already two similar questions. That was stunning!

Then I started thinking how I would implement such function. How I would order questions by relatedness:

  1. Question that have higher number of words matchs with the new question
  2. If the number of matchs are the same, the order of words is considered
  3. Words that appears in the title has higher relevancy

That would be a simple workflow or a complex score algortithm? Some stemming to increase the recall, maybe? Is there some library the implements this function? What other aspects would you consider? Maybe Jeff could answer himself! How did you implemented this in Stackoverflow? :)

like image 401
Marcio Aguiar Avatar asked Sep 03 '08 20:09

Marcio Aguiar


2 Answers

One such way to implement such an algorithm would involve ranking the questions as per a heuristic function which assigns a 'relevance' weight factor using the following steps:

  1. Apply a noise filter to the 'New' question to remove words that are common across a large number of objects such as: 'the', 'and', 'or', etc.
  2. Get the number of words contained in the 'New' question which match the words the set of questions already posted on the website. [A]
  3. Get the number of tag matches between the words in the 'New' question and the available. [B]
  4. Compute the 'relevance weight' based on [A] and [B] as 'x[A] + y[B]', where x and y are weight multipliers (Assign a higher weight multiplier to [B] as tagging is more relevant than simple word search)
  5. Get the top 5 questions which have the highest 'relevance weight'.

The heuristic might require tweaking to get optimal results, but it should work.

like image 199
Pascal Avatar answered Nov 16 '22 08:11

Pascal


Your question seems similar to this one, which has some additional answers.

like image 27
robaker Avatar answered Nov 16 '22 07:11

robaker