How to build a 'related questions' engine?

Question

One of our bigger sites has a section where users can send questions to the website owner, which his staff evaluate personally. When the same question comes up very often, they can add it to the FAQ.

To keep them from receiving dozens of similar questions a day, we would like to provide a feature similar to 'Related questions' on this site (Stack Overflow).

What ways are there to build this kind of feature? I know that I should somehow evaluate the question and compare it to the questions in the FAQ, but how does this comparison work? Are keywords extracted, and if so, how?

It might be worth mentioning that this site is built on the LAMP stack, so these are the technologies available.

Thanks!

Ben · Accepted Answer

If you wanted to build something like this yourself from scratch, you'd use something called TF-IDF: term frequency / inverse document frequency. To simplify it enormously, that means you find the words in the query that are uncommon in the corpus as a whole, then find the documents that contain those words.

In other words, if someone enters a query with the words "I want to buy an elephant" in it, then of the words in the query, the word "elephant" is probably the least common word in your corpus. "Buy" is probably next. So you rank documents (in your case, previous queries) by how much they contain the word "elephant" and then how much they contain the word "buy". The words "I", "to" and "an" are probably in a stop-list, so you ignore them altogether. You rank each document (previous query, in your case) by how many matching words there are (weighting according to inverse document frequency -- i.e. high weight for uncommon words) and show the top few.
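The ranking described above can be sketched in a few lines. This is only a minimal illustration, written in Python for brevity rather than in whatever language the site actually uses; the stop-list, the sample FAQ entries, and the function names (`tokenize`, `build_idf`, `related`) are all made up for the example, and a production version would likely add stemming and a proper index.

```python
import math
import re
from collections import Counter

# Hypothetical stop-list; a real one would be much longer.
STOP_WORDS = {"i", "to", "an", "a", "the", "is", "do", "you",
              "how", "can", "my", "of", "for"}

def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]

def build_idf(documents):
    """Map each word to log(N / number of documents containing it)."""
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {word: math.log(n / df) for word, df in doc_freq.items()}

def related(query, documents, idf, top=3):
    """Rank documents by term count times IDF, summed over the query words."""
    query_words = set(tokenize(query))
    scored = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        score = sum(counts[w] * idf.get(w, 0.0) for w in query_words)
        if score > 0:
            scored.append((score, doc))
    # Highest score first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top]]

# Made-up FAQ entries standing in for previously asked questions.
faq = [
    "Where can I buy an elephant?",
    "How do I buy a gift card?",
    "Can I buy tickets online?",
    "What are your opening hours?",
]
idf = build_idf(faq)
results = related("I want to buy an elephant", faq, idf)
```

With this sample data, "elephant" appears in one entry and "buy" in three, so "elephant" gets the higher weight and the elephant question ranks first. The opening-hours entry shares no words with the query and is left out.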

I've oversimplified, and you'd need to read up on this to get it right, but it's really not terribly complicated to implement in a simple way. The Wikipedia page might be a good place to start:

http://en.wikipedia.org/wiki/Tf%E2%80%93idf

Mark Byers · Answer

I don't know how Stack Overflow works, but I guess that it uses the tags to find related questions. For example, on this question the top few related questions all have the tag recommendation-engine. I would guess that matches on rarer tags count for more than matches on common tags.
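As a rough illustration of that guess (and not a description of how Stack Overflow actually does it), shared tags could be weighted by how rare they are. The question ids, tag lists, and function names below are invented for the example, and Python is used only to keep the sketch short.

```python
import math

def tag_weights(questions):
    """Weight each tag by log(N / number of questions carrying it)."""
    n = len(questions)
    counts = {}
    for tags in questions.values():
        for tag in set(tags):
            counts[tag] = counts.get(tag, 0) + 1
    return {tag: math.log(n / c) for tag, c in counts.items()}

def related_by_tags(target_id, questions, weights, top=3):
    """Rank other questions by the summed weight of the tags they share."""
    target = set(questions[target_id])
    scored = []
    for qid, tags in questions.items():
        if qid == target_id:
            continue
        score = sum(weights[t] for t in target & set(tags))
        if score > 0:
            scored.append((score, qid))
    # Highest score first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [qid for _, qid in scored[:top]]

# Made-up questions and tags.
questions = {
    "q1": ["php", "mysql", "recommendation-engine"],
    "q2": ["php", "mysql"],
    "q3": ["python", "recommendation-engine"],
    "q4": ["php", "arrays"],
    "q5": ["php", "sessions"],
}
weights = tag_weights(questions)
related_ids = related_by_tags("q1", questions, weights)
```

Here "php" is on four of the five questions, so sharing it counts for little, while sharing "recommendation-engine" or "mysql" counts for more. As a result, q3 (one rare tag in common) ranks above q4 (one common tag in common).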

You might also want to look at term frequency–inverse document frequency.

Tags: php, mysql, lamp, recommendation-engine

ChrisR
