
What algorithm does StackOverflow use for finding similar questions? [duplicate]

Tags:

algorithm

I need to create a help desk for customers on a website I'm building, and I love the way StackOverflow finds similar questions. Does anyone know what algorithm the site uses, and can you provide any references where I can find one?

DreamWave asked Apr 24 '13 15:04



2 Answers

There is a whole branch of machine learning called clustering (a type of unsupervised learning) that deals with this kind of problem.

The question becomes part of a cluster, and other questions in the same cluster (probably ordered by a similarity or distance measure) are displayed as similar questions.
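To make the "ordered by similarity" idea concrete, here is a minimal sketch. It is an illustration only, not StackOverflow's actual algorithm, and it swaps full clustering for a simpler nearest-neighbour ranking: each question is turned into a TF-IDF vector and candidates are sorted by cosine similarity to the new question. The function names and the sample help-desk questions are made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF, so rare terms weigh more than common ones.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, questions, top_n=3):
    """Return up to top_n (score, question) pairs, most similar first."""
    # The query is vectorized together with the corpus so both share one IDF table.
    vectors = tfidf_vectors(questions + [query])
    query_vec = vectors[-1]
    scored = [(cosine(query_vec, v), q) for v, q in zip(vectors, questions)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    existing = [
        "How do I reset my password?",
        "Where can I download my invoice?",
        "I forgot my password and cannot log in",
    ]
    for score, question in most_similar("password reset not working", existing):
        print(f"{score:.3f}  {question}")
```

For a real help desk you would likely want stop-word removal and stemming on top of this, and a library implementation rather than hand-rolled vectors once the corpus grows.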

There are various features that it can use for clustering, some of which may be:

  • Tags
  • Words in heading
  • Words in the text (lesser weight than heading)
  • Links to other questions/webpages.

and so on.
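As a rough sketch of how features like these might be combined, the snippet below scores two questions by a weighted sum of set overlaps on tags, title words and body words, with the body weighted lower than the title. The weights, the dictionary layout of a question and the function names are all assumptions chosen for illustration; nothing here is taken from StackOverflow itself.

```python
import re


def words(text):
    """Return the set of lowercase alphanumeric word tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items / all items (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical weights: tags count most, body text least. Tune these on your own data.
WEIGHTS = {"tags": 0.5, "title": 0.3, "body": 0.2}


def similarity(q1, q2, weights=WEIGHTS):
    """Weighted similarity of two questions given as dicts with tags, title and body."""
    return (
        weights["tags"] * jaccard(set(q1["tags"]), set(q2["tags"]))
        + weights["title"] * jaccard(words(q1["title"]), words(q2["title"]))
        + weights["body"] * jaccard(words(q1["body"]), words(q2["body"]))
    )


if __name__ == "__main__":
    new_question = {
        "tags": ["php", "mysql"],
        "title": "Full text search in MySQL",
        "body": "How do I run a full text search?",
    }
    old_question = {
        "tags": ["mysql"],
        "title": "MySQL full text search is slow",
        "body": "My search query is slow",
    }
    print(similarity(new_question, old_question))
```

Ranking every stored question by this score against a new one gives a simple "similar questions" list; links to other questions could be added as a fourth overlap term in the same way.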

There may also be derived features, built with techniques like text summarization or sentiment analysis, that are used for this kind of problem. Which features work well depends on the problem at hand.

Other areas where you see these algorithms in action are:

  • YouTube
  • Wikipedia
  • IMDb

and the list goes on.

So what can you do about your problem?

There is no single answer. It all depends on your data and the queries you want to support. Still, you can:

  • Learn feature engineering aspects of machine learning.
  • Learn about clustering.

(There are many online courses for these.)

Or

  • Hire a person who knows this stuff.
Sailesh answered Oct 12 '22 05:10


Most likely a weighted match on tags, and perhaps a match() or equivalent weighted full-text search on the title.

The details are probably described somewhere on Meta or in the FAQ.

Dave answered Oct 12 '22 07:10