The related questions that appear after entering the title, and those that are in the right side bar when viewing a question seem to suggest very apt questions.
According to Spolsky in a talk, Stack Overflow only does a SQL search for them and uses no special algorithms.
What algorithms exist to give good answers in such a case? How do you do a database search in such a case? Do you make the title searchable and search on the keywords, or search on tags and put the questions with many votes on top?
If you listen to Stack Overflow podcast 32 (unfortunately the transcript doesn't have much in it), you can hear Jeff Atwood say a little about how he does it.
It seems like the algorithm is something like:
More details about the full text search can be found here: http://msdn.microsoft.com/en-us/library/ms142571.aspx
This may be out of date by now - they were talking about moving to a better/faster full text search such as Lucene, and I vaguely remember Jeff saying in the podcast that this had been done.
The related questions sidebar is probably built on the tags for each question (most likely by ranking candidates on tag overlap, so 5 tags in common > 4 tags in common, etc.).
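To make the tag-overlap idea concrete, here is a minimal sketch in Python. It is only a guess at the general approach, not Stack Overflow's actual code; the function name `rank_by_tag_overlap` and the sample questions are made up for illustration.

```python
def rank_by_tag_overlap(target_tags, candidates):
    """Rank candidate questions by how many tags they share with the target.

    candidates: dict mapping a question title to an iterable of its tags.
    Returns titles with at least one shared tag, most shared tags first.
    """
    target = set(target_tags)
    scored = []
    for title, tags in candidates.items():
        overlap = len(target & set(tags))
        if overlap > 0:
            scored.append((overlap, title))
    # Highest overlap first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]


# Hypothetical sample data.
questions = {
    "How do I parse JSON in Python?": ["python", "json", "parsing"],
    "Fastest JSON library for Python": ["python", "json", "performance"],
    "Sorting a list in Java": ["java", "sorting"],
    "Python list comprehension syntax": ["python", "list"],
}

related = rank_by_tag_overlap(["python", "json", "unicode"], questions)
```

A real site would presumably combine this with votes or text similarity to break ties, but that part is speculation.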
The rest will be built on heuristics and algorithms suitable for natural language processing. These normally aren't very good on general-purpose language, but most of them are VERY good once the vocabulary is reduced to a single technical area such as programming.
Have a look at Porter stemming for a stemming algorithm if you are looking to get into "related" algorithms.
A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word, "fish".
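As a rough illustration of what a stemmer does, here is a deliberately crude suffix stripper in Python. This is not the Porter algorithm (which has many more rules and conditions); the suffix list and the `crude_stem` name are assumptions chosen just to cover the examples above.

```python
# Suffixes to strip, checked in this order (a simplification, not Porter's rules).
SUFFIXES = ("ing", "ed", "er", "es", "s")


def crude_stem(word):
    """Strip one common English suffix and undo a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if at least three characters of stem would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # "stemming" -> "stemm" -> "stem": drop one of a doubled consonant.
            if stem[-1] == stem[-2] and stem[-1] not in "aeiouslz":
                stem = stem[:-1]
            return stem
    return word


stems = [crude_stem(w) for w in ("fishing", "fished", "fish", "fisher")]
```

It handles "cats" and the "stem" family as well, but not "catlike" or "catty"; a real stemmer (Porter or otherwise) is the better choice for anything beyond a demo.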
Once you have processed a document and done stemming on it, you can index the stemmed words by count and then compare against other documents. This is the most basic approach to tackling this problem.
Also take care to ignore stop words like "the", "an", "of" etc.
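Putting those steps together (drop stop words, stem, count, compare), a minimal sketch might look like the following. Cosine similarity is my choice of comparison here, since the answer above does not name one; the stop-word list and the tiny `crude_stem` helper are placeholders, and you would normally swap in a proper Porter stemmer.

```python
import math
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "and", "for", "on",
              "how", "do", "i"}


def crude_stem(word):
    """Placeholder stemmer: strips one suffix. Not the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_counts(text):
    """Lowercase, tokenize, drop stop words, stem, and count the terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(w) for w in words if w not in STOP_WORDS)


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity(text_a, text_b):
    return cosine_similarity(term_counts(text_a), term_counts(text_b))


close = similarity("Parsing JSON files in Python",
                   "How do I parse a JSON file with Python")
far = similarity("Parsing JSON files in Python",
                 "Sorting linked lists in Java")
```

With these inputs `close` comes out higher than `far`, which shares no terms with the first title. Note the crude stemmer leaves "parsing" and "parse" as different terms, which is exactly the kind of miss a real stemmer fixes.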
This post will help you: "Is there an algorithm that tells the semantic similarity of two phrases".