Algorithm to find articles with similar text

Tags:

I have many articles in a database (with title,text), I'm looking for an algorithm to find the X most similar articles, something like Stack Overflow's "Related Questions" when you ask a question.

I tried googling for this but only found pages about other "similar text" issues, something like comparing every article with all the others and storing a similarity somewhere. SO does this in "real time" on text that I just typed.

How?

205

asked Oct 29 '08 14:10

Osama Al-Maadeed

2 Answers

Edit distance isn't a likely candidate, as it would be spelling/word-order dependent, and much more computationally expensive than Will is leading you to believe, considering the size and number of the documents you'd actually be interested in searching.

Something like Lucene is the way to go. You index all your documents, and then when you want to find documents similar to a given document, you turn your given document into a query, and search the index. Internally Lucene will be using tf-idf and an inverted index to make the whole process take an amount of time proportional to the number of documents that could possibly match, not the total number of documents in the collection.

answered Oct 10 '22 06:10

Jay Kominek

It depends upon your definition of similiar.

The edit-distance algorithm is the standard algorithm for (latin language) dictionary suggestions, and can work on whole texts. Two texts are similiar if they have basically the same words (eh letters) in the same order. So the following two book reviews would be fairly similiar:

1) "This is a great book"

2) "These are not great books"

(The number of letters to remove, insert, delete or alter to turn (2) into (1) is termed the 'edit distance'.)

To implement this you would want to visit every review programmatically. This is perhaps not as costly as it sounds, and if it is too costly you could do the comparisions as a background task and store the n-most-similiar in a database field itself.

Another approach is to understand something of the structure of (latin) languages. If you strip short (non-capitialised or quoted) words, and assign weights to words (or prefixes) that are common or unique, you can do a Bayesianesque comparision. The two following book reviews might be simiplied and found to be similiar:

3) "The french revolution was blah blah War and Peace blah blah France." -> France/French(2) Revolution(1) War(1) Peace(1) (note that a dictionary has been used to combine France and French)

4) "This book is blah blah a revolution in french cuisine." -> France(1) Revolution(1)

To implement this you would want to identify the 'keywords' in a review when it was created/updated, and to find similiar reviews use these keywords in the where-clause of a query (ideally 'full text' searching if the database supports it), with perhaps a post-processing of the results-set for scoring the candidates found.

Books also have categories - are thrillers set in France similiar to historical studies of France, and so on? Meta-data beyond title and text might be useful for keeping results relevant.

answered Oct 10 '22 07:10

Will

Related questions
                            
                                Remove characters from a string
                            
                                How to return all except last 2 characters of a string?
                            
                                How can I find whitespace in a String?
                            
                                C++ Remove new line from multiline string
                            
                                What is the difference between String and StringBuffer in Java?
                            
                                RegEx that will match the last occurrence of dot in a string
                            
                                Need to perform Wildcard (*,?, etc) search on a string using Regex
                            
                                How to find char in string and get all the indexes?
                            
                                What is a Java String's default initial value?
                            
                                formatting long numbers as strings in python
                            
                                How can I check if a string only contains letters in Python?
                            
                                Format string dynamically [duplicate]
                            
                                In C#: Add Quotes around string in a comma delimited list of strings
                            
                                First letter to upper case
                            
                                Python spacing and aligning strings
                            
                                Java Equivalent to .NET's String.Format
                            
                                How can I remove "\r\n" from a string in C#? Can I use a regular expression?
                            
                                Split First name and Last name using JavaScript
                            
                                How can I use the string value of a C# enum value in a case statement?
                            
                                How are text editors generally implemented?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Algorithm to find articles with similar text

Tags:

language-agnostic

string

text

algorithm

similarity

Osama Al-Maadeed

People also ask

2 Answers

Jay Kominek

Will

Recent Activity

Donate For Us