What tried and true algorithms for suggesting related articles are out there?

Tags:

Pretty common situation, I'd wager. You have a blog or news site and you have plenty of articles or blags or whatever you call them, and you want to, at the bottom of each, suggest others that seem to be related.

Let's assume very little metadata about each item. That is, no tags, categories. Treat as one big blob of text, including the title and author name.

How do you go about finding the possibly related documents?

I'm rather interested in the actual algorithm, not ready solutions, although I'd be ok with taking a look at something implemented in ruby or python, or relying on mysql or pgsql.

edit: the current answer is pretty good but I'd like to see more. Maybe some really bare example code for a thing or two.

462

asked Aug 10 '09 12:08

kch

1 Answers

This is a pretty big topic -- in addition to the answers people come up with here, I recommend tracking down the syllabi for a couple of information retrieval classes and checking out the textbooks and papers assigned for them. That said, here's a brief overview from my own grad-school days:

The simplest approach is called a bag of words. Each document is reduced to a sparse vector of {word: wordcount} pairs, and you can throw a NaiveBayes (or some other) classifier at the set of vectors that represents your set of documents, or compute similarity scores between each bag and every other bag (this is called k-nearest-neighbour classification). KNN is fast for lookup, but requires O(n^2) storage for the score matrix; however, for a blog, n isn't very large. For something the size of a large newspaper, KNN rapidly becomes impractical, so an on-the-fly classification algorithm is sometimes better. In that case, you might consider a ranking support vector machine. SVMs are neat because they don't constrain you to linear similarity measures, and are still quite fast.

Stemming is a common preprocessing step for bag-of-words techniques; this involves reducing morphologically related words, such as "cat" and "cats", "Bob" and "Bob's", or "similar" and "similarly", down to their roots before computing the bag of words. There are a bunch of different stemming algorithms out there; the Wikipedia page has links to several implementations.

If bag-of-words similarity isn't good enough, you can abstract it up a layer to bag-of-N-grams similarity, where you create the vector that represents a document based on pairs or triples of words. (You can use 4-tuples or even larger tuples, but in practice this doesn't help much.) This has the disadvantage of producing much larger vectors, and classification will accordingly take more work, but the matches you get will be much closer syntactically. OTOH, you probably don't need this for semantic similarity; it's better for stuff like plagiarism detection. Chunking, or reducing a document down to lightweight parse trees, can also be used (there are classification algorithms for trees), but this is more useful for things like the authorship problem ("given a document of unknown origin, who wrote it?").

Perhaps more useful for your use case is concept mining, which involves mapping words to concepts (using a thesaurus such as WordNet), then classifying documents based on similarity between concepts used. This often ends up being more efficient than word-based similarity classification, since the mapping from words to concepts is reductive, but the preprocessing step can be rather time-consuming.

Finally, there's discourse parsing, which involves parsing documents for their semantic structure; you can run similarity classifiers on discourse trees the same way you can on chunked documents.

These pretty much all involve generating metadata from unstructured text; doing direct comparisons between raw blocks of text is intractable, so people preprocess documents into metadata first.

174

answered Oct 06 '22 01:10

Meredith L. Patterson

Related questions
                            
                                Is there any free DBF file converter? [closed]
                            
                                Is there an official name for the many-to-many relationship table in a database schema?
                            
                                How to select rows from data.frame with 2 conditions
                            
                                Logging facilities and Qt
                            
                                Lemmatization java [closed]
                            
                                Strange problem comparing floats in objective-C
                            
                                Can't invoke git-svn from command line
                            
                                Programmatically assigning a color to a row in DataGrid
                            
                                What is safer? Should I send an email with a URL that expires to users to reset their password or should I email a newly generated password?
                            
                                what is relation between GC, Finalize() and Dispose?
                            
                                Why is break required after yield return in a switch statement?
                            
                                Using Powershell's bitwise operators

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

What tried and true algorithms for suggesting related articles are out there?

Tags:

kch

People also ask

1 Answers

Meredith L. Patterson

Recent Activity

Donate For Us