
What are some good methods to find the "relatedness" of two bodies of text?

Here's the problem: I have a few thousand small text snippets, anywhere from a few words to a few sentences; the largest snippet is about 2 KB on disk. I want to compare each snippet to every other one and calculate a relatedness factor, so that I can show users related information.

What are some good ways to do this? Are there known algorithms that work well for this? Are there any GPL'd solutions?

I don't need this to run in real time, as I can precalculate everything. I'm more concerned with getting good results than with runtime.

I just thought I would ask the Stack Overflow community before going and writing my own thing. There HAVE to be people out there who have found good solutions to this before.

Matt asked Aug 31 '09 18:08



1 Answer

These articles on semantic relatedness and semantic similarity may be helpful, as may this SO question about Latent Semantic Analysis.
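As a starting point before going all the way to Latent Semantic Analysis, here is a minimal sketch of a simpler baseline: TF-IDF weighting plus cosine similarity, precomputed for every pair of snippets. This is not LSA itself (LSA would additionally apply a truncated SVD to the term-document matrix), and the function names, the tokenizer, and the smoothed IDF formula are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphanumeric runs are good enough as "words".
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(snippets):
    """Return one sparse TF-IDF vector (dict of term -> weight) per snippet."""
    token_lists = [tokenize(s) for s in snippets]
    n = len(token_lists)
    # Document frequency: in how many snippets does each term appear?
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    # Smoothed IDF (one common variant, chosen here as an assumption).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    return dot / (norm_a * norm_b)


def most_related(snippets, top_k=3):
    """Precompute, for each snippet index, its top_k most similar snippets."""
    vectors = tfidf_vectors(snippets)
    related = {}
    for i, vi in enumerate(vectors):
        scored = [(cosine(vi, vj), j) for j, vj in enumerate(vectors) if j != i]
        scored.sort(reverse=True)
        related[i] = [(j, score) for score, j in scored[:top_k]]
    return related
```

With a few thousand snippets the all-pairs loop is a few million comparisons, which should be fine for an offline precalculation like the one described in the question. Note that this only captures shared vocabulary; snippets that are related but use different words are where LSA and the semantic similarity approaches come in.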

You could also look into Soundex for words that "sound alike" phonetically.
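For reference, a minimal sketch of the classic American Soundex encoding is below. The function name is my own, and treating any non-coded letter like a vowel is an assumption; words that get the same four-character code are considered to "sound alike".

```python
def soundex(word):
    """Return the 4-character American Soundex code for a word."""
    codes = {}
    for letters, digit in (
        ("BFPV", "1"),
        ("CGJKQSXZ", "2"),
        ("DT", "3"),
        ("L", "4"),
        ("MN", "5"),
        ("R", "6"),
    ):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""

    # The first letter is kept as-is; its code still suppresses an
    # immediately following letter with the same code.
    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept

    # Pad with zeros and truncate to one letter plus three digits.
    return (result + "000")[:4]
```

You could normalize each word to its Soundex code before comparing snippets, so that spelling variants such as "Robert" and "Rupert" count as the same term. Keep in mind that Soundex was designed for English surnames, so it is a fairly coarse tool for general text.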

jjclarkson answered Oct 09 '22 23:10
