What are some good methods to find the "relatedness" of two bodies of text?

Question

Here's the problem: I have a few thousand small text snippets, anywhere from a few words to a few sentences; the largest snippet is about 2 KB on disk. I want to compare every snippet with every other one and calculate a relatedness factor so that I can show users related information.

What are some good ways to do this? Are there known algorithms that work well for this, or any GPL'd solutions?

I don't need this to run in real time, as I can precalculate everything. I'm more concerned with getting good results than with runtime.

I just thought I would ask the Stack Overflow community before going and writing my own thing. There HAVE to be people out there who have found good solutions to this before.

jjclarkson · Accepted Answer

These articles on semantic relatedness and semantic similarity may be helpful, as may this SO question about Latent Semantic Analysis.
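As a starting point before reaching for full Latent Semantic Analysis, here is a minimal sketch of a simpler baseline: TF-IDF weighting plus cosine similarity, in plain Python. This is not LSA itself (LSA additionally applies a truncated SVD to the term-document matrix), and the function names `tokenize`, `tfidf_vectors`, `cosine`, and `related` are made up for illustration. Since everything can be precalculated, the all-pairs loop is fine for a few thousand snippets.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(snippets):
    # One sparse vector (dict of term -> weight) per snippet.
    docs = [Counter(tokenize(s)) for s in snippets]
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())  # document frequency: count each term once per snippet
    # Smoothed IDF (an assumption of this sketch): rarer terms get larger weights.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]


def cosine(a, b):
    # Cosine similarity of two sparse vectors; 0.0 if either is empty.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def related(snippets, top_k=3):
    # For each snippet, return the indices of the top_k most similar other snippets.
    vectors = tfidf_vectors(snippets)
    result = []
    for i, v in enumerate(vectors):
        scores = [(cosine(v, w), j) for j, w in enumerate(vectors) if j != i]
        scores.sort(reverse=True)
        result.append([j for _, j in scores[:top_k]])
    return result
```

Calling `related(snippets)` once and storing the result gives a precomputed "related items" list per snippet. Snippets that share no words score 0.0 here, which is exactly the gap that LSA and other semantic methods try to close.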

You could also look into Soundex to match words that sound alike phonetically.
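For reference, a rough sketch of the classic American Soundex coding in plain Python. The function name `soundex` and the choice to ignore non-ASCII characters are assumptions of this sketch, not part of any particular library. Two words are treated as "sounding alike" when their four-character codes match.

```python
import string

# Consonant groups that share a Soundex digit; vowels, y, h and w have no digit.
_CODES = {}
for _digit, _letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(word):
    # Keep ASCII letters only (an assumption of this sketch).
    letters = [c for c in word.lower() if c in string.ascii_lowercase]
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it is kept
        if len(result) == 4:
            break
    return result.ljust(4, "0")  # pad short codes with zeros
```

For example, `soundex("Robert") == soundex("Rupert")` is true, since both encode to `R163`. Note that Soundex works on single words (it was designed for surnames), so for snippets it would be applied token by token, for instance to normalize misspellings before a word-based comparison.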

Tags:

comparison

full-text-search

string-comparison

information-retrieval

Matt
