 

Text similarity Algorithms

I'm doing a Java project where I have to make a text similarity program. I want it to take two text documents, compare them with each other, and compute how similar they are to each other.

I'll later add a database that can find synonyms for words and go through the text to see if one of the document writers just changed words to synonyms while the text is otherwise exactly the same. Same thing with moving paragraphs up or down. Yes, as if it were a plagiarism program...

I want to hear from you people what kind of algorithms you would recommend.

I've found Levenshtein and cosine similarity by looking here and other places. Both of them seem to be mentioned a lot. Hamming distance is another one my teacher told me about.

I have some questions about those, since I'm not really getting the Wikipedia articles. Could someone explain these things to me?

Levenshtein: This algorithm counts the substitutions, additions, and deletions needed to change one word into another, and that tells you how close the two words are. But how can that be used on a whole text file? I can see how it can be used on a word, but not on a sentence or a whole text document compared to another.

Cosine: It's a measure of similarity between two vectors, obtained by measuring the cosine of the angle between them. What I don't understand here is how two texts can become two vectors, and what happens to the words/sentences in them?

Hamming: This distance seems to work better than Levenshtein, but it's only defined for strings of equal length. How can it be useful when two documents, and even the sentences in them, aren't strings of equal length?

Wikipedia should make sense, but it doesn't to me. I'm sorry if the questions sound too stupid, but it's getting me down, and I think there are people here who are quite capable of explaining it so even beginners in this field can get it.

Thanks for your time.

N00programmer asked Apr 26 '11




3 Answers

Levenshtein: in theory you could use it for a whole text file, but it's really not very suitable for the task. It's intended for single words or (at most) a short phrase.

Cosine: You start by simply counting the occurrences of each unique word in each document. The answers to a previous question cover the computation once you've done that.
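A minimal sketch of that approach in Java (the word counting and the cosine formula are standard; splitting on non-word characters is a simplifying assumption about tokenization):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CosineSimilarity {

    // Build a term-frequency vector: each unique word maps to its count.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                tf.merge(word, 1, Integer::sum);
            }
        }
        return tf;
    }

    // cos(theta) = (a . b) / (|a| * |b|), treating each document as a
    // vector with one dimension per unique word.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> allWords = new HashSet<>(a.keySet());
        allWords.addAll(b.keySet());
        double dot = 0, normA = 0, normB = 0;
        for (String word : allWords) {
            int x = a.getOrDefault(word, 0);
            int y = b.getOrDefault(word, 0);
            dot += x * y;
            normA += x * x;
            normB += y * y;
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double sim = cosine(termFrequencies("the cat sat on the mat"),
                            termFrequencies("the cat sat on a mat"));
        System.out.println(sim); // close to 1.0: the texts are very similar
    }
}

A result near 1.0 means near-identical word distributions; near 0.0 means the documents share almost no vocabulary.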

I've never used Hamming distance for this purpose, so I can't say much about it.

I would add TFIDF (Term Frequency * Inverse Document Frequency) to the list. It's fairly similar to cosine distance, but 1) it tends to do a better job on shorter documents, and 2) it does a better job of taking into account which words are extremely common across an entire corpus, rather than just the ones that happen to be common to the two particular documents being compared.
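As a rough sketch of the idea (assuming you already have per-document term counts like the ones from the cosine step above): one common variant weights term t in a document as tf(t) * log(N / df(t)), where N is the corpus size and df(t) is the number of documents containing t.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdf {

    // df(t): number of documents in the corpus that contain term t.
    static Map<String, Integer> documentFrequencies(List<Map<String, Integer>> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> doc : corpus) {
            for (String term : doc.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Replace raw counts with tf * idf weights; words that appear in
    // every document get weights near zero.
    static Map<String, Double> tfIdfVector(Map<String, Integer> doc,
                                           Map<String, Integer> df,
                                           int corpusSize) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            double idf = Math.log((double) corpusSize / df.get(e.getKey()));
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }
}

The resulting weighted vectors can then be compared with the same cosine formula as above, adapted to double values.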

One final note: for any of these to produce useful results, you nearly always need to screen out stop words before you compute the degree of similarity (though TFIDF seems to do better than the others if you skip this). At least in my experience, it's extremely helpful to stem the words (remove suffixes) as well. When I've done it, I used Porter's stemming algorithm.
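The stop-word screen itself can be very simple; a minimal sketch (the list below is just a tiny illustrative sample; real stop lists run to a few hundred words, and a stemmer would be applied after this filter):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopWords {

    // A tiny illustrative sample; a real stop list is much longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "on"));

    // Filter words through this before counting term frequencies.
    static boolean isStopWord(String word) {
        return STOP_WORDS.contains(word.toLowerCase());
    }
}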

For your purposes, you probably want to use what I've dubbed an inverted thesaurus, which lets you look up a word and substitute a single canonical word for each meaning. I tried this on one project and didn't find it as useful as expected, but it sounds like for your project it would probably be considerably more useful.
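A minimal sketch of that idea: a map from each word to one canonical word per meaning, applied before counting, so synonym swaps no longer hide a match (the entries below are made-up examples, not a real thesaurus):

import java.util.HashMap;
import java.util.Map;

public class InvertedThesaurus {

    // Map every synonym to one canonical word, so "big" and "huge"
    // both count as "large". Entries here are made-up examples.
    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        CANONICAL.put("big", "large");
        CANONICAL.put("huge", "large");
        CANONICAL.put("rapid", "fast");
        CANONICAL.put("quick", "fast");
    }

    static String canonicalize(String word) {
        return CANONICAL.getOrDefault(word.toLowerCase(), word.toLowerCase());
    }
}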

Jerry Coffin answered Oct 16 '22


The basic idea of comparing the similarity of two documents, a topic in information retrieval, is to extract a fingerprint from each and judge whether they share information based on those fingerprints.

Just a hint: the paper Winnowing: Local Algorithms for Document Fingerprinting may be a good choice and a good starting point for your problem.
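Very roughly, winnowing hashes every k-gram of the document and keeps, from each window of w consecutive hashes, the minimum one as a fingerprint; matching fingerprints between two documents then point at shared passages. A minimal sketch of that selection rule (the values of k and w and the use of String.hashCode are illustrative assumptions, not choices from the paper):

import java.util.ArrayList;
import java.util.List;

public class Winnowing {

    static List<Integer> fingerprints(String text, int k, int w) {
        // 1. Hash every k-gram of the text.
        List<Integer> hashes = new ArrayList<>();
        for (int i = 0; i + k <= text.length(); i++) {
            hashes.add(text.substring(i, i + k).hashCode());
        }
        // 2. In each window of w consecutive hashes, select the rightmost
        //    minimum; record it only when the selected position changes.
        List<Integer> prints = new ArrayList<>();
        int lastPos = -1;
        for (int i = 0; i + w <= hashes.size(); i++) {
            int minPos = i;
            for (int j = i; j < i + w; j++) {
                if (hashes.get(j) <= hashes.get(minPos)) {
                    minPos = j; // rightmost minimum on ties
                }
            }
            if (minPos != lastPos) {
                prints.add(hashes.get(minPos));
                lastPos = minPos;
            }
        }
        return prints;
    }
}

Two documents that share a long enough passage are guaranteed to share at least one fingerprint, which is what makes this attractive for plagiarism detection.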

Summer_More_More_Tea answered Oct 16 '22


Consider the example on Wikipedia for Levenshtein distance:

For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

   1. kitten → sitten (substitution of 's' for 'k')
   2. sitten → sittin (substitution of 'i' for 'e')
   3. sittin → sitting (insertion of 'g' at the end).

Now, replace "kitten" with "text from first paper", and "sitting" with "text from second paper".

Paper[] papers = getPapers(); // load every student's submission
for (int i = 0; i < papers.length - 1; i++) {
    for (int j = i + 1; j < papers.length; j++) { // each unordered pair once
        Paper first = papers[i];
        Paper second = papers[j];
        int dist = compareSimilarities(first.text, second.text);
        System.out.println(first.name + "'s paper compares to " + second.name
                + "'s paper with a similarity score of " + dist);
    }
}
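If you want a concrete compareSimilarities to start with, here's a minimal sketch using plain Levenshtein distance (the standard dynamic-programming algorithm; note that, as mentioned above, it gets expensive on long documents):

// One possible compareSimilarities: plain Levenshtein distance.
// dp[i][j] = edits needed to turn the first i chars of a into the first j of b.
static int compareSimilarities(String a, String b) {
    int[][] dp = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) dp[i][0] = i; // i deletions
    for (int j = 0; j <= b.length(); j++) dp[0][j] = j; // j insertions
    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
            dp[i][j] = Math.min(Math.min(
                    dp[i - 1][j] + 1,          // deletion
                    dp[i][j - 1] + 1),         // insertion
                    dp[i - 1][j - 1] + cost);  // substitution (or match)
        }
    }
    return dp[a.length()][b.length()];
}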

Compare those results and peg the kids with the lowest distance scores.

In your compareSimilarities method, you could use any or all of the comparison algorithms. Another one you could incorporate into the formula is "longest common substring" (which is a good method of finding plagiarism), as sketched below.
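A minimal dynamic-programming sketch of longest common substring (a long run of text shared verbatim between two papers is a strong plagiarism signal):

// Length of the longest substring that appears in both texts.
// dp[i][j] = length of the common suffix of a[0..i) and b[0..j).
static int longestCommonSubstring(String a, String b) {
    int[][] dp = new int[a.length() + 1][b.length() + 1];
    int best = 0;
    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            if (a.charAt(i - 1) == b.charAt(j - 1)) {
                dp[i][j] = dp[i - 1][j - 1] + 1;
                best = Math.max(best, dp[i][j]);
            }
        }
    }
    return best;
}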

corsiKa answered Oct 16 '22