 

Text similarity Algorithms

I'm doing a Java project where I have to make a text similarity program. I want it to take two text documents, compare them with each other, and compute how similar they are to each other.

I'll later add a database that can find synonyms for words and go through the text to see if one of the document writers just changed words to synonyms while the text is otherwise exactly the same. Same thing with moving paragraphs up or down. Yes, as if it were a plagiarism program...

I want to hear from you people what kind of algorithms you would recommend.

I've found Levenshtein and cosine similarity by looking here and other places. Both of them seem to be mentioned a lot. Hamming distance is another one my teacher told me about.

I have some questions about those, since I'm not really getting the Wikipedia articles. Could someone explain these things to me?

Levenshtein: This algorithm counts the substitutions, additions, and deletions needed to change one word into another, and that tells you how close the two words are. But how can that be used on a whole text file? I can see how it can be used on a word, but not on a sentence or a whole text document compared to another.

Cosine: It's a measure of similarity between two vectors, obtained by measuring the cosine of the angle between them. What I don't understand here is how two texts can become two vectors, and what happens to the words/sentences in them?

Hamming: This distance seems to work better than Levenshtein, but it's only defined for strings of equal length. How can it be useful when two documents, and even the sentences in them, aren't strings of equal length?

Wikipedia should make sense, but it doesn't to me. I'm sorry if the questions sound too stupid, but it's getting me down, and I think there are people here who are quite capable of explaining it so even beginners in this field can get it.

Thanks for your time.

N00programmer asked Apr 26 '11




3 Answers

Levenshtein: in theory you could use it for a whole text file, but it's really not very suitable for the task. It's intended for single words or (at most) a short phrase.

Cosine: You start by simply counting the occurrences of each unique word in each document. The answers to a previous question cover the computation once you've done that.
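A minimal sketch of that approach in Java (the word counting and the cosine formula are standard; splitting on non-word characters is a simplifying assumption about tokenization):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CosineSimilarity {

    // Build a term-frequency vector: each unique word maps to its count.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                tf.merge(word, 1, Integer::sum);
            }
        }
        return tf;
    }

    // cos(theta) = (a . b) / (|a| * |b|), treating each document as a
    // vector with one dimension per unique word.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> allWords = new HashSet<>(a.keySet());
        allWords.addAll(b.keySet());
        double dot = 0, normA = 0, normB = 0;
        for (String word : allWords) {
            int x = a.getOrDefault(word, 0);
            int y = b.getOrDefault(word, 0);
            dot += x * y;
            normA += x * x;
            normB += y * y;
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double sim = cosine(termFrequencies("the cat sat on the mat"),
                            termFrequencies("the cat sat on a mat"));
        System.out.println(sim); // close to 1.0: the texts are very similar
    }
}

A result near 1.0 means near-identical word distributions; near 0.0 means the documents share almost no vocabulary.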

I've never used Hamming distance for this purpose, so I can't say much about it.

I would add TFIDF (Term Frequency * Inverse Document Frequency) to the list. It's fairly similar to cosine distance, but 1) it tends to do a better job on shorter documents, and 2) it does a better job of taking into account which words are extremely common across an entire corpus, rather than just the ones that happen to be common to the two particular documents being compared.
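As a rough sketch of the idea (assuming you already have per-document term counts like the ones from the cosine step above): one common variant weights term t in a document as tf(t) * log(N / df(t)), where N is the corpus size and df(t) is the number of documents containing t.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdf {

    // df(t): number of documents in the corpus that contain term t.
    static Map<String, Integer> documentFrequencies(List<Map<String, Integer>> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> doc : corpus) {
            for (String term : doc.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Replace raw counts with tf * idf weights; words that appear in
    // every document get weights near zero.
    static Map<String, Double> tfIdfVector(Map<String, Integer> doc,
                                           Map<String, Integer> df,
                                           int corpusSize) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            double idf = Math.log((double) corpusSize / df.get(e.getKey()));
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }
}

The resulting weighted vectors can then be compared with the same cosine formula as above, adapted to double values.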

One final note: for any of these to produce useful results, you nearly always need to screen out stop words before you compute the degree of similarity (though TFIDF seems to do better than the others if you skip this). At least in my experience, it's extremely helpful to stem the words (remove suffixes) as well. When I've done it, I used Porter's stemming algorithm.
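The stop-word screen itself can be very simple; a minimal sketch (the list below is just a tiny illustrative sample; real stop lists run to a few hundred words, and a stemmer would be applied after this filter):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopWords {

    // A tiny illustrative sample; a real stop list is much longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "on"));

    // Filter words through this before counting term frequencies.
    static boolean isStopWord(String word) {
        return STOP_WORDS.contains(word.toLowerCase());
    }
}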

For your purposes, you probably want to use what I've dubbed an inverted thesaurus, which lets you look up a word and substitute a single canonical word for each meaning. I tried this on one project and didn't find it as useful as expected, but it sounds like for your project it would probably be considerably more useful.
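A minimal sketch of that idea: a map from each word to one canonical word per meaning, applied before counting, so synonym swaps no longer hide a match (the entries below are made-up examples, not a real thesaurus):

import java.util.HashMap;
import java.util.Map;

public class InvertedThesaurus {

    // Map every synonym to one canonical word, so "big" and "huge"
    // both count as "large". Entries here are made-up examples.
    private static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        CANONICAL.put("big", "large");
        CANONICAL.put("huge", "large");
        CANONICAL.put("rapid", "fast");
        CANONICAL.put("quick", "fast");
    }

    static String canonicalize(String word) {
        return CANONICAL.getOrDefault(word.toLowerCase(), word.toLowerCase());
    }
}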

Jerry Coffin answered Oct 16 '22


The basic idea of comparing the similarity of two documents, a topic in information retrieval, is to extract a fingerprint from each and judge whether they share information based on those fingerprints.

Just a hint: the paper Winnowing: Local Algorithms for Document Fingerprinting may be a good choice and a good starting point for your problem.
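Very roughly, winnowing hashes every k-gram of the document and keeps, from each window of w consecutive hashes, the minimum one as a fingerprint; matching fingerprints between two documents then point at shared passages. A minimal sketch of that selection rule (the values of k and w and the use of String.hashCode are illustrative assumptions, not choices from the paper):

import java.util.ArrayList;
import java.util.List;

public class Winnowing {

    static List<Integer> fingerprints(String text, int k, int w) {
        // 1. Hash every k-gram of the text.
        List<Integer> hashes = new ArrayList<>();
        for (int i = 0; i + k <= text.length(); i++) {
            hashes.add(text.substring(i, i + k).hashCode());
        }
        // 2. In each window of w consecutive hashes, select the rightmost
        //    minimum; record it only when the selected position changes.
        List<Integer> prints = new ArrayList<>();
        int lastPos = -1;
        for (int i = 0; i + w <= hashes.size(); i++) {
            int minPos = i;
            for (int j = i; j < i + w; j++) {
                if (hashes.get(j) <= hashes.get(minPos)) {
                    minPos = j; // rightmost minimum on ties
                }
            }
            if (minPos != lastPos) {
                prints.add(hashes.get(minPos));
                lastPos = minPos;
            }
        }
        return prints;
    }
}

Two documents that share a long enough passage are guaranteed to share at least one fingerprint, which is what makes this attractive for plagiarism detection.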

Summer_More_More_Tea answered Oct 16 '22


Consider the example on Wikipedia for Levenshtein distance:

For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

   1. kitten → sitten (substitution of 's' for 'k')
   2. sitten → sittin (substitution of 'i' for 'e')
   3. sittin → sitting (insertion of 'g' at the end).

Now, replace "kitten" with "text from first paper", and "sitting" with "text from second paper".

Paper[] papers = getPapers(); // load every student's submission
for (int i = 0; i < papers.length - 1; i++) {
    for (int j = i + 1; j < papers.length; j++) { // each unordered pair once
        Paper first = papers[i];
        Paper second = papers[j];
        int dist = compareSimilarities(first.text, second.text);
        System.out.println(first.name + "'s paper compares to " + second.name
                + "'s paper with a similarity score of " + dist);
    }
}
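If you want a concrete compareSimilarities to start with, here's a minimal sketch using plain Levenshtein distance (the standard dynamic-programming algorithm; note that, as mentioned above, it gets expensive on long documents):

// One possible compareSimilarities: plain Levenshtein distance.
// dp[i][j] = edits needed to turn the first i chars of a into the first j of b.
static int compareSimilarities(String a, String b) {
    int[][] dp = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) dp[i][0] = i; // i deletions
    for (int j = 0; j <= b.length(); j++) dp[0][j] = j; // j insertions
    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
            dp[i][j] = Math.min(Math.min(
                    dp[i - 1][j] + 1,          // deletion
                    dp[i][j - 1] + 1),         // insertion
                    dp[i - 1][j - 1] + cost);  // substitution (or match)
        }
    }
    return dp[a.length()][b.length()];
}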

Compare those results and peg the kids with the lowest distance scores.

In your compareSimilarities method, you could use any or all of the comparison algorithms. Another one you could incorporate into the formula is "longest common substring" (which is a good method of finding plagiarism), as sketched below.
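A minimal dynamic-programming sketch of longest common substring (a long run of text shared verbatim between two papers is a strong plagiarism signal):

// Length of the longest substring that appears in both texts.
// dp[i][j] = length of the common suffix of a[0..i) and b[0..j).
static int longestCommonSubstring(String a, String b) {
    int[][] dp = new int[a.length() + 1][b.length() + 1];
    int best = 0;
    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            if (a.charAt(i - 1) == b.charAt(j - 1)) {
                dp[i][j] = dp[i - 1][j - 1] + 1;
                best = Math.max(best, dp[i][j]);
            }
        }
    }
    return best;
}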

corsiKa answered Oct 16 '22