Text similarity algorithm

I have two subtitle files. I need a function that tells whether they represent the same text, or similar text.

Sometimes there are comments like "The wind is blowing... the music is playing" in one file only. But 80% of the contents will be the same, so the function must return TRUE (the files represent the same text). And sometimes there are misspellings such as the digit 1 instead of the letter l, as here: "She 1eft the baggage". Of course, the function must return TRUE in that case too.

My comments:
"The function should return the percentage similarity of the texts" - AGREE

"all the people were happy" and "all the people were not happy" - here the extra word would be treated like a misspelling, so the two would be considered the same text. To be exact, the percentage the function returns will be lower, but still high enough to say the phrases are similar.

"Do consider whether you want to apply Levenshtein on a whole file or just a search string" - I'm not sure about Levenshtein, but the algorithm must be applied to the file as a whole. It'll be a very long string, though.
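One possible way to get a percentage for whole files is sketched below. It is only an illustration under some assumptions that are not in the question: the texts are split into lowercase words on whitespace, the edit distance is computed over words rather than characters (which keeps the "very long string" manageable), and the percentage is defined as 1 minus distance divided by the longer word count. The class name `SubtitleSimilarity` and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: word-level edit distance turned into a similarity percentage. */
final class SubtitleSimilarity {
    private SubtitleSimilarity() {}

    /** Splits text into lowercase words on whitespace (an assumed normalization). */
    static List<String> tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /** Edit distance over words: inserting, deleting, or replacing one word costs 1. */
    static int wordDistance(List<String> a, List<String> b) {
        // Only two rows of the table are kept, so memory grows with b.size(), not a.size() * b.size().
        int[] prev = new int[b.size() + 1];
        int[] curr = new int[b.size() + 1];
        for (int j = 0; j <= b.size(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.size(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.size(); j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.size()];
    }

    /** Returns a value from 0 to 100; two empty texts count as 100% similar. */
    static double similarityPercent(String textA, String textB) {
        List<String> a = tokenize(textA);
        List<String> b = tokenize(textB);
        int longest = Math.max(a.size(), b.size());
        if (longest == 0) {
            return 100.0;
        }
        return 100.0 * (1.0 - (double) wordDistance(a, b) / longest);
    }
}
```

With this definition, "all the people were happy" against "all the people were not happy" differs by one word out of six, so `similarityPercent` gives roughly 83%. Whether the files count as "the same text" would then be a comparison against a cutoff of your choosing; the 80% figure above is one candidate, not something the sketch enforces. Running time still grows with the product of the two word counts.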


asked Feb 24 '10 11:02

EugeneP

1 Answer

Levenshtein algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance

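Anything other than a result of zero means the texts are not "identical". "Similar" is a measure of how near or far apart they are. The result is an integer: the number of single-character insertions, deletions, and substitutions needed to turn one text into the other.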
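For reference, a minimal Java sketch of the character-level distance described above. The class name `Levenshtein` and method name `distance` are illustrative, not from any particular library, and the two-row table is just one common way to write it.

```java
/** Illustrative character-level Levenshtein distance. */
final class Levenshtein {
    private Levenshtein() {}

    /** Minimum number of single-character insertions, deletions, and substitutions turning a into b. */
    static int distance(String a, String b) {
        // prev[j] is the distance between the previous prefix of a and the first j characters of b.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // an empty prefix of a needs j insertions to become b's prefix
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // i deletions turn a's prefix into the empty string
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // match or substitution
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For example, `Levenshtein.distance("She 1eft the baggage", "She left the baggage")` is 1, since a single substitution fixes the misspelling, and two equal strings give 0.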

answered Oct 05 '22 15:10

bcosca


Tags: java, text, nlp, levenshtein-distance, similarity
