Text similarity algorithm

I have two subtitle files. I need a function that tells whether they represent the same text or similar text.

Sometimes there are comments like "The wind is blowing... the music is playing" in one file only, but 80% of the contents will be the same. In that case the function must return TRUE (the files represent the same text). Sometimes there are also misspellings, such as 1 instead of l (the digit one instead of the letter L), as here: She 1eft the baggage. Of course, the function must return TRUE in that case too.

My comments:
The function should return the percentage of similarity between the texts - AGREE

"all the people were happy" and "all the people were not happy" - here the difference would be treated as a misspelling, so the two would count as the same text. To be exact, the percentage the function returns will be lower, but still high enough to say the phrases are similar.

Do consider whether you want to apply Levenshtein on a whole file or just a search string - I'm not sure about Levenshtein, but the algorithm must be applied to the file as a whole. It'll be a very long string, though.

EugeneP asked Feb 24 '10 11:02




1 Answer

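Use the Levenshtein algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance

The result is an integer: the number of single-character insertions, deletions, or substitutions needed to turn one text into the other. Anything other than zero means the texts are not "identical". "Similar" is then a measure of how near or far apart they are.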
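To get the percentage the question asks for, the integer distance can be scaled by the length of the longer text. Here is a minimal sketch in Python (the question names no language, so that choice is an assumption). The helper `similarity_percent` and its normalization are illustrative, not part of any standard definition, and any cut-off such as the 80% mentioned in the question is something you would have to tune yourself.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: insertions, deletions and substitutions each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer text, keep rows as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Hypothetical helper: 100 * (1 - distance / length of the longer text)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty texts are treated as identical
    return 100.0 * (1 - levenshtein(a, b) / longest)


if __name__ == "__main__":
    # The two examples from the question.
    print(levenshtein("She 1eft the baggage", "She left the baggage"))
    print(similarity_percent("all the people were happy",
                             "all the people were not happy"))
```

With this normalization the "1eft" example is one edit away from the correct line, and the "not happy" pair differs by four edits out of 29 characters, so both score well above 80%. Keep in mind that the running time grows with the product of the two lengths, so on whole subtitle files it may be more practical to compare line by line or word by word; this sketch only stores two rows at a time, which keeps memory small but does not help with time.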

bcosca answered Oct 05 '22 15:10
