How to determine a string's "DNA" for likeness to another

I am hoping I am wording this correctly to get across what I am looking for.

I need to compare two pieces of text. If the two strings are alike, I would like to get scores that are very alike; if the strings are very different, I need scores that are very different.

If I take an MD5 hash of an email and change one character, the hash changes dramatically. I want something that does not change too much. I need to compare how alike two pieces of content are without storing the string.

Update: I am now looking at combining some ideas from the various links people have provided. Ideally I would have liked a single-input function to create my score, so I am looking at using a reference string to always compare my input to. I am also looking at taking the ASCII character values and summing them up. Still reading all the links provided.

Paul Whelan asked Apr 28 '09 12:04


2 Answers

What you're looking for is an LCS algorithm (see also Levenshtein distance). You may also try Soundex or some other phonetic algorithm.
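To make the Levenshtein suggestion concrete, here is a minimal sketch in Python. The function names `levenshtein` and `similarity` are my own for illustration, and scaling the distance into a 0..1 score by dividing by the longer length is an assumption on my part, just one common way to normalise it.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep each row short
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical helper: scale the distance into 0.0 (different) .. 1.0 (equal)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(a, b) / longest
```

Changing one character of a long email moves `similarity` only slightly, which is the opposite of how an MD5 hash behaves. Note that this compares two strings directly, so both have to be available at comparison time.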

Anton Gogolev answered Nov 16 '22 02:11


Reading your comments, it sounds like you are actually trying to compare entire documents, each containing many words.

This is done successfully in information retrieval systems by treating documents as N-dimensional points in space. Each word in the language is an axis. The distance along the axis is determined by the number of times that word appears in the document. Similar documents are then "near" each other in space.

This way, the whole document doesn't need to be stored, just its word counts. And usually the most common words in the language are not counted at all.
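A rough sketch of the idea in Python, under a few assumptions of mine: the names `word_counts` and `cosine_similarity` are made up for illustration, the `STOP_WORDS` set is a tiny placeholder rather than a real stop-word list, and I use cosine similarity as the measure of "nearness" (Euclidean distance between the count vectors would be another option).

```python
import math
import re
from collections import Counter

# Illustrative placeholder only; a real system would use a fuller stop-word list.
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "it"})


def word_counts(text: str) -> Counter:
    """Reduce a document to its word counts, skipping very common words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two count vectors: 1.0 same direction, 0.0 no shared words."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc1 = word_counts("The cat sat on the mat")
doc2 = word_counts("A cat sat on a mat in the sun")
print(cosine_similarity(doc1, doc2))
```

Only the `Counter` for each document has to be kept, not the text itself, so two stored counters can be compared later without the original strings.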

erickson answered Nov 16 '22 02:11
