
How can I measure the similarity between 2 strings? [closed]

Given two strings text1 and text2:

    public SOMEUSABLERETURNTYPE Compare(string text1, string text2)
    {
        // DO SOMETHING HERE TO COMPARE
    }

Examples:

  1. First String: StackOverflow

    Second String: StaqOverflow

    Return: Similarity is 91%

    The return can be in % or something like that.

  2. First String: The simple text test

    Second String: The complex text test

    Return: The values can be considered equal

Any ideas? What is the best way to do this?

Zanoni asked Jun 23 '09 19:06



2 Answers

There are various ways of doing this. Have a look at the Wikipedia "String similarity measures" page for links to other pages with algorithms.

I don't think any of those algorithms take sounds into consideration, however - so "staq overflow" would be as similar to "stack overflow" as "staw overflow" despite the first being more similar in terms of pronunciation.

I've just found another page which gives rather more options... in particular, the Soundex algorithm (Wikipedia) may be closer to what you're after.
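To illustrate the point about pronunciation, here is a rough sketch of American Soundex in C#. The class and method names (`Phonetic`, `Soundex`, `CodeFor`) are made up for this example, and skipping non-letter characters is an assumption; treat it as an illustration of the idea rather than a reference implementation.

```csharp
using System;
using System.Text;

public static class Phonetic
{
    // Returns the 4-character American Soundex code of a word,
    // or an empty string if the input contains no ASCII letters.
    public static string Soundex(string word)
    {
        if (word == null) throw new ArgumentNullException(nameof(word));

        var result = new StringBuilder();
        char lastCode = '0';

        foreach (char raw in word)
        {
            char c = char.ToUpperInvariant(raw);
            if (c < 'A' || c > 'Z') continue; // assumption: ignore non-letters

            char code = CodeFor(c);
            if (result.Length == 0)
            {
                // The first letter is kept as-is, but its code still
                // suppresses an immediately following duplicate.
                result.Append(c);
                lastCode = code;
            }
            else if (c == 'H' || c == 'W')
            {
                // H and W are dropped and do not separate duplicate codes.
                continue;
            }
            else
            {
                if (code != '0' && code != lastCode) result.Append(code);
                lastCode = code; // vowels reset this to '0'
            }

            if (result.Length == 4) break;
        }

        return result.Length == 0 ? string.Empty : result.ToString().PadRight(4, '0');
    }

    // Maps a letter to its Soundex digit; '0' means "not coded".
    private static char CodeFor(char c)
    {
        switch (c)
        {
            case 'B': case 'F': case 'P': case 'V': return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z': return '2';
            case 'D': case 'T': return '3';
            case 'L': return '4';
            case 'M': case 'N': return '5';
            case 'R': return '6';
            default: return '0';
        }
    }
}
```

With this sketch, `Phonetic.Soundex("stack")` and `Phonetic.Soundex("staq")` both give `S320`, while `Phonetic.Soundex("staw")` gives `S300`, so the phonetically closer misspelling is the one that matches.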

Jon Skeet answered Sep 21 '22 17:09


Levenshtein distance is probably what you're looking for.
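A minimal sketch of how this could fill in the `Compare` method from the question, assuming a return value between 0 and 1 is acceptable. The class name `StringSimilarity` and the normalization (1 minus the distance divided by the longer string's length) are assumptions made for this example; other normalizations are possible.

```csharp
using System;

public static class StringSimilarity
{
    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions needed to turn a into b.
    // Uses the standard dynamic-programming recurrence with two rows.
    public static int LevenshteinDistance(string a, string b)
    {
        if (a == null) throw new ArgumentNullException(nameof(a));
        if (b == null) throw new ArgumentNullException(nameof(b));

        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];

        // Turning the empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i; // i deletions to reach the empty prefix of b
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1,   // insertion
                             previous[j] + 1),     // deletion
                    previous[j - 1] + cost);       // substitution or match
            }

            int[] swap = previous;
            previous = current;
            current = swap;
        }

        return previous[b.Length];
    }

    // Assumed normalization: 1.0 for identical strings, 0.0 when every
    // position of the longer string has to change.
    public static double Compare(string text1, string text2)
    {
        int distance = LevenshteinDistance(text1, text2);
        int longest = Math.Max(text1.Length, text2.Length);
        return longest == 0 ? 1.0 : 1.0 - (double)distance / longest;
    }
}
```

Under this normalization, "StackOverflow" and "StaqOverflow" are at distance 2 and score roughly 0.85. That differs from the 91% in the question, which does not say how its figure was computed.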

LiraNuna answered Sep 24 '22 17:09