Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best practice for determining the probability that 2 strings match

I need to write code to determine if 2 strings match when one of the strings may contain a small deviation from the second string e.g. "South Africa" v "South-Africa" or "England" v "Enlgand". At the moment, I am considering the following approach

  1. Determine the percentage of characters in string 1 that match those in string 2
  2. Determine the true probability of the match by combining the result of 1 with a comparison of the length of the 2 strings e.g. although all the characters in "SA" are found in "South Africa" it is not a very likely match since "SA" could be found in a range of other country names as well.

I would appreciate to hear what current best practice is for performing such string matching.

like image 430
RunLoop Avatar asked Nov 29 '22 11:11

RunLoop


1 Answers

You can look at Levenshtein distance. This is distance between two strings. The same strings have distance equal 0. Strings such as kitten and sitten have distance equal 1, and so on. Distance is measured by minimal number of simple operations that transform one string to another.

More information and algorithm in pseudo-code is given in link.

I also remember that this topic was mentioned in Game programming gems: volume 6: Article 1.6 Closest-String Matching Algorithm

like image 125
Dawid Avatar answered Dec 16 '22 23:12

Dawid