I am trying to compare two lists of names and addresses to find the unique entries. I can easily extract all the entries that are exactly the same string in both lists, but I am then left with names and addresses that differ yet may belong to the same people, e.g.:
entry in list 1 Smith J Ph234567 34 Smith st
entry in list 2 Smith John Ph234567 34 Smith st
or
entry in list 1 Smith J Ph234567 34 Smith Rd
entry in list 2 Smith J Ph234567 34 Smith Road
I want to add a tag to entries that appear to be similar to each other, say an 80% match or better.
Nested For Each loops don't work, as they match every word or letter (depending on how you write it) in the string against every other word or letter.
Plain For loops don't work either, as a single change (J vs. John) creates errors for every entry after the change.
I am writing it in VB.NET, but I can also translate from C#.
This kind of problem is generally solved by calculating the edit distance between the strings. Start with the Levenshtein distance, for instance.
This will give you a score (the number of "edit operations" necessary to transform one string into the other). To convert this into a percent identity you need to normalise it by the length of the larger (longer) string, something along the lines of percent = (largerString.Length - editDistance) / largerString.Length.
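As a rough illustration of the idea, here is a minimal sketch in Python (not VB.NET, but short enough to translate line by line). The function names levenshtein and similarity and the 0.8 threshold are assumptions made for this example; the entries are the ones from the question.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    # Keep only two rows of the dynamic-programming table.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalised by the longer string, in the range 0.0 to 1.0."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 1.0  # two empty strings are treated as identical
    return (longer - levenshtein(a, b)) / longer


THRESHOLD = 0.8  # assumed cut-off for "probably the same person"

pairs = [
    ("Smith J Ph234567 34 Smith st", "Smith John Ph234567 34 Smith st"),
    ("Smith J Ph234567 34 Smith Rd", "Smith J Ph234567 34 Smith Road"),
]
for first, second in pairs:
    score = similarity(first, second)
    tag = "possible match" if score >= THRESHOLD else "different"
    print(f"{score:.0%} {tag}: {first!r} / {second!r}")
```

Both example pairs come out at roughly 90% or more, so they would be tagged as possible matches. When porting this, note that in C# dividing two integers truncates, so cast to double before dividing; in VB.NET the / operator already performs floating-point division.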