Assume two sets of strings:
[ "Mr. Jones", "O'Flaherty", "Bob", "Rob Jenkins" ]
[ "Maxwell O'Flaherty", "Robert Jenkins", "Mrs. Smith" ]
It is obvious that those two sets have Maxwell O'Flaherty and Robert Jenkins in common.
Is there any algorithm that will allow us to do such matching programmatically? I am thinking of writing something that goes through each element in an array of strings and tries to find a substring that is unique, i.e. not contained in any other element of either set, and then uses that substring as a kind of hash of the element to match up the two sets.
You may find the Levenshtein distance useful. If you are doing a lot of this and it is unclear how accurate the information is, there are libraries for string disambiguation. (It's not "obvious" that Rob and Robert are identical; indeed, the first one could be Robin.)
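To make the Levenshtein suggestion concrete, here is a minimal sketch in Python. The function names `levenshtein` and `closest` are just illustrative choices, not part of any library, and picking the candidate with the smallest whole-string distance is only one assumed way of using the metric.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # previous[j] holds the distance between the prefix of a processed
    # so far and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) ca
            ))
        previous = current
    return previous[-1]


def closest(name: str, candidates: List[str]) -> Tuple[str, int]:
    """Hypothetical helper: return the candidate with the smallest
    distance to name, together with that distance."""
    return min(((c, levenshtein(name, c)) for c in candidates),
               key=lambda pair: pair[1])


if __name__ == "__main__":
    first = ["Mr. Jones", "O'Flaherty", "Bob", "Rob Jenkins"]
    second = ["Maxwell O'Flaherty", "Robert Jenkins", "Mrs. Smith"]
    for name in first:
        print(name, "->", closest(name, second))
```

Note that `closest` always returns something, so "Bob" and "Mr. Jones" get a nearest neighbour too even though they have no real counterpart; you would need a cutoff of your own choosing to reject those. Whole-string distance also penalises a missing first name heavily ("O'Flaherty" is 8 edits away from "Maxwell O'Flaherty"), so comparing word by word may work better for names.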