Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Match 2 lists of strings by ressemblance

Problem

I have 2 lists of strings. I want to find the best matching pairs from my lists.

For example, I have those 2 lists:

list1 = {"a1","b1","c1"}
list2 = {"a2","b2","c2"}

I want to get the following results:

results = {{"a1,"a2"}, {"b1,"b2"}, {"c1,"c2"}}

Additional Info

To compare 2 strings together, I would like to use something similar to the Levenshtein distance. For example, when I compare "a1" with "a2", it gives me a shorter distance than "a1" with "b2", so "a1"+"a2" would be considered a better match.

I gets complicated when different pairs gets the same distance results. You can't just take minimum distance for a specific item in list1, because another item in list1 could obtain the same distance with the same item in list2.

Question

Do you have suggestions of algorithms for that?

Where I am right now

You better not look at my finding first so you don't get influenced by my work.

I calculate the Levenshtein distance for each possible pair of string and store the results in a 2-dimension array. Then I build a single dimension array where each element has:

  • the pair (the i,j indexes in my 2-dimension array)
  • the distance

Then I sort this array by using distance element.

Finally, I go through the sorted array and resolve the items with a common distance together (all distance==0 first, then all distance==1, etc...). Every time, I resolve an element, I mark it in my 2D array, so I can quickly skip the resolved items in my sorted array.

I think I can better than this solution. It may not the most efficient in time and space.

like image 574
decasteljau Avatar asked Nov 29 '25 16:11

decasteljau


2 Answers

Once you have established the metric you want to use to keep track of the "distance" between two strings, be it the Levenshtein distance or another one, you can use the Hungarian algorithm to solve your problem.

I personally have never implement it, but Wikipedia includes several links that might be of help.

like image 64
abeln Avatar answered Dec 01 '25 09:12

abeln


My suggestion for a possible optimization to this:

I calculate the Levenshtein distance for each possible pair of string and store the results in a 2-dimension array.

Is that you can avoid computing the distance for every possible pair of the string by considering their lengths. Because let's say:

1. if the pair is e.g. "ab", and "cdefg"
2. and you know that there's another string that has similar length with "ab" e.g. "xy"

Then you shouldn't need to calculate the distance between "ab" and "cdefg". Because the minimum distance you can get between strings of those lengths is 3, whereas the maximum distance between two strings of equal lengths ("ab" and "xy" as in the example) will be 2.

You can do this by using a smarter data structure that keeps track of length of strings e.g. unordered_map<int, vector<string> > in C++0x or tr1 C++.

like image 44
ryaner Avatar answered Dec 01 '25 07:12

ryaner



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!