 

Algorithm to best match item pairs given similarity score

I am trying to match two lists of products by name.

Products come from different websites, and their names may vary from one website to the other in many subtle ways, e.g. "iPhone 128 GB" vs "Apple iPhone 128GB".

The product lists intersect, but are not equal and one is not a superset of the other; i.e. some products from list A are not in list B, and vice versa.

Given an algorithm that compares two strings (product names) and returns a similarity score between 0 and 1 (I already have a satisfactory implementation here), I'm looking for an algorithm that performs an optimal match of list A to list B.

In other words, I think I'm looking for an algorithm that maximizes the sum of all similarity scores in the matches.

Note that a product from one list must be matched to at most one product from the other list.

My initial idea

  • for each product in A, get the similarity with each product in B, and retain the product that yields the highest score, provided that it exceeds a certain threshold, such as 0.75. Match these products.
  • if the product with the highest score was already matched to another product in A earlier in the loop, take the next-highest, provided that it exceeds the threshold above. Match to this one instead.

etc.

My worry with this naive implementation is that if there's a better match later in the loop, but the product from B has already been assigned to another product from A in a previous iteration, the matching is not optimal.
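The greedy loop above can be sketched as follows. This is a minimal sketch, assuming a `similarity(a, b)` function returning a score in [0, 1] (my actual implementation isn't shown here) and the 0.75 threshold mentioned above:

```python
def greedy_match(list_a, list_b, similarity, threshold=0.75):
    """Naive greedy matching: each item in A takes its best still-unmatched item in B."""
    matched_b = set()
    matches = {}
    for a in list_a:
        # Score every still-unmatched candidate in B.
        candidates = [(similarity(a, b), b) for b in list_b if b not in matched_b]
        if not candidates:
            break
        score, best_b = max(candidates)
        if score >= threshold:
            matches[a] = best_b
            matched_b.add(best_b)
    return matches
```

Note how the result depends on the iteration order of `list_a`: an early item can "steal" a product from B that a later item would have matched better.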

An improved version

To ensure that the product is matched to its highest similarity counterpart, I thought of the following implementation:

  • pre-compute the similarity scores for all A-B pairs
  • discard the similarities lower than the threshold used above
  • order by similarity, highest first
  • for each pair, if neither product A nor product B has already been matched, match these products.

This algorithm should match product pairs better, ensuring that each pair is formed from the highest similarity still available.

My worry is that it's very compute- and memory-intensive: say I have 5,000 products in both lists, that is 25,000,000 similarity scores to pre-compute and potentially store in memory (or database); in reality it will be lower due to the minimum required threshold, but it can still get very large and is still CPU intensive.
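The improved version can be sketched like this (again assuming a hypothetical `similarity` function; the threshold filter is applied before sorting, as in the steps above):

```python
def sorted_pair_match(list_a, list_b, similarity, threshold=0.75):
    """Greedy matching over all pairs, best similarity first."""
    # Pre-compute all A-B pair scores, then discard those below the threshold.
    pairs = [(similarity(a, b), a, b) for a in list_a for b in list_b]
    pairs = [p for p in pairs if p[0] >= threshold]
    # Order by similarity, highest first.
    pairs.sort(reverse=True)
    matched_a, matched_b, matches = set(), set(), {}
    for score, a, b in pairs:
        # Match only if neither side has been matched yet.
        if a not in matched_a and b not in matched_b:
            matches[a] = b
            matched_a.add(a)
            matched_b.add(b)
    return matches
```

The threshold does reduce memory pressure, but the full n × m score matrix still has to be computed once.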

Did I miss something?

Is there a more efficient algorithm that gives the same output as this improved version?

asked Sep 13 '25 19:09 by BenMorel

2 Answers

Your model could be reformulated in graph terms: consider a complete weighted bipartite graph, where vertices of the first part are names from list A, vertices of the second part are names from list B and edges are weighted with precomputed scores of similarity.

[figure: complete weighted bipartite graph between list A and list B, edges labelled with similarity scores]

Now your problem looks really close to the dense assignment problem, whose optimal solution can be found with the Hungarian algorithm (O(n³) complexity).

If an optimal solution is not your final goal and a good approximation of the optimum would also satisfy your requirements, try heuristic algorithms for the assignment problem; there are brief overviews of them in other topics.
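As a sketch, assuming SciPy is available: `scipy.optimize.linear_sum_assignment` solves exactly this dense assignment problem, and `maximize=True` makes it maximize the total similarity instead of minimizing cost (the `similarity` function and threshold are placeholders from the question):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_match(list_a, list_b, similarity, threshold=0.75):
    """Globally optimal matching of A to B via the Hungarian algorithm."""
    # Build the full similarity matrix (rows = A, columns = B).
    scores = np.array([[similarity(a, b) for b in list_b] for a in list_a])
    # The solver minimizes cost by default; maximize=True maximizes similarity.
    rows, cols = linear_sum_assignment(scores, maximize=True)
    return {
        list_a[i]: list_b[j]
        for i, j in zip(rows, cols)
        if scores[i, j] >= threshold  # keep only sufficiently similar pairs
    }
```

The solver still needs the full score matrix, so it does not avoid the n × m similarity computations, but it guarantees the maximum total score.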

answered Sep 16 '25 20:09 by gimme_danger


Your second algorithm should provide a decent output, but it's not optimal. Check the following case:

Set0 Set1 
A    C
B    D

Similarities:
A-C = 900
A-D = 850
B-C = 850
B-D = 0

Your algorithm's output: [(A,C), (B,D)]. Value 900.
Optimal output: [(A,D), (B,C)]. Value 1700.  

The problem you are working with is exactly the Assignment Problem, which is "finding, in a weighted bipartite graph, a matching in which the sum of weights of the edges is as large as possible". You can find many ways to optimally and efficiently solve this problem.
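For a problem this small, the gap can be verified by brute force over every one-to-one assignment, using the numbers from the example above:

```python
from itertools import permutations

# Similarities from the counterexample above.
sim = {("A", "C"): 900, ("A", "D"): 850, ("B", "C"): 850, ("B", "D"): 0}
set0, set1 = ["A", "B"], ["C", "D"]

# Enumerate every one-to-one assignment and keep the highest-scoring one.
best = max(
    permutations(set1),
    key=lambda perm: sum(sim[(a, b)] for a, b in zip(set0, perm)),
)
matching = list(zip(set0, best))
total = sum(sim[pair] for pair in matching)
print(matching, total)  # [('A', 'D'), ('B', 'C')] 1700
```

Brute force is factorial in the list size, so it is only usable to sanity-check small cases; for real inputs, use a proper assignment-problem solver.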

answered Sep 16 '25 19:09 by chn