Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithms for matching based on keywords intersection

Suppose we have buyers and sellers that are trying to find each other in a market. Buyers can tag their needs with keywords; sellers can do the same for what they are selling. I'm interested in finding algorithm(s) that rank-order sellers in terms of their relevance for a particular buyer on the basis of their two keyword sets.

Here is an example:

buyer_keywords = {"furry", "four legs", "likes catnip", "has claws"} 

and then we have two potential sellers that we need to rank order in terms of their relevance:

seller_keywords[1] = {"furry", "four legs", "arctic circle", "white"}
seller_keywords[2] = {"likes catnip", "furry", 
                      "hates mice", "yarn-lover", "whiskers"}

If we just use the intersection of keywords, we do not get much discrimination: both intersect on 2 keywords. If we divide the intersection count by the size of the set union, seller 2 actually does worse because of the greater number of keywords. This would seem to introduce an automatic penalty for any method not correcting keyword set size (and we definitely don't want to penalize adding keywords).

To put a bit more structure on the problem, suppose we have some truthful measure of intensity of keyword attributes (which have to sum to 1 for each seller), e.g.,:

seller_keywords[1] = {"furry":.05, 
                      "four legs":.05, 
                      "arctic circle":.8, 
                      "white":.1}

seller_keywords[2] = {"likes catnip":.5, 
                      "furry":.4, 
                      "hates mice":.02, 
                      "yarn-lover":.02, 
                      "whiskers":.06}

Now we could sum up the value of hits: so now Seller 1 only gets a score of .1, while Seller 2 gets a score of .9. So far, so good, but now we might get a third seller with a very limited, non-descriptive keyword set:

seller_keywords[3] = {"furry":1}

This catapults them to the top for any hit on their sole keyword, which isn't good.

Anyway, my guess (and hope) is that this is a fairly general problem and that there exist different algorithmic solutions with known strengths and limitations. This is probably something covered in CS101, so I think a good answer to this question might just be a link to the relevant references.

like image 897
John Horton Avatar asked Feb 28 '11 13:02

John Horton


1 Answers

I think you're looking to use cosine similarity; it's a basic technique that gets you pretty far as a first hack. Intuitively, you create a vector where every tag you know about has a particular index:

terms[0] --> aardvark
terms[1] --> anteater
...
terms[N] --> zuckerberg

Then you create vectors in this space for each person:

person1[0] = 0     # this person doesn't care about aardvarks
person1[1] = 0.05  # this person cares a bit about anteaters
...
person1[N] = 0

Each person is now a vector in this N-dimensional space. You can then use cosine similarity to calculate similarity between pairs of them. Calculationally, this is basically the same of asking for the angle between the two vectors. You want a cosine close to 1, which means that the vectors are roughly collinear -- that they have similar values for most of the dimensions.

To improve this metric, you may want to use tf-idf weighting on the elements in your vector. Tf-idf will downplay the importance of popular terms (e.g, 'iPhone') and promote the importance of unpopular terms that this person seems particularly associated with.

Combining tf-idf weighting and cosine similarity does well for most applications like this.

like image 172
Michael Bernstein Avatar answered Sep 26 '22 02:09

Michael Bernstein