
Effective way to calculate a similarity percentage between data sets

I am currently working with User objects, each of which has many Goal objects. The Goal objects are not User-specific; that is, Users can share the same Goal. I am trying to find a way to calculate a "similarity percentage" between two Users (i.e., one that takes into account how many Goals they share as well as how many Goals they do not share). Does anyone have experience with this type of situation? I am using Grails with MySQL, if that is helpful.

Thanks

RyanLynch asked Apr 24 '10 23:04




1 Answer

The standard way to do this is the Jaccard similarity. If A is the set of goals of the first user and B is the set of goals of the second user, the Jaccard similarity is:

#(A intersect B)/#(A union B)

This is the number of goals they share divided by the total number of distinct goals the two have together (counting goals that they share only once). So if the first user has goals A={1,2,3} and the second user has goals B={2,4}, then it is this:

A intersect B = {2}
A union B = {1,2,3,4}

#(A intersect B)/#(A union B) = 1/4

The Jaccard similarity is always between 0 (they share no goals) and 1 (they have exactly the same goals), so you can get a percentage by multiplying it by 100.
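
As a rough illustration of the calculation above, here is a minimal sketch in Python (not Grails, just to show the arithmetic). The function name jaccard_percentage and the use of plain integer goal IDs are assumptions made for this example. The case where neither user has any goals is undefined (0/0) in the formula; returning 0 there is an arbitrary choice.

```python
def jaccard_percentage(goals_a, goals_b):
    """Return the Jaccard similarity of two collections of goal IDs, as a percentage."""
    a, b = set(goals_a), set(goals_b)
    union = a | b
    if not union:
        # Assumption: two users with no goals at all count as 0% similar.
        # The formula itself is undefined here (0 divided by 0).
        return 0.0
    # Shared goals divided by all distinct goals, scaled to a percentage.
    return 100.0 * len(a & b) / len(union)


# The example from the answer: A = {1,2,3}, B = {2,4} gives 1/4.
print(jaccard_percentage({1, 2, 3}, {2, 4}))  # 25.0
```

With goal IDs loaded per user, the same two set operations (intersection and union) are all that is needed, so this should carry over to Groovy collections or to a SQL count query without much change.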

http://en.wikipedia.org/wiki/Jaccard_index

Jules answered Nov 11 '22 07:11
