Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Computing similarity between two lists

EDIT: as everyone is getting confused, I want to simplify my question. I have two ordered lists. Now, I just want to compute how similar one list is to the other.

Eg,

1,7,4,5,8,9
1,7,5,4,9,6

What is a good measure of similarity between these two lists so that order is important. For example, we should penalize similarity as 4,5 is swapped in the two lists?

I have 2 systems. One state of the art system and one system that I implemented. Given a query, both systems return a ranked list of documents. Now, I want to compare the similarity between my system and the "state of the art system" in order to measure the correctness of my system. Please note that the order of documents is important as we are talking about a ranked system. Does anyone know of any measures that can help me find the similarity between these two lists.

like image 796
user1221572 Avatar asked Feb 20 '12 17:02

user1221572


People also ask

How do you find the similarity between two lists?

Using sum() ,zip() and len() This method first compares each element of the two lists and store those as summation of 1, which is then compared with the length of the other list. For this method, we have to first check if the lengths of both the lists are equal before performing this computation.

How do you find the similarities between two lists in Python?

We can club the Python sort() method with the == operator to compare two lists. Python sort() method is used to sort the input lists with a purpose that if the two input lists are equal, then the elements would reside at the same index positions.

How do you find the cosine similarity between two lists?

We use the below formula to compute the cosine similarity. where A and B are vectors: A.B is dot product of A and B: It is computed as sum of element-wise product of A and B. ||A|| is L2 norm of A: It is computed as square root of the sum of squares of elements of the vector A.

How do you find the common number in two lists in Python?

Method 2:Using Set's intersection property Convert the list to set by conversion. Use the intersection function to check if both sets have any elements in common. If they have many elements in common, then print the intersection of both sets.


2 Answers

The DCG [Discounted Cumulative Gain] and nDCG [normalized DCG] are usually a good measure for ranked lists.

It gives the full gain for relevant document if it is ranked first, and the gain decreases as rank decreases.

Using DCG/nDCG to evaluate the system compared to the SOA base line:

Note: If you set all results returned by "state of the art system" as relevant, then your system is identical to the state of the art if they recieved the same rank using DCG/nDCG.

Thus, a possible evaluation could be: DCG(your_system)/DCG(state_of_the_art_system)

To further enhance it, you can give a relevance grade [relevance will not be binary] - and will be determined according to how each document was ranked in the state of the art. For example rel_i = 1/log(1+i) for each document in the state of the art system.

If the value recieved by this evaluation function is close to 1: your system is very similar to the base line.

Example:

mySystem = [1,2,5,4,6,7] stateOfTheArt = [1,2,4,5,6,9] 

First you give score to each document, according to the state of the art system [using the formula from above]:

doc1 = 1.0 doc2 = 0.6309297535714574 doc3 = 0.0 doc4 = 0.5 doc5 = 0.43067655807339306 doc6 = 0.38685280723454163 doc7 = 0 doc8 = 0 doc9 = 0.3562071871080222 

Now you calculate DCG(stateOfTheArt), and use the relevance as stated above [note relevance is not binary here, and get DCG(stateOfTheArt)= 2.1100933062283396
Next, calculate it for your system using the same relecance weights and get: DCG(mySystem) = 1.9784040064803783

Thus, the evaluation is DCG(mySystem)/DCG(stateOfTheArt) = 1.9784040064803783 / 2.1100933062283396 = 0.9375907693942939

like image 95
amit Avatar answered Sep 25 '22 14:09

amit


Kendalls tau is the metric you want. It measures the number of pairwise inversions in the list. Spearman's foot rule does the same, but measures distance rather than inversion. They are both designed for the task at hand, measuring the difference in two rank-ordered lists.

like image 27
James Avatar answered Sep 23 '22 14:09

James