Computing similarity between two lists

EDIT: as everyone is getting confused, I want to simplify my question. I have two ordered lists. Now, I just want to compute how similar one list is to the other.

For example:

1,7,4,5,8,9
1,7,5,4,9,6

What is a good measure of similarity between these two lists that takes order into account? For example, the similarity should be penalized because 4 and 5 are swapped between the two lists.

I have two systems: one state-of-the-art system and one that I implemented. Given a query, both systems return a ranked list of documents. I want to compare my system's output with that of the state-of-the-art system in order to measure the correctness of my system. Please note that the order of documents is important, since these are ranking systems. Does anyone know of any measures that can help me find the similarity between these two lists?

asked Feb 20 '12 17:02

user1221572

2 Answers

DCG [Discounted Cumulative Gain] and nDCG [normalized DCG] are usually good measures for ranked lists.

DCG gives the full gain for a relevant document if it is ranked first, and the gain decreases as the rank decreases.

Using DCG/nDCG to evaluate your system against the state-of-the-art baseline:

Note: if you treat all results returned by the state-of-the-art system as relevant, then your system scores the same as the state of the art under DCG/nDCG when its documents receive the same ranks.

Thus, a possible evaluation could be: DCG(your_system)/DCG(state_of_the_art_system)

To refine this, you can assign a graded [non-binary] relevance to each document, determined by how it was ranked by the state-of-the-art system. For example, rel_i = 1/log2(1+i) for the document at rank i in the state-of-the-art list [the numbers in the example below use base-2 logarithms].

If the value returned by this evaluation function is close to 1, your system is very similar to the baseline.

Example:

mySystem = [1,2,5,4,6,7]
stateOfTheArt = [1,2,4,5,6,9]

First, assign a score to each document according to its rank in the state-of-the-art system [using the formula above]:

doc1 = 1.0
doc2 = 0.6309297535714574
doc3 = 0.0
doc4 = 0.5
doc5 = 0.43067655807339306
doc6 = 0.38685280723454163
doc7 = 0.0
doc8 = 0.0
doc9 = 0.3562071871080222

Now calculate DCG(stateOfTheArt) using the relevance values above [note that relevance is not binary here]. The figures here correspond to DCG = sum over ranks i of rel_i / log2(1+i), which gives DCG(stateOfTheArt) = 2.1100933062283396
Next, calculate it for your system using the same relevance weights: DCG(mySystem) = 1.9784040064803783

Thus, the evaluation is DCG(mySystem)/DCG(stateOfTheArt) = 1.9784040064803783 / 2.1100933062283396 = 0.9375907693942939
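
The worked example above can be reproduced with a short Python sketch. The function names `relevance_from_baseline` and `dcg` are made up for illustration, and the sketch assumes base-2 logarithms, 1-based ranks, and a relevance of 0 for any document the baseline did not return:

```python
import math

def relevance_from_baseline(baseline):
    """Graded relevance: rel = 1 / log2(1 + i) for the doc at 1-based rank i."""
    return {doc: 1.0 / math.log2(1 + i) for i, doc in enumerate(baseline, start=1)}

def dcg(ranking, relevance):
    """DCG = sum of rel(doc) / log2(1 + i) over 1-based ranks i.

    Documents missing from `relevance` contribute 0 (an assumption).
    """
    return sum(
        relevance.get(doc, 0.0) / math.log2(1 + i)
        for i, doc in enumerate(ranking, start=1)
    )

state_of_the_art = [1, 2, 4, 5, 6, 9]
my_system = [1, 2, 5, 4, 6, 7]

rel = relevance_from_baseline(state_of_the_art)
score = dcg(my_system, rel) / dcg(state_of_the_art, rel)
print(score)  # roughly 0.9376, in line with the ratio computed above
```

Because the baseline is scored against its own relevance weights, the ratio is 1 only when both lists put the same documents at the same ranks.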

answered Sep 25 '22 14:09

amit

Kendall's tau is the metric you want. It is based on the number of pairwise inversions between the two lists. Spearman's footrule serves the same purpose, but measures how far each item is displaced rather than counting inversions. Both are designed for the task at hand: measuring the difference between two rank-ordered lists.
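
A minimal Python sketch of both measures, applied to the lists from the question. The helper names are hypothetical. The sketch also assumes that each list has no duplicates and that items appearing in only one list are ignored, with the remaining items re-ranked; that is one possible way to handle lists that do not contain exactly the same items, not the only one.

```python
from itertools import combinations

def common_ranks(a, b):
    """0-based rank of each shared item within each list (unshared items dropped)."""
    shared = set(a) & set(b)
    rank_a = {item: r for r, item in enumerate(x for x in a if x in shared)}
    rank_b = {item: r for r, item in enumerate(x for x in b if x in shared)}
    return rank_a, rank_b

def kendall_inversions(a, b):
    """Number of item pairs ordered one way in a and the other way in b."""
    rank_a, rank_b = common_ranks(a, b)
    return sum(
        1
        for x, y in combinations(rank_a, 2)
        if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) < 0
    )

def kendall_tau(a, b):
    """Tau in [-1, 1]: 1 means identical order, -1 means reversed order."""
    rank_a, _ = common_ranks(a, b)
    pairs = len(rank_a) * (len(rank_a) - 1) // 2
    if pairs == 0:
        return 1.0  # fewer than two shared items: nothing can be inverted
    return 1.0 - 2.0 * kendall_inversions(a, b) / pairs

def spearman_footrule(a, b):
    """Sum of absolute rank differences over the shared items."""
    rank_a, rank_b = common_ranks(a, b)
    return sum(abs(rank_a[x] - rank_b[x]) for x in rank_a)

a = [1, 7, 4, 5, 8, 9]
b = [1, 7, 5, 4, 9, 6]
print(kendall_inversions(a, b), kendall_tau(a, b), spearman_footrule(a, b))
```

For these two lists the shared items are 1, 7, 4, 5 and 9, and the only inverted pair is (4, 5). This pairwise count is quadratic in the list length, which is fine for short ranked lists.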

answered Sep 23 '22 14:09

James

Tags: algorithm, search, statistics, probability, information-retrieval
