I'm using Python for a learning-to-rank problem, and I am evaluating my results using the following DCG and NDCG code (from http://nbviewer.ipython.org/github/ogrisel/notebooks/blob/master/Learning%20to%20Rank.ipynb )
    import numpy as np

    def dcg(relevances, rank=20):
        relevances = np.asarray(relevances)[:rank]
        n_relevances = len(relevances)
        if n_relevances == 0:
            return 0.
        discounts = np.log2(np.arange(n_relevances) + 2)
        return np.sum(relevances / discounts)

    def ndcg(relevances, rank=20):
        best_dcg = dcg(sorted(relevances, reverse=True), rank)
        if best_dcg == 0:
            return 0.
        return dcg(relevances, rank) / best_dcg
Here are the NDCG values for the best- and worst-case orderings of a 3-item list with no duplicate relevances...
>>> ndcg(np.asarray([3,2,1]))
1.0
>>> ndcg(np.asarray([1,2,3]))
0.78999800424603583
We can use this metric to compare the two rankings and see which is better. If I calculate the worst case for a 4-item list, however...
>>> ndcg(np.asarray([1,2,3,4]))
0.74890302967841715
The 4-item list's worst case no longer seems comparable to the 3-item list's.
I have also calculated two alternative NDCGs. NDCG2 compares the achieved DCG to both the best and the worst case...
    def ndcg2(relevances, rank=20):
        best_dcg = dcg(sorted(relevances, reverse=True), rank)
        worst_dcg = dcg(sorted(relevances, reverse=False), rank)
        if best_dcg == worst_dcg:  # avoids dividing by zero when all relevances are equal
            return 0.
        return (dcg(relevances, rank) - worst_dcg) / (best_dcg - worst_dcg)
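To see how the min-max normalization behaves across list lengths, here is a self-contained sketch (`dcg` repeated from above; the guard against equal best and worst DCG is my addition, since an all-equal relevance list would otherwise divide by zero):

```python
import numpy as np

def dcg(relevances, rank=20):
    # Discounted cumulative gain, truncated at `rank` (as in the question)
    relevances = np.asarray(relevances)[:rank]
    if len(relevances) == 0:
        return 0.
    discounts = np.log2(np.arange(len(relevances)) + 2)
    return np.sum(relevances / discounts)

def ndcg2(relevances, rank=20):
    # Min-max normalization: the worst ordering maps to 0, the best to 1
    best_dcg = dcg(sorted(relevances, reverse=True), rank)
    worst_dcg = dcg(sorted(relevances), rank)
    if best_dcg == worst_dcg:  # all relevances equal
        return 0.
    return (dcg(relevances, rank) - worst_dcg) / (best_dcg - worst_dcg)

print(ndcg2([3, 2, 1]))     # best ordering  -> 1.0
print(ndcg2([1, 2, 3]))     # worst ordering -> 0.0
print(ndcg2([1, 2, 3, 4]))  # worst ordering -> 0.0, regardless of list length
```

With this normalization the endpoints are pinned to 0 and 1 for any list length, which is exactly the comparability property the plain NDCG lacks for worst cases.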
NDCG3 shuffles my list of actual relevances 50 times, calculates the DCG of each shuffle, and compares my actual DCG to the average...
    def ndcg3(relevances, rank=20):
        shuffled = np.copy(relevances)
        rands = []
        for i in range(50):
            np.random.shuffle(shuffled)
            rands.append(dcg(shuffled, rank))
        avg_rand_dcg = np.mean(np.asarray(rands))
        return dcg(relevances, rank) / avg_rand_dcg
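For reference, a self-contained, reproducible version of this shuffle baseline; the `n_shuffles` and `seed` parameters are my additions for illustration, not part of the original code:

```python
import numpy as np

def dcg(relevances, rank=20):
    relevances = np.asarray(relevances)[:rank]
    if len(relevances) == 0:
        return 0.
    discounts = np.log2(np.arange(len(relevances)) + 2)
    return np.sum(relevances / discounts)

def ndcg3(relevances, rank=20, n_shuffles=50, seed=None):
    # Ratio of the achieved DCG to the average DCG over random orderings
    rng = np.random.RandomState(seed)
    shuffled = np.copy(relevances)
    rands = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        rands.append(dcg(shuffled, rank))
    return dcg(relevances, rank) / np.mean(rands)

print(ndcg3([3, 2, 1], seed=0))  # >= 1: at least as good as a random ordering
print(ndcg3([1, 2, 3], seed=0))  # <= 1: at most as good as a random ordering
```

Since the best ordering's DCG can never be below the average over shuffles (and the worst ordering's can never exceed it), values above 1 mean "better than random" and values below 1 mean "worse than random".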
Across my various lists, I get the following metrics...
I honestly can't make heads or tails of these results. My NDCG values seem good, but are they really comparable across lists? Do the alternative metrics make more sense?
edit: In my first random comparison, I was not using np.copy(). As a result, my random score was almost always .99. That is now fixed, and the results make more sense.
One thing that may mislead you is the way NDCG is normalized. Usually you have a number of documents to rank, but your NDCG is truncated at a smaller number of documents (for example, NDCG@3). In your code, this is determined by the parameter 'rank'.
Let's say that you want to rank 5 documents with relevances R = [1, 2, 3, 4, 0], and compute NDCG@3. If your algorithm believes that the optimal order is [doc1, doc2, doc3, doc4, doc5], then you will have:
NDCG@3 = DCG([1, 2, 3]) / DCG([4, 3, 2])
and not
NDCG@3 = DCG([1, 2, 3]) / DCG([3, 2, 1]) # Incorrect
So in a sense, NDCG([1, 2, 3]) and NDCG([1, 2, 3, 4]) are not comparable. The numerators are the same, but the denominators are completely different. If you want NDCG to have an intuitive meaning, you have to set 'rank' smaller than or equal to your number of documents.
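To make this concrete, here is a hypothetical `ndcg_at_k` (the name and signature are my own, not from the notebook) that normalizes by the ideal DCG of the *full* document list, truncated at k, rather than by the ideal DCG of the truncated list:

```python
import numpy as np

def dcg(relevances, rank=20):
    relevances = np.asarray(relevances)[:rank]
    if len(relevances) == 0:
        return 0.
    discounts = np.log2(np.arange(len(relevances)) + 2)
    return np.sum(relevances / discounts)

def ndcg_at_k(predicted_relevances, all_relevances, k):
    # Normalize by the ideal DCG of the full list, also truncated at k --
    # not by the ideal DCG of only the top-k retrieved documents.
    ideal = dcg(sorted(all_relevances, reverse=True), k)
    if ideal == 0:
        return 0.
    return dcg(predicted_relevances, k) / ideal

R = [1, 2, 3, 4, 0]                # true relevances of the 5 documents
predicted_order = [1, 2, 3, 4, 0]  # relevances in the predicted ranking
print(ndcg_at_k(predicted_order, R, k=3))  # DCG([1,2,3]) / DCG([4,3,2]) ≈ 0.546
```

Here the denominator is DCG([4, 3, 2]), the best achievable top-3, so a ranker that buries the 4-relevance document is correctly penalized even though it is outside the truncation window.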