Is NDCG (normalized discounted cumulative gain) flawed? I have calculated a few alternative ranking quality measures, and I can't make heads or tails of them

I'm using Python for a learning-to-rank problem, and I'm evaluating my success using the following DCG and NDCG code (from http://nbviewer.ipython.org/github/ogrisel/notebooks/blob/master/Learning%20to%20Rank.ipynb):

import numpy as np

def dcg(relevances, rank=20):
    """Discounted cumulative gain of the first `rank` results."""
    relevances = np.asarray(relevances)[:rank]
    n_relevances = len(relevances)
    if n_relevances == 0:
        return 0.
    discounts = np.log2(np.arange(n_relevances) + 2)
    return np.sum(relevances / discounts)

def ndcg(relevances, rank=20):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    best_dcg = dcg(sorted(relevances, reverse=True), rank)
    if best_dcg == 0:
        return 0.
    return dcg(relevances, rank) / best_dcg

Here are the NDCG values for the best-case and worst-case orderings of a 3-item list with no duplicate relevance values...

>>> ndcg(np.asarray([3,2,1]))
1.0
>>> ndcg(np.asarray([1,2,3]))
0.78999800424603583

We can use this metric to compare the two rankings and see which is better. If I calculate the worst case for a 4-item list, however...

>>> ndcg(np.asarray([1,2,3,4]))
0.74890302967841715

The worst case for the 4-item list no longer seems comparable to that of the 3-item list.
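
For reference, here is a quick check of why that happens, reusing the ndcg above on a few reversed (worst-case) lists; the outputs in the comments are approximate:

# The worst-case ordering is divided by an ideal DCG that grows with list
# length, so the floor of plain NDCG keeps drifting downward.
for worst in ([1, 2, 3], [1, 2, 3, 4], [1, 2, 3, 4, 5]):
    print(len(worst), ndcg(worst))
# 3 0.790
# 4 0.749
# 5 0.722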

I have also calculated two alternative NDCGs. NDCG2 compares the achieved DCG to both the best and worst cases...

def ndcg2(relevances, rank=20):
    best_dcg = dcg(sorted(relevances, reverse=True), rank)
    worst_dcg = dcg(sorted(relevances, reverse=False), rank)
    if best_dcg == worst_dcg:  # avoid dividing by zero when all relevances are equal
        return 0.
    return (dcg(relevances, rank) - worst_dcg) / (best_dcg - worst_dcg)
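
As a quick sanity check (reusing the functions above), ndcg2 pins the worst-case ordering to 0 and the best-case ordering to 1 regardless of list length:

>>> ndcg2(np.asarray([3, 2, 1]))
1.0
>>> ndcg2(np.asarray([1, 2, 3]))
0.0
>>> ndcg2(np.asarray([1, 2, 3, 4]))
0.0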

NDCG3 shuffles my list of actual relevances 50 times, calculates the DCG of each shuffle, and compares my actual DCG to the average of those random DCGs...

def ndcg3(relevances, rank=20):
    shuffled = np.copy(relevances)
    rands = []
    for i in range(50):
        np.random.shuffle(shuffled)
        rands.append(dcg(shuffled, rank))
    avg_rand_dcg = np.mean(np.asarray(rands))
    return dcg(relevances, rank) / avg_rand_dcg
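
Note that ndcg3 is stochastic, so repeated calls on the same list return slightly different values; if you want repeatable numbers, fix the NumPy seed first (a usage sketch, nothing more):

np.random.seed(0)  # make the 50 shuffles repeatable
score = ndcg3(np.asarray([3, 2, 1]))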

Across my various lists, I get the following metrics...

  • NDCG: average is .87 (sounds good)
  • Spearman rank correlation: around .25 (not amazing, but there is something there)
  • NDCG2: .58 (on average, slightly closer to the best DCG than to the worst)
  • NDCG3: 1.04 (slightly better than randomly sorted lists)

I honestly can't make heads or tails of these results. My NDCG values seem good, but are they really comparable across lists? Do the alternative metrics make more sense?

Edit: In my first random comparison, I was not using np.copy(), so my random score was almost always .99. That is now fixed and the results make more sense.

asked Sep 30 '22 by neelshiv

1 Answer

One thing that may mislead you is the way NDCG is normalized. Usually you have a number of documents to rank, but your NDCG is truncated at a smaller number of documents (for example, NDCG@3). In your code, this truncation is determined by the parameter 'rank'.

Let's say that you want to rank 5 documents with relevances R = [1, 2, 3, 4, 0], and compute NDCG@3. If your algorithm believes that the optimal order is [doc1, doc2, doc3, doc4, doc5], then you will have:

NDCG@3 = DCG([1, 2, 3]) / DCG([4, 3, 2])

and not

NDCG@3 = DCG([1, 2, 3]) / DCG([3, 2, 1])   # Incorrect

So in a sense, NDCG([1, 2, 3]) and NDCG([1, 2, 3, 4]) are not comparable: the numerators are essentially the same, but the denominators are completely different. If you want NDCG to have an intuitive meaning, set 'rank' less than or equal to the number of documents you are ranking.
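
Here is a minimal sketch of that distinction, reusing the dcg function from the question (R is just the example list above):

# Relevances of all 5 documents, in the order the algorithm ranked them.
R = [1, 2, 3, 4, 0]

# Correct: normalize by the ideal ordering of the FULL list, truncated to 3.
ndcg_at_3 = dcg(R, 3) / dcg(sorted(R, reverse=True), 3)      # DCG([1, 2, 3]) / DCG([4, 3, 2])

# Incorrect: normalize by re-sorting only the 3 documents that were returned.
wrong = dcg(R, 3) / dcg(sorted(R[:3], reverse=True), 3)      # DCG([1, 2, 3]) / DCG([3, 2, 1])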

answered Oct 03 '22 by Clément Vgnc