Mean average precision at k (computed over the top-k elements of the answer) should, according to the Wikipedia article, ml_metrics on Kaggle, and this answer: Confusion about (Mean) Average Precision, be computed as the mean of per-query average precisions at k, where average precision at k is computed as:
$$ \mathrm{AP@}k = \frac{1}{\min(k, R)} \sum_{i=1}^{k} P(i)\,\mathrm{rel}(i) $$
where $P(i)$ is the precision at cut-off $i$ in the list, $\mathrm{rel}(i)$ is an indicator function equal to 1 if the item at rank $i$ is a relevant document and 0 otherwise, and $R$ is the total number of relevant documents. The divisor $\min(k, R)$ is the maximum possible number of relevant entries in the answer.
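A minimal sketch of this definition (hypothetical helper name; the logic mirrors the ml_metrics `apk` linked below):

```python
def apk(relevant, ranked, k):
    """Average precision at k per the definition above:
    divide by min(k, total number of relevant documents)."""
    score, hits = 0.0, 0
    for i, item in enumerate(ranked[:k]):
        if item in relevant:           # rel(i) == 1
            hits += 1
            score += hits / (i + 1)    # P(i) * rel(i)
    denom = min(k, len(relevant))
    return score / denom if denom else 0.0

# relevant items at ranks 1 and 3 of the top-3, with 5 relevant overall
print(apk({1, 3, 7, 8, 9}, [1, 2, 3], k=3))  # (1/1 + 2/3) / 3 ≈ 0.556
```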
Is this understanding correct?
Is MAP@k always less than MAP computed over the full ranked list?
My concern is that this is not how MAP@k is computed in many works. Typically, the divisor is not min(k, number of relevant documents)
, but the number of relevant documents in the top-k. This approach yields a higher value of MAP@k.
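A toy comparison of the two divisors (a hedged sketch with made-up numbers) shows how the second choice inflates the score:

```python
import numpy as np

# one relevant hit at rank 1 of a top-10 list; 6 relevant docs exist overall
imatch = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)
total_relevant = 6
k = len(imatch)

Px = np.cumsum(imatch) / np.arange(1, k + 1)  # precision at each cut-off
numer = np.sum(Px * imatch)                   # sum of P(i) * rel(i) = 1.0

ap_inflated = numer / np.sum(imatch)             # hits-in-top-k divisor -> 1.0
ap_correct = numer / min(k, total_relevant)      # min(k, R) divisor -> 1/6
print(ap_inflated, ap_correct)
```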
"HashNet: Deep Learning to Hash by Continuation" (ICCV 2017)
Code: https://github.com/thuml/HashNet/blob/master/pytorch/src/test.py#L42-L51
for i in range(query_num):
    label = validation_labels[i, :]
    label[label == 0] = -1
    idx = ids[:, i]
    imatch = np.sum(database_labels[idx[0:R], :] == label, axis=1) > 0
    relevant_num = np.sum(imatch)
    Lx = np.cumsum(imatch)
    Px = Lx.astype(float) / np.arange(1, R+1, 1)
    if relevant_num != 0:
        APx.append(np.sum(Px * imatch) / relevant_num)
Here relevant_num
is not min(k, number of relevant documents)
, but the number of relevant documents in the result, which is not the same as the total number of relevant documents or k. Am I reading the code wrong?
"Deep Visual-Semantic Quantization for Efficient Image Retrieval" (CVPR 2017)
Code: https://github.com/caoyue10/cvpr17-dvsq/blob/master/util.py#L155-L178
def get_mAPs_by_feature(self, database, query):
    ips = np.dot(query.output, database.output.T)
    #norms = np.sqrt(np.dot(np.reshape(np.sum(query.output ** 2, 1), [query.n_samples, 1]), np.reshape(np.sum(database.output ** 2, 1), [1, database.n_samples])))
    #self.all_rel = ips / norms
    self.all_rel = ips
    ids = np.argsort(-self.all_rel, 1)
    APx = []
    query_labels = query.label
    database_labels = database.label
    print "#calc mAPs# calculating mAPs"
    bar = ProgressBar(total=self.all_rel.shape[0])
    for i in xrange(self.all_rel.shape[0]):
        label = query_labels[i, :]
        label[label == 0] = -1
        idx = ids[i, :]
        imatch = np.sum(database_labels[idx[0: self.R], :] == label, 1) > 0
        rel = np.sum(imatch)
        Lx = np.cumsum(imatch)
        Px = Lx.astype(float) / np.arange(1, self.R+1, 1)
        if rel != 0:
            APx.append(np.sum(Px * imatch) / rel)
        bar.move()
    print "mAPs: ", np.mean(np.array(APx))
    return np.mean(np.array(APx))
Here the divisor is rel
, which is computed as np.sum(imatch)
, where imatch
is a binary vector indicating whether each entry is relevant. The problem is that it considers only the first R
entries: imatch = np.sum(database_labels[idx[0: self.R], :] == label, 1) > 0
. So np.sum(imatch)
gives the number of relevant entries in the returned list of size R
, not min(R, total number of relevant entries)
. Note also that the values of R
used in the paper are smaller than the number of entries in the database.
"Deep Learning of Binary Hash Codes for Fast Image Retrieval" (CVPR 2015)
Code: https://github.com/kevinlin311tw/caffe-cvprw15/blob/master/analysis/precision.m#L30-L55
buffer_yes = zeros(K,1);
buffer_total = zeros(K,1);
total_relevant = 0;
for j = 1:K
    retrieval_label = trn_label(y2(j));
    if (query_label==retrieval_label)
        buffer_yes(j,1) = 1;
        total_relevant = total_relevant + 1;
    end
    buffer_total(j,1) = 1;
end
% compute precision
P = cumsum(buffer_yes) ./ Ns';
if (sum(buffer_yes) == 0)
    AP(i) = 0;
else
    AP(i) = sum(P.*buffer_yes) / sum(buffer_yes);
end
Here the divisor is sum(buffer_yes)
, which is the number of relevant documents in the returned list of size K, not min(K, number of relevant documents)
.
"Supervised Learning of Semantics-Preserving Deep Hashing" (TPAMI 2017)
Code: https://github.com/kevinlin311tw/Caffe-DeepBinaryCode/blob/master/analysis/precision.m
The code is the same as in the previous paper.
"Learning Compact Binary Descriptors with Unsupervised Deep Neural Networks" (CVPR 2016)
Same code: https://github.com/kevinlin311tw/cvpr16-deepbit/blob/master/analysis/precision.m#L32-L55
Am I missing something? Is the code in the papers above correct? Why does it not coincide with https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py#L25-L39 ?
I found this closed issue, referring to the same problem: https://github.com/thuml/HashNet/issues/2
Is the following claim correct?
AP is a ranking metric. If the top 2 retrievals in the ranked list are relevant (and only the top 2), AP is 100%. You're talking about Recall, which in this case is indeed 0.2%.
From my understanding, if we treat AP as the area under the PR curve, the claim above is not correct.
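A quick numeric check supports this (toy numbers, not taken from the issue): if only the top 2 of, say, 10 relevant documents are retrieved, AP normalized by the total number of relevant documents is far from 100%.

```python
# ranked list: the first 2 items are relevant, nothing else is retrieved;
# 10 relevant documents exist in total
total_relevant = 10
relevant_ranks = [1, 2]           # ranks at which relevant items appear

# AP = sum of P(i) at each relevant rank, divided by the total relevant count
ap = sum(hits / rank for hits, rank in
         zip(range(1, len(relevant_ranks) + 1), relevant_ranks)) / total_relevant
print(ap)  # (1/1 + 2/2) / 10 = 0.2, i.e. 20%, not 100%
```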
P.S. I was in doubt if this should go to Cross Validated or to StackOverflow. If you think that it is better to place it to Cross Validated I don't mind. My reasoning was that it is not a theoretical question, but implementation one with reference to actual code.
First, we compute AP at an arbitrary threshold k for each query. Then we simply average the AP@k over all queries to get mAP@k.
In the computation of precision@k, we divide by the number of items in the top-k recommendation. If no items are recommended, i.e. the number of recommended items at k is zero, precision@k is undefined, since we cannot divide by zero.
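A minimal sketch of that rule (hypothetical function name; returning None for the undefined case is one possible convention):

```python
def precision_at_k(recommended, relevant, k):
    """Precision@k: fraction of the top-k recommendations that are relevant.
    Returns None when nothing was recommended, since 0/0 is undefined."""
    top_k = recommended[:k]
    if not top_k:
        return None
    return sum(1 for item in top_k if item in relevant) / len(top_k)

print(precision_at_k([4, 2, 7], {2, 7, 9}, k=3))  # 2/3
print(precision_at_k([], {2, 7, 9}, k=3))         # None
```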
The mAP is calculated by finding the average precision (AP) for each class and then averaging over the classes. mAP incorporates the trade-off between precision and recall and considers both false positives (FP) and false negatives (FN), which makes it a suitable metric for most detection applications.
The mean average precision (mAP) score is calculated by taking the mean AP over all classes and/or over all IoU thresholds, depending on the detection challenge. In the PASCAL VOC 2007 challenge, AP for one object class is calculated at an IoU threshold of 0.5.
You are completely right, and well done for finding this. Given the similarity of the code, my guess is that there was one source bug, and then paper after paper copied the bad implementation without examining it closely.
The issue raiser "akturtle" is completely right too; I was going to give the same example. I'm not sure whether "kunhe" understood the argument: of course recall matters when computing average precision.
Yes, the bug should inflate the numbers. I just hope the ranked lists are long enough, and the methods reasonable enough, that they achieve 100% recall within the ranked list, in which case the bug would not affect the results.
Unfortunately, it's hard for reviewers to catch this, as one typically doesn't review the code of papers. It's worth contacting the authors to get them to update the code, update their papers with the correct numbers, or at least avoid repeating the mistake in future work. If you are planning to write a paper comparing different methods, you could point out the problem and report the correct numbers (as well as, potentially, the buggy ones, just to make apples-to-apples comparisons).
To answer your side-question:
Is MAP@k always less than MAP computed over the full ranked list?
Not necessarily: MAP@k is essentially computing the MAP while normalizing for the case where you can't do any better given only k retrievals. E.g., consider a returned ranked list with relevances 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1, and assume there are 6 relevant documents in total. MAP here is about 63%, while MAP@3 = 100%, because you can't do any better than retrieving 1 1 1. But this is unrelated to the bug you discovered: with their bug, MAP@k is guaranteed to be at least as large as the true MAP@k.