Mean average precision at k (computed over the top-k elements of the answer) should, according to the Wikipedia article, ml_metrics on Kaggle, and this answer: Confusion about (Mean) Average Precision, be computed as the mean of per-query average precisions at k, where average precision at k is computed as:
$$ \mathrm{AP@}k = \frac{1}{\min(k, R)} \sum_{i=1}^{k} P(i)\,\mathrm{rel}(i) $$
where $P(i)$ is the precision at cut-off $i$ in the list, $\mathrm{rel}(i)$ is an indicator function equal to 1 if the item at rank $i$ is a relevant document and 0 otherwise, and $R$ is the total number of relevant documents. The divisor $\min(k, R)$ is the maximum possible number of relevant entries in the answer.
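A minimal sketch of this definition (hypothetical helper name; the logic mirrors the ml_metrics `apk` linked below):

```python
def apk(relevant, ranked, k):
    """Average precision at k per the definition above:
    divide by min(k, total number of relevant documents)."""
    score, hits = 0.0, 0
    for i, item in enumerate(ranked[:k]):
        if item in relevant:           # rel(i) == 1
            hits += 1
            score += hits / (i + 1)    # P(i) * rel(i)
    denom = min(k, len(relevant))
    return score / denom if denom else 0.0

# relevant items at ranks 1 and 3 of the top-3, with 5 relevant overall
print(apk({1, 3, 7, 8, 9}, [1, 2, 3], k=3))  # (1/1 + 2/3) / 3 ≈ 0.556
```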
Is this understanding correct?
Is MAP@k always less than MAP computed over the full ranked list?
My concern is that this is not how MAP@k is computed in many works. Typically, the divisor is not min(k, number of relevant documents)
, but the number of relevant documents in the top-k. This approach yields a higher value of MAP@k.
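A toy comparison of the two divisors (a hedged sketch with made-up numbers) shows how the second choice inflates the score:

```python
import numpy as np

# one relevant hit at rank 1 of a top-10 list; 6 relevant docs exist overall
imatch = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)
total_relevant = 6
k = len(imatch)

Px = np.cumsum(imatch) / np.arange(1, k + 1)  # precision at each cut-off
numer = np.sum(Px * imatch)                   # sum of P(i) * rel(i) = 1.0

ap_inflated = numer / np.sum(imatch)             # hits-in-top-k divisor -> 1.0
ap_correct = numer / min(k, total_relevant)      # min(k, R) divisor -> 1/6
print(ap_inflated, ap_correct)
```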
"HashNet: Deep Learning to Hash by Continuation" (ICCV 2017)
Code: https://github.com/thuml/HashNet/blob/master/pytorch/src/test.py#L42-L51
for i in range(query_num):
    label = validation_labels[i, :]
    label[label == 0] = -1
    idx = ids[:, i]
    imatch = np.sum(database_labels[idx[0:R], :] == label, axis=1) > 0
    relevant_num = np.sum(imatch)
    Lx = np.cumsum(imatch)
    Px = Lx.astype(float) / np.arange(1, R+1, 1)
    if relevant_num != 0:
        APx.append(np.sum(Px * imatch) / relevant_num)
Here relevant_num
is not min(k, number of relevant documents)
, but the number of relevant documents in the result, which is not the same as the total number of relevant documents or k. Am I reading the code wrong?
"Deep Visual-Semantic Quantization for Efficient Image Retrieval" (CVPR 2017)
Code: https://github.com/caoyue10/cvpr17-dvsq/blob/master/util.py#L155-L178
def get_mAPs_by_feature(self, database, query):
    ips = np.dot(query.output, database.output.T)
    #norms = np.sqrt(np.dot(np.reshape(np.sum(query.output ** 2, 1), [query.n_samples, 1]), np.reshape(np.sum(database.output ** 2, 1), [1, database.n_samples])))
    #self.all_rel = ips / norms
    self.all_rel = ips
    ids = np.argsort(-self.all_rel, 1)
    APx = []
    query_labels = query.label
    database_labels = database.label
    print "#calc mAPs# calculating mAPs"
    bar = ProgressBar(total=self.all_rel.shape[0])
    for i in xrange(self.all_rel.shape[0]):
        label = query_labels[i, :]
        label[label == 0] = -1
        idx = ids[i, :]
        imatch = np.sum(database_labels[idx[0: self.R], :] == label, 1) > 0
        rel = np.sum(imatch)
        Lx = np.cumsum(imatch)
        Px = Lx.astype(float) / np.arange(1, self.R+1, 1)
        if rel != 0:
            APx.append(np.sum(Px * imatch) / rel)
        bar.move()
    print "mAPs: ", np.mean(np.array(APx))
    return np.mean(np.array(APx))
Here the divisor is rel
, which is computed as np.sum(imatch)
, where imatch
is a binary vector indicating whether each entry is relevant. The problem is that it considers only the first R
entries: imatch = np.sum(database_labels[idx[0: self.R], :] == label, 1) > 0
. So np.sum(imatch)
gives the number of relevant entries in the returned list of size R
, not min(R, total number of relevant entries)
. Note also that the values of R
used in the paper are smaller than the number of entries in the database.
"Deep Learning of Binary Hash Codes for Fast Image Retrieval" (CVPR 2015)
Code: https://github.com/kevinlin311tw/caffe-cvprw15/blob/master/analysis/precision.m#L30-L55
buffer_yes = zeros(K,1);
buffer_total = zeros(K,1);
total_relevant = 0;
for j = 1:K
    retrieval_label = trn_label(y2(j));
    if (query_label==retrieval_label)
        buffer_yes(j,1) = 1;
        total_relevant = total_relevant + 1;
    end
    buffer_total(j,1) = 1;
end
% compute precision
P = cumsum(buffer_yes) ./ Ns';
if (sum(buffer_yes) == 0)
    AP(i) = 0;
else
    AP(i) = sum(P.*buffer_yes) / sum(buffer_yes);
end
Here the divisor is sum(buffer_yes)
, which is the number of relevant documents in the returned list of size K, not min(K, number of relevant documents)
.
"Supervised Learning of Semantics-Preserving Deep Hashing" (TPAMI 2017)
Code: https://github.com/kevinlin311tw/Caffe-DeepBinaryCode/blob/master/analysis/precision.m
The code is the same as in the previous paper.
"Learning Compact Binary Descriptors with Unsupervised Deep Neural Networks" (CVPR 2016)
Same code: https://github.com/kevinlin311tw/cvpr16-deepbit/blob/master/analysis/precision.m#L32-L55
Am I missing something? Is the code in the papers above correct? Why does it not coincide with https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py#L25-L39 ?
I found this closed issue, referring to the same problem: https://github.com/thuml/HashNet/issues/2
Is the following claim correct?
AP is a ranking metric. If the top 2 retrievals in the ranked list are relevant (and only the top 2), AP is 100%. You're talking about Recall, which in this case is indeed 0.2%.
From my understanding, if we treat AP as the area under the PR curve, the claim above is not correct.
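A quick numeric check supports this (toy numbers, not taken from the issue): if only the top 2 of, say, 10 relevant documents are retrieved, AP normalized by the total number of relevant documents is far from 100%.

```python
# ranked list: the first 2 items are relevant, nothing else is retrieved;
# 10 relevant documents exist in total
total_relevant = 10
relevant_ranks = [1, 2]           # ranks at which relevant items appear

# AP = sum of P(i) at each relevant rank, divided by the total relevant count
ap = sum(hits / rank for hits, rank in
         zip(range(1, len(relevant_ranks) + 1), relevant_ranks)) / total_relevant
print(ap)  # (1/1 + 2/2) / 10 = 0.2, i.e. 20%, not 100%
```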
P.S. I was in doubt if this should go to Cross Validated or to StackOverflow. If you think that it is better to place it to Cross Validated I don't mind. My reasoning was that it is not a theoretical question, but implementation one with reference to actual code.
First, we compute AP at an arbitrary threshold k for each query. Then we simply average the AP@k over all queries to get mAP@k.
In the computation of precision@k, we divide by the number of items in the top-k recommendation. If no items are recommended, i.e. the number of recommended items at k is zero, precision@k is undefined, since we cannot divide by zero.
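A minimal sketch of that rule (hypothetical function name; returning None for the undefined case is one possible convention):

```python
def precision_at_k(recommended, relevant, k):
    """Precision@k: fraction of the top-k recommendations that are relevant.
    Returns None when nothing was recommended, since 0/0 is undefined."""
    top_k = recommended[:k]
    if not top_k:
        return None
    return sum(1 for item in top_k if item in relevant) / len(top_k)

print(precision_at_k([4, 2, 7], {2, 7, 9}, k=3))  # 2/3
print(precision_at_k([], {2, 7, 9}, k=3))         # None
```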
The mAP is calculated by finding the average precision (AP) for each class and then averaging over the classes. mAP incorporates the trade-off between precision and recall and considers both false positives (FP) and false negatives (FN), which makes it a suitable metric for most detection applications.
The mean average precision (mAP) score is calculated by taking the mean AP over all classes and/or over all IoU thresholds, depending on the detection challenge. In the PASCAL VOC 2007 challenge, AP for one object class is calculated at an IoU threshold of 0.5.
You are completely right, and well done for finding this. Given the similarity of the code, my guess is that there was one source bug, and then paper after paper copied the bad implementation without examining it closely.
The issue raiser "akturtle" is completely right too; I was going to give the same example. I'm not sure whether "kunhe" understood the argument: of course recall matters when computing average precision.
Yes, the bug should inflate the numbers. I just hope the ranked lists are long enough, and the methods reasonable enough, that they achieve 100% recall within the ranked list, in which case the bug would not affect the results.
Unfortunately, it's hard for reviewers to catch this, as one typically doesn't review the code of papers. It's worth contacting the authors to get them to update the code, update their papers with the correct numbers, or at least avoid repeating the mistake in future work. If you are planning to write a paper comparing different methods, you could point out the problem and report the correct numbers (as well as, potentially, the buggy ones, just to make apples-to-apples comparisons).
To answer your side-question:
Is MAP@k always less than MAP computed over the full ranked list?
Not necessarily: MAP@k is essentially computing the MAP while normalizing for the case where you can't do any better given only k retrievals. E.g., consider a returned ranked list with relevances 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1, and assume there are 6 relevant documents in total. MAP here is about 63%, while MAP@3 = 100%, because you can't do any better than retrieving 1 1 1. But this is unrelated to the bug you discovered: with their bug, MAP@k is guaranteed to be at least as large as the true MAP@k.