In information retrieval evaluation, what would precision@k be if fewer than k documents are retrieved? Let's say only 5 documents were retrieved, of which 3 are relevant. Would the precision@10 be 3/10 or 3/5?
In the computation of precision@k, we divide by the number of items recommended in the top-k recommendation. If no items are recommended, i.e. the number of recommended items at k is zero, we cannot compute precision@k, since we cannot divide by zero.
Precision is defined as P = |X ∩ Y| / |X| and recall as R = |X ∩ Y| / |Y|, where X is the set of retrieved documents and Y the set of relevant documents; both P and R are defined w.r.t. a set of retrieved documents.
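For concreteness, here is a minimal Python sketch of these set-based definitions; the function names and document ids are just for illustration:

```python
# Minimal sketch of the set-based definitions above, where `retrieved` plays
# the role of X and `relevant` the role of Y (names are illustrative).
def precision(retrieved: set, relevant: set) -> float:
    if not retrieved:
        # As noted above, precision is undefined when nothing is retrieved.
        raise ValueError("precision is undefined for an empty retrieved set")
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved: set, relevant: set) -> float:
    if not relevant:
        raise ValueError("recall is undefined when no documents are relevant")
    return len(retrieved & relevant) / len(relevant)

retrieved = {"d1", "d2", "d3", "d4", "d5"}
relevant = {"d1", "d3", "d5", "d7", "d9", "d11"}
print(precision(retrieved, relevant))  # 3/5 = 0.6
print(recall(retrieved, relevant))     # 3/6 = 0.5
```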
Definition. Average precision is a measure that combines recall and precision for ranked retrieval results. For one information need, the average precision is the mean of the precision scores after each relevant document is retrieved.
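A small sketch of that definition in Python; the names are illustrative, and dividing by the total number of relevant documents (so relevant documents never retrieved count as zero) is the common convention, assumed here rather than taken from the quoted text:

```python
# Sketch of average precision for one ranked result list. `ranking` is the
# retrieval order and `relevant` the set of relevant documents.
def average_precision(ranking, relevant):
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at the rank of each relevant doc
    # Divide by the total number of relevant documents (common convention).
    return precision_sum / len(relevant) if relevant else 0.0

# Precisions at the relevant hits are 1/1, 2/3 and 3/5, so AP ≈ 0.756.
print(average_precision(["d1", "d4", "d3", "d6", "d5"], {"d1", "d3", "d5"}))
```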
In other words, k is just the number of articles that you looked at, and precision@k is the percentage of those articles that are relevant to you. If you looked at a second page of results, k would become 20.
It can be hard to find text defining edge cases of measures like this, and the mathematical formulations often don't deal with the incompleteness of data. For issues like this, I tend to turn to the decision made by trec_eval, a tool distributed by NIST that implements all common retrieval measures, especially those used in the Text REtrieval Conference (TREC) challenges.
Per the metric description in m_P.c of trec_eval 9.0 (listed as the latest version on the trec_eval page):
Precision measured at various doc level cutoffs in the ranking.
If the cutoff is larger than the number of docs retrieved, then
it is assumed nonrelevant docs fill in the rest. Eg, if a method
retrieves 15 docs of which 4 are relevant, then P20 is 0.2 (4/20).
Precision is a very nice user oriented measure, and a good comparison
number for a single topic, but it does not average well. For example,
P20 has very different expected characteristics if there are 300
total relevant docs for a topic as opposed to 10.
This means that you should always divide by k, even if fewer than k documents were retrieved, so the precision would be 0.3 instead of 0.6 in your particular case. (This punishes the system for retrieving fewer than k.)
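A short Python sketch of precision@k under the trec_eval convention just described (function and document names are illustrative):

```python
# Sketch of precision@k following the trec_eval convention: always divide by k,
# which implicitly pads a short ranking with nonrelevant documents.
def precision_at_k(ranking, relevant, k):
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / k  # divide by k even when len(ranking) < k

# The question's case: 5 documents retrieved, 3 of them relevant.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d5"}
print(precision_at_k(ranking, relevant, 10))  # 3/10 = 0.3, not 3/5 = 0.6

# The m_P.c example: 15 retrieved, 4 relevant, so P@20 = 4/20 = 0.2.
```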
The other tricky case is when there are fewer than k relevant documents. This is why they note that precision is a helpful measure but does not average well.
Some measures that are more robust to these issues are Normalized Discounted Cumulative Gain (NDCG), which compares the ranking to an ideal ranking (at a cutoff), and the simpler R-Precision, which calculates precision at R, the number of relevant documents for the query, rather than at a fixed k. So one query may calculate P@15 for R=15, and another may calculate P@200 for R=200.
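Minimal sketches of both measures, assuming binary relevance and the common log2 discount for NDCG (names and conventions are illustrative, not from a specific library):

```python
import math

# R-Precision: precision at rank R, where R is the number of relevant documents
# for the query, so the cutoff adapts per query instead of using a fixed k.
def r_precision(ranking, relevant):
    r = len(relevant)
    if r == 0:
        return 0.0
    hits = sum(1 for doc in ranking[:r] if doc in relevant)
    return hits / r

# NDCG@k with binary relevance and a log2 rank discount.
def ndcg_at_k(ranking, relevant, k):
    gains = [1.0 if doc in relevant else 0.0 for doc in ranking[:k]]
    dcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))
    # The ideal ranking places one relevant document at every rank up to min(|relevant|, k).
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

ranking = ["d1", "d4", "d3", "d6", "d5"]
relevant = {"d1", "d3", "d5"}
print(r_precision(ranking, relevant))    # 2/3: two relevant docs in the top 3
print(ndcg_at_k(ranking, relevant, 10))
```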