Can someone help me interpret the AWS Personalize solution version metrics in layman’s terms or, at the very least, tell me what these metrics should ideally look like?
I have no knowledge of Machine Learning and wanted to take advantage of Personalize as it is marketed as a 'no-previous-knowledge-required' ML SaaS. However, the “Solution version metrics” in my solution results seem to require a fairly high level of math knowledge.
My Solution version metrics are as follows:
Normalized discounted cumulative gain
At 5: 0.9881, At 10: 0.9890, At 25: 0.9898
Precision
At 5: 0.1981, At 10: 0.0993, At 25: 0.0399
Mean reciprocal rank
At 25: 0.9833
Research
I have looked through the Personalize Developer's Guide which includes a short definition of each metric on page 72. I also attempted to skim through the Wikipedia articles on discounted cumulative gain and mean reciprocal rank. From reading, this is my interpretation of each metric:
NDCG = Consistency of relevance of recommendations; Is the first recommendation as relevant as the last?
Precision = Relevance of recommendations to user; How relevant are your recommendations to users across the board?
MRR = Relevance of first recommendation in the list versus the others in the list; How relevant is your first recommendation to each user?
If these interpretations are right, then my solution metrics indicate that I am highly consistent about recommending irrelevant content. Is that a valid conclusion?
Alright, my company has Developer Tier Support so I was able to get an answer to this question from AWS.
Answer Summary
The metrics are better the closer they are to '1'. My interpretation of my metrics was pretty much correct but my conclusion was not.
Apparently, these metrics (and Personalize in general) do not take into account how much a user likes an item. Personalize only cares how soon a relevant recommendation gets to the user. This makes sense: if you get to the 25th item in a queue and haven't liked anything you've seen, you are not likely to continue looking.
Given this, what's happening in my solution is that the first-ish recommendation is relevant but none of the others are.
Detailed Answer from AWS
I will start with the relatively easier question first: what are the ideal values for these metrics, so that one solution version can be preferred over another? The answer is that for each metric, higher numbers are better. [1] If you have more than one solution version, prefer the solution version with higher values for these metrics. Please note that you can create a number of solution versions by overriding the default recipe parameters [2] and by using hyperparameters [3].
The second question: how to understand and interpret the metrics for an AWS Personalize solution version? I can confirm from my research that the definitions and interpretations you provided for these metrics are valid.
Before I explain each metric, here is a primer on one of the main concepts in machine learning: how are these metrics calculated? When a solution version is created, the model training step splits the input dataset into two parts, a training dataset (~70%) and a test dataset (~30%). The training dataset is used during model training. Once the model is trained, it is used to predict values for the test dataset, and each prediction is validated against the known (correct) value in the test dataset. [4]
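As a rough illustration of that holdout idea, here is a minimal sketch in Python. The interaction data, the 70/30 ratio, and the split mechanics are all illustrative assumptions, not Personalize's actual internals:

```python
# Illustrative holdout split: shuffle the interactions, then keep ~70%
# for training and hold out ~30% for evaluating predictions.
import random

random.seed(42)  # fixed seed so the split is reproducible
interactions = [("user%d" % (i % 10), "item%d" % i) for i in range(100)]
random.shuffle(interactions)

cut = int(len(interactions) * 0.7)
train, test = interactions[:cut], interactions[cut:]

print(len(train), len(test))  # 70 30
```

The metrics below are then computed by comparing the model's ranked recommendations against the held-out `test` interactions.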
I researched further to find more resources explaining the concepts behind these metrics, and below I elaborate on the example provided in the AWS documentation. [1]
"mean_reciprocal_rank_at_25"
Let’s first understand reciprocal rank. For example, a movie streaming service uses a solution version to predict a list of 5 recommended movies for a specific user: A, B, C, D, E. Once these 5 recommended movies are compared against the movies that user actually liked (in the test dataset), we find that only movies B and E were liked. The reciprocal rank considers only the first relevant (correct according to the test dataset) recommendation, which is movie B at rank 2, and ignores movie E at rank 5. Thus the reciprocal rank is 1/2 = 0.5.
Now let’s expand the above example to understand mean reciprocal rank: [5] assume that we ran predictions for three users and the movies below were recommended.
User 1: A, B, C, D, E (user liked B and E, thus the Reciprocal Rank is 1/2)
User 2: F, G, H, I, J (user liked H and I, thus the Reciprocal Rank is 1/3)
User 3: K, L, M, N, O (user liked K, M and N, thus the Reciprocal Rank is 1)
The mean reciprocal rank is the sum of all the individual reciprocal ranks divided by the total number of prediction queries, which is 3: (1/2 + 1/3 + 1) / 3 = (0.5 + 0.33 + 1) / 3 = 1.83 / 3 ≈ 0.61.
In the AWS Personalize solution version metrics, the mean of the reciprocal ranks of the first relevant recommendation out of the top 25 recommendations, over all queries, is called “mean_reciprocal_rank_at_25”.
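The three-user calculation above can be reproduced in a few lines of Python; the recommendation lists and liked sets are the hypothetical ones from the example:

```python
def reciprocal_rank(recommended, liked):
    """Return 1/rank of the first recommended item the user actually liked."""
    for rank, item in enumerate(recommended, start=1):
        if item in liked:
            return 1.0 / rank
    return 0.0  # no relevant item anywhere in the list

queries = [
    (["A", "B", "C", "D", "E"], {"B", "E"}),       # first hit at rank 2 -> 1/2
    (["F", "G", "H", "I", "J"], {"H", "I"}),        # first hit at rank 3 -> 1/3
    (["K", "L", "M", "N", "O"], {"K", "M", "N"}),   # first hit at rank 1 -> 1
]

mrr = sum(reciprocal_rank(rec, liked) for rec, liked in queries) / len(queries)
print(round(mrr, 2))  # 0.61
```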
"precision_at_K"
Precision can be stated as the capability of a model to deliver the relevant elements with the least number of recommendations. The concept of precision is described in the following free video available on Coursera. [6] A very good article on the same topic can be found here. [7]
Let’s consider the same example: a movie streaming service uses a solution version to predict a list of 5 recommended movies for a specific user: A, B, C, D, E. Once these 5 recommended movies are compared against the movies that user actually liked (correct values in the test dataset), we find that only movies B and E were liked. The precision_at_5 is 2 correctly predicted movies out of 5 total recommendations: 2/5 = 0.4.
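That calculation is short enough to sketch directly, again using the hypothetical recommendation list from the example:

```python
def precision_at_k(recommended, liked, k):
    """Fraction of the top-k recommendations that the user actually liked."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in liked)
    return hits / k

p = precision_at_k(["A", "B", "C", "D", "E"], {"B", "E"}, 5)
print(p)  # 0.4
```

Note that the denominator is always k, which is why the poster's precision drops from ~0.2 at 5 to ~0.04 at 25: roughly one relevant item is being found regardless of list length.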
"normalized_discounted_cumulative_gain_at_K"
This metric uses the concept of the logarithm and the logarithmic scale to assign a weighting factor to relevant items (correct values in the test dataset). A full description of logarithms is beyond the scope of this document; the main objective of using a logarithmic scale is to compress wide-ranging quantities into a small range.
discounted_cumulative_gain_at_K
Let’s consider the same example: a movie streaming service uses a solution version to predict a list of 5 recommended movies for a specific user: A, B, C, D, E. Once these 5 recommended movies are compared against the movies that user actually liked (correct values in the test dataset), we find that only movies B and E were liked. To produce the discounted cumulative gain (DCG) at 5, each relevant item is assigned a weighting factor (using a logarithmic scale) based on its position in the top 5 recommendations. The value produced by this formula is called the “discounted value”.
The formula is 1/log2(1 + position) (the base-2 logarithm is what reproduces the 0.6241 figure below).
As B is at position 2, its discounted value is 1/log2(1 + 2).
As E is at position 5, its discounted value is 1/log2(1 + 5).
The discounted cumulative gain (DCG) is calculated by adding the discounted values for both relevant items: DCG = 1/log2(1 + 2) + 1/log2(1 + 5).
normalized_discounted_cumulative_gain_at_K
First of all, what is the “ideal DCG”? In the above example, the ideal prediction would be B, E, A, C, D, with the relevant items at positions 1 and 2. To produce the ideal DCG at 5, each relevant item is again assigned a weighting factor (using the same logarithmic scale) based on its position in the top 5 recommendations.
The formula is 1/log2(1 + position).
As B is at position 1, its discounted value is 1/log2(1 + 1).
As E is at position 2, its discounted value is 1/log2(1 + 2).
The ideal DCG is calculated by adding the discounted values for both relevant items: ideal DCG = 1/log2(1 + 1) + 1/log2(1 + 2).
The normalized discounted cumulative gain (NDCG) is the DCG divided by the ideal DCG: DCG / ideal DCG = (1/log2(1 + 2) + 1/log2(1 + 5)) / (1/log2(1 + 1) + 1/log2(1 + 2)) ≈ 0.6241
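The whole NDCG calculation can be verified with a short Python sketch, assuming the base-2 logarithm and the same hypothetical lists as above:

```python
import math

def dcg(recommended, liked):
    """Sum 1/log2(1 + position) over the relevant items in the ranked list."""
    return sum(
        1 / math.log2(1 + pos)
        for pos, item in enumerate(recommended, start=1)
        if item in liked
    )

liked = {"B", "E"}
actual = dcg(["A", "B", "C", "D", "E"], liked)  # 1/log2(3) + 1/log2(6)
ideal = dcg(["B", "E", "A", "C", "D"], liked)   # 1/log2(2) + 1/log2(3)
ndcg = actual / ideal
print(round(ndcg, 4))  # 0.6241
```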
I hope the information provided above is helpful in understanding the concept behind these metrics.
[1] https://docs.aws.amazon.com/personalize/latest/dg/working-with-training-metrics.html
[2] https://docs.aws.amazon.com/personalize/latest/dg/customizing-solution-config.html
[3] https://docs.aws.amazon.com/personalize/latest/dg/customizing-solution-config-hpo.html
[4] https://medium.com/@m_n_malaeb/recall-and-precision-at-k-for-recommender-systems-618483226c54
[5] https://www.blabladata.com/2014/10/26/evaluating-recommender-systems/
[6] https://www.coursera.org/lecture/ml-foundations/optimal-recommenders-4EQc2
[7] https://medium.com/@bond.kirill.alexandrovich/precision-and-recall-in-recommender-systems-and-some-metrics-stuff-ca2ad385c5f8