I'm trying to evaluate a model based on its performance on historical sports betting.
I have a dataset that consists of the following columns:
feature1 | ... | featureX | oddsPlayerA | oddsPlayerB | winner
The model will be doing a regression where the output is the odds that playerA wins the match.
It is my understanding that I can use a custom scoring function that returns the "money" the model would have made if it had bet every time a condition was true, and use that value to measure the fitness of the model. The condition would be something like:
    if prediction_player_A_win_odds < oddsPlayerA:
        money += bet_playerA(oddsPlayerA, winner)
    if inverse_odd(prediction_player_A_win_odds) < oddsPlayerB:
        money += bet_playerB(oddsPlayerB, winner)
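For concreteness, here is a minimal sketch of how those bet helpers could work, assuming decimal (European) odds and a fixed one-unit stake. bet_playerA, bet_playerB, and inverse_odd are hypothetical helpers defined purely for illustration, not library functions:

    STAKE = 1.0

    def inverse_odd(odd_a):
        # Assumed meaning: the decimal odds on player B implied by the
        # decimal odds on player A, i.e. odds_B = 1 / (1 - 1/odds_A).
        return 1.0 / (1.0 - 1.0 / odd_a)

    def bet_playerA(odds_a, winner):
        # Profit of a one-unit bet on player A at decimal odds odds_a.
        return STAKE * (odds_a - 1.0) if winner == "A" else -STAKE

    def bet_playerB(odds_b, winner):
        # Profit of a one-unit bet on player B at decimal odds odds_b.
        return STAKE * (odds_b - 1.0) if winner == "B" else -STAKE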
The custom scoring function needs to receive the usual arguments, ground_truth and predictions (where ground_truth is winner[] and predictions is prediction_player_A_win_odds[]), but it also needs the columns oddsPlayerA and oddsPlayerB from the dataset (and here is the problem!).
If the custom scoring function were called with the data in the exact same order as the original dataset, it would be trivial to retrieve this extra data. But in reality, when using cross-validation methods, the data it receives is all mixed up compared to the original.
I've tried the most obvious approach, which was to pass y as [oddsA, oddsB, winner] (shape [n, 3]), but scikit-learn didn't allow it.
So, how can I get data from the dataset into the custom scoring function that is neither X nor y but is still "tied together" in the same order?
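To make the constraint concrete: a scorer built with sklearn's make_scorer only ever receives the ground truth and the predictions for whichever rows the cross-validation split hands it, so there is no argument through which the odds columns could travel (profit_score below is just an illustrative name):

    from sklearn.metrics import make_scorer

    def profit_score(ground_truth, predictions):
        # Both arrays arrive already subset and reordered by the CV split;
        # nothing here carries oddsPlayerA or oddsPlayerB.
        return 0.0  # placeholder

    scorer = make_scorer(profit_score, greater_is_better=True)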
The process that cross_val_score uses is typical for cross-validation and follows these steps:
1. The number of folds is defined (by default, 5).
2. The dataset is split according to these folds, where each fold has a unique set of test data.
3. A model is trained and tested for each fold.
The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset:

    >>> from sklearn import svm
    >>> from sklearn.model_selection import cross_val_score
    >>> clf = svm.SVC(kernel='linear', C=1)
    >>> scores = cross_val_score(clf, X, y, cv=5)
There is no way to actually do this at the moment, sorry. You can write your own loop over the cross-validation folds, which should not be too hard. You cannot do this using GridSearchCV or cross_val_score.
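A minimal sketch of such a manual loop, assuming X, y (the regression target), oddsPlayerA, oddsPlayerB, and winner are NumPy arrays in the same row order, reusing the bet helpers sketched above, and using a RandomForestRegressor purely as a placeholder estimator:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.ensemble import RandomForestRegressor

    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    fold_profits = []
    for train_idx, test_idx in kf.split(X):
        model = RandomForestRegressor()  # placeholder estimator
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[test_idx])

        # Indexing the extra columns with the same test_idx keeps them
        # aligned with this fold's predictions, which is exactly what a
        # scorer called by cross_val_score cannot give you.
        money = 0.0
        for pred, odd_a, odd_b, w in zip(preds, oddsPlayerA[test_idx],
                                         oddsPlayerB[test_idx], winner[test_idx]):
            if pred < odd_a:
                money += bet_playerA(odd_a, w)
            if inverse_odd(pred) < odd_b:
                money += bet_playerB(odd_b, w)
        fold_profits.append(money)

    print("mean profit per fold:", np.mean(fold_profits))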