 

Scikit-learn custom score function needs values from dataset other than X and y

I'm trying to evaluate a model based on its performance on historical sports betting.

I have a dataset that consists of the following columns:

feature1 | ... | featureX | oddsPlayerA | oddsPlayerB | winner

The model will be doing a regression where the output is the odds that player A wins the match.

It is my understanding that I can use a custom scoring function that returns the "money" the model would have made if it placed a bet every time a condition held, and use that value to measure the fitness of the model. The condition would be something like:

if prediction_player_A_win_odds < oddsPlayerA:
    money += bet_playerA(oddsPlayerA, winner)
if inverse_odd(prediction_player_A_win_odds) < oddsPlayerB:
    money += bet_playerB(oddsPlayerB, winner)
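As a sketch of what those helpers could look like, assuming decimal odds and 1-unit stakes (`bet_playerA`, `bet_playerB`, and `inverse_odd` are hypothetical names from the pseudocode, not scikit-learn functions):

```python
def inverse_odd(odd):
    # Hypothetical helper: decimal odds of the complementary outcome.
    # odds -> implied probability p = 1/odd; complement 1-p -> odds 1/(1-p).
    # Assumes odd > 1.
    return 1.0 / (1.0 - 1.0 / odd)

def bet_playerA(oddsPlayerA, winner):
    # 1-unit stake on player A: profit oddsPlayerA - 1 on a win, -1 otherwise.
    return oddsPlayerA - 1.0 if winner == "A" else -1.0

def bet_playerB(oddsPlayerB, winner):
    # 1-unit stake on player B.
    return oddsPlayerB - 1.0 if winner == "B" else -1.0
```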

In the custom scoring function I need to receive the usual arguments like ground_truth and predictions (where ground_truth is winner[] and predictions is prediction_player_A_win_odds[]), but I also need the fields oddsPlayerA and oddsPlayerB from the dataset (and here is the problem!).

If the custom scoring function were called with the data in the exact same order as the original dataset, it would be trivial to retrieve the extra fields needed from the dataset. But in reality, when using cross-validation methods, the data it receives is shuffled relative to the original.

I've tried the most obvious approach, which was to pass the y variable as [oddsA, oddsB, winner] (shape [n, 3]), but scikit-learn didn't allow it.

So, how can I get data from the dataset into the custom scoring function that is neither X nor y but is still "tied together" in the same order?

joaoroque asked Nov 03 '14 00:11



1 Answer

There is no way to actually do this at the moment, sorry. You can write your own loop over the cross-validation folds, which should not be too hard. You cannot do this using GridSearchCV or cross_val_score.
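A minimal sketch of such a manual loop, keeping the odds columns in arrays aligned row-for-row with X and y so that the same fold indices slice every array consistently (the profit rule and all names here are illustrative assumptions, not part of the original answer):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

def profit_score(pred_odds_a, odds_a, odds_b, winner):
    """Total profit from 1-unit bets placed whenever the model's
    predicted odds for player A beat the bookmaker's line.
    Assumes decimal odds and predicted odds > 1."""
    money = 0.0
    for p, oa, ob, w in zip(pred_odds_a, odds_a, odds_b, winner):
        if p < oa:  # model thinks A is more likely than the line implies
            money += (oa - 1.0) if w == "A" else -1.0
        inv = 1.0 / (1.0 - 1.0 / p)  # complementary decimal odds
        if inv < ob:
            money += (ob - 1.0) if w == "B" else -1.0
    return money

def cv_profit(model, X, y, odds_a, odds_b, winner, n_splits=5):
    # X, y, odds_a, odds_b, winner are aligned row-for-row, so indexing
    # all of them with the same fold indices keeps them "tied together".
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append(profit_score(pred, odds_a[test_idx],
                                   odds_b[test_idx], winner[test_idx]))
    return scores
```

Because you control the fold indices yourself, the extra columns never go through scikit-learn's scorer machinery at all, which sidesteps the restriction on y's shape.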

Andreas Mueller answered Nov 02 '22 23:11