
Does sklearn support a cost matrix?

Is it possible to train classifiers in sklearn with a cost matrix that assigns different costs to different mistakes? For example, in a 2-class problem the cost matrix would be a 2 by 2 matrix A, where A_ij is the cost of classifying class i as class j.
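To make the notation concrete, here is a minimal sketch of how such a cost matrix could be combined with a confusion matrix to score a set of predictions. The cost values and the toy labels are made up for illustration; only `confusion_matrix` is an actual scikit-learn function.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical cost matrix: A[i, j] = cost of classifying true class i as class j.
A = np.array([[0.0, 1.0],
              [5.0, 0.0]])

# Toy labels, made up for illustration.
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 0, 0])

# C[i, j] = number of samples of true class i that were predicted as class j.
C = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Total cost: each cell count weighted by the cost of that kind of outcome.
total_cost = float((A * C).sum())
```

This only evaluates predictions after the fact; it does not change how a classifier is trained.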

The main classifier I am using is a Random Forest.

Thanks.

user3897638 asked Aug 01 '14 00:08


3 Answers

The cost-sensitive framework you describe is not supported in scikit-learn, in any of the classifiers we have.

Gilles Louppe answered Oct 21 '22 14:10


You could use a custom scoring function that accepts a matrix of per-class or per-instance costs. Here's an example of a scorer that calculates per-instance misclassification cost:

import pandas as pd

def financial_loss_scorer(y, y_pred, **kwargs):
    # Per-instance amounts, indexed like y (y must be a pandas Series)
    totals = kwargs['totals']

    # Create an indicator column: 0 if correct, 1 otherwise
    errors = pd.DataFrame((y != y_pred).astype(int).rename('Result'))
    # Join each prediction to its amount on the shared index
    results = errors.merge(totals, left_index=True, right_index=True, how='inner')
    # Per-prediction loss: the amount at stake for each misclassified instance
    loss = results.Result * results.SumNetAmount

    return loss.sum()

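The scorer is then built with make_scorer:

from sklearn.metrics import make_scorer

make_scorer(financial_loss_scorer, totals=totals_data, greater_is_better=False)

Here totals_data is a pandas.DataFrame whose index matches the training set's index and which holds the SumNetAmount column used by the scorer.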
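As a rough end-to-end sketch, the scorer can be passed to cross-validation as shown here. The synthetic data, the random amounts in `totals_data`, and the forest settings are assumptions for illustration only; the scorer function is repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score


def financial_loss_scorer(y, y_pred, **kwargs):
    totals = kwargs['totals']
    # 0 if the prediction is correct, 1 otherwise, indexed like y
    errors = pd.DataFrame((y != y_pred).astype(int).rename('Result'))
    results = errors.merge(totals, left_index=True, right_index=True, how='inner')
    return (results.Result * results.SumNetAmount).sum()


# Synthetic data (assumption). y is a pandas Series so its index survives the CV split.
X_arr, y_arr = make_classification(n_samples=200, n_features=8, random_state=0)
X = pd.DataFrame(X_arr)
y = pd.Series(y_arr)

# Hypothetical per-instance amounts, indexed like the training set.
rng = np.random.default_rng(0)
totals_data = pd.DataFrame(
    {'SumNetAmount': rng.uniform(1.0, 100.0, size=len(y))}, index=y.index
)

# greater_is_better=False makes scikit-learn negate the loss, so higher is still better.
loss_scorer = make_scorer(financial_loss_scorer, greater_is_better=False, totals=totals_data)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X, y, cv=3, scoring=loss_scorer)
```

Each entry of `scores` is the negated total loss on one validation fold. The scorer only affects how models are evaluated and compared, not how the forest itself is fit.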

cdlm answered Oct 21 '22 15:10


You could always just look at your ROC curve. Each point on the ROC curve corresponds to a different confusion matrix, so choosing a classifier threshold, and with it a confusion matrix, implies some sort of cost weighting scheme. You then just have to choose the threshold whose confusion matrix matches the cost matrix you are looking for.
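One way to sketch this idea is to sweep the decision threshold on held-out data and keep the one with the lowest total cost. The cost matrix, the synthetic data, and the threshold grid below are assumptions for illustration, not something prescribed by scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical cost matrix: A[i, j] = cost of classifying true class i as class j.
A = np.array([[0.0, 1.0],
              [5.0, 0.0]])

# Synthetic data (assumption), split so the threshold is chosen on held-out samples.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
p1 = clf.predict_proba(X_val)[:, 1]  # estimated probability of class 1


def cost_at(threshold):
    # Confusion matrix at this threshold, weighted element-wise by the costs.
    y_hat = (p1 >= threshold).astype(int)
    C = confusion_matrix(y_val, y_hat, labels=[0, 1])
    return float((A * C).sum())


thresholds = np.linspace(0.0, 1.0, 101)
costs = np.array([cost_at(t) for t in thresholds])
best_threshold = thresholds[costs.argmin()]
```

Each threshold corresponds to one point on the ROC curve, so this is the same selection expressed in terms of cost rather than true and false positive rates.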

On the other hand, if you really have your heart set on it and want to "train" an algorithm using a cost matrix, you could "sort of" do it in sklearn.

Although it is impossible to directly train an algorithm to be cost-sensitive in sklearn, you could use a cost-matrix-based score to tune your hyper-parameters. I've done something similar using a genetic algorithm. It doesn't do a great job, but it should give a modest boost to performance.
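A minimal sketch of cost-based hyper-parameter tuning follows. It uses a plain grid search rather than the genetic algorithm mentioned above, and the cost matrix, data, and parameter grid are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV

# Hypothetical cost matrix: A[i, j] = cost of classifying true class i as class j.
A = np.array([[0.0, 1.0],
              [5.0, 0.0]])


def total_cost(y_true, y_pred):
    # Weight each confusion-matrix cell by its cost and sum.
    C = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return float((A * C).sum())


# Lower cost is better, so the scorer negates it for model selection.
cost_scorer = make_scorer(total_cost, greater_is_better=False)

# Synthetic, imbalanced data (assumption).
X, y = make_classification(n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0)

# Illustrative grid; any hyper-parameters of the estimator could be searched.
param_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    scoring=cost_scorer,
    cv=3,
)
search.fit(X, y)
best_params = search.best_params_
```

The trees are still grown with their usual split criterion; the cost matrix only decides which hyper-parameter setting wins.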

Ryan answered Oct 21 '22 14:10