Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to do regression as opposed to classification using logistic regression and scikit learn

The target variable that I need to predict are probabilities (as opposed to labels). The corresponding column in my training data are also in this form. I do not want to lose information by thresholding the targets to create a classification problem out of it.

If I train the logistic regression classifier with binary labels, sk-learn logistic regression API allows obtaining the probabilities at prediction time. However, I need to train it with probabilities. Is there a way to do this in scikits-learn, or a suitable Python package that scales to 100K data points of 1K dimension.

like image 566
san Avatar asked Oct 28 '22 22:10

san


1 Answers

I want the regressor to use the structure of the problem. One such structure is that the targets are probabilities.

You can't have cross-entropy loss with non-indicator probabilities in scikit-learn; this is not implemented and not supported in API. It is a scikit-learn's limitation.

In general, according to scikit-learn's docs a loss function is of the form Loss(prediction, target), where prediction is the model's output, and target is the ground-truth value.

In the case of logistic regression, prediction is a value on (0,1) (i.e., a "soft label"), while target is 0 or 1 (i.e., a "hard label").


For logistic regression you can approximate probabilities as target by oversampling instances according to probabilities of their labels. e.g. if for given sample class_1 has probability 0.2, and class_2 has probability0.8, then generate 10 training instances (copied sample): 8 withclass_2as "ground truth target label" and 2 withclass_1`.

Obviously it is workaround and is not extremely efficient, but it should work properly.

If you're ok with upsampling approach, you can pip install eli5, and use eli5.lime.utils.fit_proba with a Logistic Regression classifier from scikit-learn.


Alternative solution is to implement (or find implementation?) of LogisticRegression in Tensorflow, where you can define loss function as you like it.


In compiling this solution I worked using answers from scikit-learn - multinomial logistic regression with probabilities as a target variable and scikit-learn classification on soft labels. I advise those for more insight.

like image 144
jo9k Avatar answered Nov 01 '22 18:11

jo9k