The target variable that I need to predict is a probability (as opposed to a label), and the corresponding column in my training data is also in this form. I do not want to lose information by thresholding the targets to turn this into a classification problem.

If I train a logistic regression classifier with binary labels, the scikit-learn API lets me obtain probabilities at prediction time. However, I need to *train* it with probabilities. Is there a way to do this in scikit-learn, or another Python package that scales to 100K data points of 1K dimensions?

I want the regressor to use the structure of the problem. One such structure is that the targets are probabilities.
You can't use cross-entropy loss with non-indicator probabilities in scikit-learn; this is not implemented and not supported by the API. It is a scikit-learn limitation.
In general, according to scikit-learn's docs, a loss function has the form `Loss(prediction, target)`, where `prediction` is the model's output and `target` is the ground-truth value. In the case of logistic regression, `prediction` is a value in (0, 1) (i.e., a "soft label"), while `target` is 0 or 1 (i.e., a "hard label").
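A minimal demonstration of that limitation, with toy data of my own (the exact error message may vary across scikit-learn versions, but a `ValueError` is raised):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0]])
y_proba = np.array([0.2, 0.5, 0.9])  # soft targets, not 0/1 labels

try:
    LogisticRegression().fit(X, y_proba)  # classifier expects hard labels
except ValueError as err:
    print(err)  # e.g. "Unknown label type: continuous ..."
```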
For logistic regression you can approximate probability targets by oversampling instances according to the probabilities of their labels. E.g., if for a given sample `class_1` has probability 0.2 and `class_2` has probability 0.8, then generate 10 training instances (copies of the sample): 8 with `class_2` as the "ground truth target label" and 2 with `class_1`.
Obviously this is a workaround and not especially efficient, but it should work properly.
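The oversampling trick above can be sketched as follows. The helper `expand_soft_labels` and the toy data are my own; it assumes binary targets given as the probability of class 1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def expand_soft_labels(X, p, n_copies=10):
    """Replicate each sample n_copies times, assigning hard labels
    in proportion to its target probability p (probability of class 1)."""
    n_pos = np.round(p * n_copies).astype(int)  # copies labeled 1
    X_rep, y_rep = [], []
    for x_i, k in zip(X, n_pos):
        X_rep.append(np.tile(x_i, (n_copies, 1)))
        y_rep.append(np.array([1] * k + [0] * (n_copies - k)))
    return np.vstack(X_rep), np.concatenate(y_rep)

# toy data: 1-D feature, probability targets
X = np.array([[0.0], [1.0], [2.0], [3.0]])
p = np.array([0.1, 0.3, 0.7, 0.9])

X_up, y_up = expand_soft_labels(X, p)        # 40 hard-labeled copies
clf = LogisticRegression().fit(X_up, y_up)
print(clf.predict_proba(X)[:, 1])            # roughly tracks p
```

The resolution of the approximation is set by `n_copies`: with 10 copies per sample you can only represent probabilities to the nearest 0.1.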
If you're OK with the upsampling approach, you can `pip install eli5` and use `eli5.lime.utils.fit_proba` with a `LogisticRegression` classifier from scikit-learn.
An alternative solution is to implement (or find an implementation of) logistic regression in TensorFlow, where you can define the loss function however you like.
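If pulling in TensorFlow feels like overkill, the same idea (cross-entropy against soft targets) fits in a few lines of NumPy, since the gradient of the cross-entropy, `X.T @ (q - p)`, has the same form as with hard labels. The helper names and toy data below are my own sketch, not a library API:

```python
import numpy as np

def fit_soft_logreg(X, p, lr=0.5, n_iter=5000):
    """Fit logistic regression to probability targets p by minimising
    cross-entropy  -sum(p*log(q) + (1-p)*log(1-q))  with gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        q = 1.0 / (1.0 + np.exp(-Xb @ w))      # predicted probabilities
        w -= lr * Xb.T @ (q - p) / len(X)      # gradient step
    return w

def predict_proba(X, w):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

# toy data: 1-D feature, probability targets
X = np.array([[0.0], [1.0], [2.0], [3.0]])
p = np.array([0.1, 0.3, 0.7, 0.9])

w = fit_soft_logreg(X, p)
print(predict_proba(X, w))  # close to p
```

Unlike the upsampling workaround, this uses the soft targets exactly, with no rounding to a fixed number of copies.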
In compiling this solution I used the answers from scikit-learn - multinomial logistic regression with probabilities as a target variable and scikit-learn classification on soft labels. I recommend those for more insight.