I was wondering how the final model (i.e. decision boundary) of LogisticRegressionCV in sklearn was calculated. So say I have some Xdata and ylabels such that
Xdata # shape of this is (n_samples,n_features)
ylabels # shape of this is (n_samples,), and it is binary
and now I run
from sklearn.linear_model import LogisticRegressionCV
clf = LogisticRegressionCV(Cs=[1.0],cv=5)
clf.fit(Xdata,ylabels)
This is looking at just one regularization parameter and 5 folds in the CV. So clf.scores_ will be a dictionary with one key, whose value is an array of shape (n_folds, 1). With these five folds you can get a better idea of how the model performs.
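For reference, the setup described above can be reproduced with a small sketch. The data here is synthetic (make_classification is only a stand-in for Xdata and ylabels, which are not shown in the question), so the printed scores are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in for Xdata / ylabels (an assumption, not the asker's data).
Xdata, ylabels = make_classification(
    n_samples=200, n_features=5, n_informative=5, n_redundant=0, random_state=0
)

clf = LogisticRegressionCV(Cs=[1.0], cv=5)
clf.fit(Xdata, ylabels)

# scores_ is a dict; for a binary problem it holds a single entry.
fold_scores = next(iter(clf.scores_.values()))
print(list(clf.scores_.keys()))   # the single key
print(fold_scores.shape)          # (n_folds, n_Cs)
print(fold_scores.mean(axis=0))   # average score per C across the folds
```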
However, I'm confused about what you get from clf.coef_ (and I'm assuming the parameters in clf.coef_ are the ones used in clf.predict). I have a few options I think it could be:
1. clf.coef_ are from training the model on all the data
2. clf.coef_ are from the best scoring fold
3. clf.coef_ are averaged across the folds in some way.
I imagine this is a duplicate question, but for the life of me I can't find a straightforward answer online, in the sklearn documentation, or in the source code for LogisticRegressionCV. Some relevant posts I found are:
You are confusing hyper-parameters with parameters. All scikit-learn estimators whose names end in CV, like LogisticRegressionCV, GridSearchCV, or RandomizedSearchCV, tune the hyper-parameters.
Hyper-parameters are not learnt from training on the data. They are set prior to learning assuming that they will contribute to optimal learning. More information is present here:
Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes. Typical examples include C, kernel and gamma for Support Vector Classifier, alpha for Lasso, etc.
In the case of LogisticRegression, C is a hyper-parameter which describes the inverse of regularization strength. The higher the C, the less regularization is applied during training. C is not changed during training; it stays fixed.
Now coming to coef_. coef_ contains the coefficients (also called weights) of the features, which are learnt (and updated) during training. Depending on the value of C (and the other hyper-parameters passed to the constructor), the learnt values will differ.
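As a small illustration of that distinction, the sketch below fits plain LogisticRegression twice with different C values on synthetic data (an assumption made only for the example). C stays exactly as passed, while coef_ is whatever the fit learnt under that C.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration.
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=5, n_redundant=0, random_state=0
)

weak_reg = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)   # little regularization
strong_reg = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)  # heavy regularization

# Hyper-parameter: unchanged by fit().
print(weak_reg.C, strong_reg.C)

# Parameters: learnt by fit(); stronger regularization shrinks the weights.
print(np.linalg.norm(weak_reg.coef_), np.linalg.norm(strong_reg.coef_))
```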
Now there is another topic on how to get the optimum initial values of coef_, so that the training is faster and better. That's optimization. Some start with random weights between 0 and 1, others start with 0, etc. But for the scope of your question, that is not relevant. LogisticRegressionCV is not used for that.
This is what LogisticRegressionCV does:
1. Get the values of C from the constructor (in your example you passed 1.0).
2. For each value of C, do the cross-validation of the supplied data, in which the LogisticRegression will be fit() on the training data of the current fold and scored on the test data. The scores from the test data of all folds are averaged, and that becomes the score of the current C. This is done for all the C values you provided, and the C with the highest average score is chosen.
3. The chosen C is set as the final C, and LogisticRegression is trained again (by calling fit()) on the whole data (Xdata, ylabels here).
That's what all the hyper-parameter tuners do, be it GridSearchCV, LogisticRegressionCV, LassoCV, etc.
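The steps above can be sketched by hand with plain LogisticRegression and cross_val_score. This is only an illustration of the procedure as described, not LogisticRegressionCV's actual internals (which, for instance, may warm-start between fits); the data and the list of candidate C values are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data and candidate C values, both assumptions for this sketch.
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=5, n_redundant=0, random_state=0
)
Cs = [0.01, 1.0, 100.0]

# Step 2: for each C, fit on the training folds, score on the held-out fold,
# and average the fold scores.
mean_scores = [
    cross_val_score(LogisticRegression(C=C, max_iter=1000), X, y, cv=5).mean()
    for C in Cs
]
best_C = Cs[int(np.argmax(mean_scores))]

# Step 3: train once more on the whole data with the chosen C.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X, y)

print(best_C)
print(final_model.coef_)
```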
The initializing and updating of the coef_ feature weights is done inside the fit() function of the algorithm, which is out of scope for the hyper-parameter tuning. That optimization part depends on the estimator's internal optimization algorithm, for example the solver param in the case of LogisticRegression.
Hope this makes things clear. Feel free to ask if you still have any doubts.
You have the parameter refit=True by default. In the docs you can read:
If set to True, the scores are averaged across all folds, and the coefs and the C that corresponds to the best score is taken, and a final refit is done using these parameters. Otherwise the coefs, intercepts and C that correspond to the best scores across folds are averaged.
So if refit=True the CV model is retrained using all the data.
When it says the final refit is done using these parameters, it is talking about the C regularization parameter. So it uses the C that gives the best average score across the K folds.
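One way to check this is to compare coef_ from LogisticRegressionCV (default refit=True) against a plain LogisticRegression trained on all the data with the same C. The sketch below does that on synthetic data; the tighter tol and larger max_iter are choices made here so both solvers converge closely, not something the question used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

# Synthetic data, used only for illustration.
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=5, n_redundant=0, random_state=0
)

# refit=True is the default: best C is chosen, then the model is fit on all data.
clf = LogisticRegressionCV(Cs=[1.0], cv=5, tol=1e-6, max_iter=1000).fit(X, y)

# Reference: the same C, trained directly on the whole data.
ref = LogisticRegression(C=1.0, tol=1e-6, max_iter=1000).fit(X, y)

# The two coefficient vectors should agree up to solver tolerance.
max_diff = np.max(np.abs(clf.coef_ - ref.coef_))
print(max_diff)
```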
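When refit=False no final refit is done. Instead, as the documentation quoted above says, the best C is picked separately for each fold, and the coefs, intercepts and C of those per-fold best models are averaged.
So if you trained 5 folds, coef_ is the average of 5 coefficient vectors, each trained on 4 folds of data, rather than the single model that gave the best score on its fold test set.
I agree that the documentation here is not very clear, and averaging C values and coefficients may not seem to make much sense, but that is what it describes.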
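What refit=False actually returns can be checked against the per-fold coefficients that LogisticRegressionCV stores in coefs_paths_. The sketch below uses synthetic data and a single C, so every fold's "best" C is the same one; it assumes the dictionary layout of coefs_paths_ with shape (n_folds, n_Cs, n_features + 1), which may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic data, used only for illustration.
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=5, n_redundant=0, random_state=0
)

clf = LogisticRegressionCV(Cs=[1.0], cv=5, refit=False).fit(X, y)

# Per-fold coefficients; the last column holds the intercept.
paths = next(iter(clf.coefs_paths_.values()))
fold_mean_coef = paths[:, 0, :-1].mean(axis=0)
fold_mean_intercept = paths[:, 0, -1].mean()

# Compare the reported model with the average over the folds.
print(np.allclose(clf.coef_.ravel(), fold_mean_coef))
print(np.allclose(clf.intercept_, fold_mean_intercept))
```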