 

sklearn.linear_model.LogisticRegression returns different coefficients every time although random_state is set

I'm fitting a logistic regression model and am setting the random state to a fixed value.

Every time I call fit I get different coefficients. For example:

classifier_instance.fit(train_examples_features, train_examples_labels)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=1, tol=0.0001)

>>> classifier_instance.raw_coef_
array([[ 0.071101940040772596  ,  0.05143724979709707323,  0.071101940040772596  , -0.04089477198935181912, -0.0407380696457252528 ,  0.03622160087086594843,  0.01055345545606742319,
         0.01071861708285645406, -0.36248634699444892693, -0.06159019047096317423,  0.02370064668025737009,  0.02370064668025737009, -0.03159781822495803805,  0.11221150783553821006,
         0.02728295348681779309,  0.071101940040772596  ,  0.071101940040772596  ,  0.                    ,  0.10882033432637286396,  0.64630314505709030026,  0.09617956519989406816,
         0.0604133873444507169 ,  0.                    ,  0.04111685986987245051,  0.                    ,  0.                    ,  0.18312324521915510078,  0.071101940040772596  ,
         0.071101940040772596  ,  0.                    , -0.59561802045324663268, -0.61490898457874587635,  1.07812569991461248975,  0.071101940040772596  ]])

classifier_instance.fit(train_examples_features, train_examples_labels)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=1, tol=0.0001)

>>> classifier_instance.raw_coef_
array([[ 0.07110193825129411394,  0.05143724970282205489,  0.07110193825129411394, -0.04089477178162870957, -0.04073806899140903354,  0.03622160048165772028,  0.010553455400928528  ,
         0.01071860364222424096, -0.36248635488413910588, -0.06159021545062405567,  0.02370064608376460866,  0.02370064608376460866, -0.03159783710841745225,  0.11221149816037970237,
         0.02728295411479400578,  0.07110193825129411394,  0.07110193825129411394,  0.                    ,  0.10882033461822394893,  0.64630314701686075729,  0.09617956493834901865,
         0.06041338563697066372,  0.                    ,  0.04111676713793514099,  0.                    ,  0.                    ,  0.18312324401049043243,  0.07110193825129411394,
         0.07110193825129411394,  0.                    , -0.59561803345113684127, -0.61490899867901249731,  1.07812569539027203191,  0.07110193825129411394]])

I'm using version 0.14. The docs state: "The underlying C implementation uses a random number generator to select features when fitting the model. It is thus not uncommon, to have slightly different results for the same input data. If that happens, try with a smaller tol parameter."

I thought that setting the random state would guarantee there is no randomness, but apparently this is not the case. Is this a bug or desired behavior?
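For reference, a minimal self-contained sketch of the comparison; the make_classification data and the solver='liblinear' argument are my own stand-ins (liblinear is the only backend in 0.14, where the constructor has no solver argument), not the asker's setup:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for train_examples_features / train_examples_labels
X, y = make_classification(n_samples=200, n_features=34, random_state=0)

# solver='liblinear' mirrors the 0.14 backend; newer versions default to
# other solvers that do not show this behavior
clf = LogisticRegression(C=1.0, penalty='l2', tol=0.0001, random_state=1,
                         solver='liblinear')

first = clf.fit(X, y).coef_.copy()
second = clf.fit(X, y).coef_.copy()

# With liblinear the two runs can differ in the trailing decimals
print(np.max(np.abs(first - second)))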

asked Jun 26 '14 by jonathans




2 Answers

It's not really desired, but it's a known issue that is very hard to fix. The thing is that LogisticRegression models are trained with Liblinear, which does not allow setting its random seed in a completely robust way. When you explicitly set the random_state, a best effort is made to set Liblinear's random seed, but that may fail.
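A sketch of the workaround the 0.14 docs themselves suggest (quoted in the question): tighten tol so liblinear runs closer to convergence, which should shrink, though not necessarily eliminate, the run-to-run differences. The synthetic data below is a placeholder, not the asker's:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=34, random_state=0)

# Much smaller tol than the default 1e-4, as the docs recommend
clf = LogisticRegression(C=1.0, penalty='l2', tol=1e-8, random_state=1,
                         solver='liblinear')

coefs = [clf.fit(X, y).coef_.copy() for _ in range(2)]
print(np.max(np.abs(coefs[0] - coefs[1])))  # should now be negligible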

answered Sep 28 '22 by Fred Foo


I was baffled by the problem as well, but eventually found that it was also necessary to call numpy.random.seed() to set the state of numpy's internal RNG, in addition to passing random_state.

This was tested with sklearn 0.13.1.
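A sketch of that workaround, seeding numpy's global RNG before each fit in addition to passing random_state; the data is synthetic, and whether this fully removes the differences may depend on the sklearn version:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=34, random_state=0)

def fit_coefs():
    # Seed numpy's global RNG as well, per this answer
    np.random.seed(1)
    clf = LogisticRegression(C=1.0, penalty='l2', random_state=1)
    return clf.fit(X, y).coef_.copy()

print(np.array_equal(fit_coefs(), fit_coefs()))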

answered Sep 28 '22 by Marcus Gröber