For some reason the order of the covariates seems to matter with a LogisticRegression
classifier in scikit-learn, which seems odd to me. I have 9 covariates and a binary output, and when I change the order of the columns, call fit(), and then call predict_proba(), the output is different. Toy example below:
logit_model = LogisticRegression(C=1e9, tol=1e-15)
The following
logit_model.fit(df[['column_2','column_1']], df['target'])
logit_model.predict_proba(df[['column_2','column_1']])
array([[ 0.27387109, 0.72612891] ..])
Gives a different result to:
logit_model.fit(df[['column_1','column_2']], df['target'])
logit_model.predict_proba(df[['column_1','column_2']])
array([[ 0.26117794, 0.73882206], ..])
This seems surprising to me, but maybe that's just my lack of knowledge about the internals of the algorithm and the fit method.
What am I missing?
EDIT: Here is the full code and data
data: https://s3-us-west-2.amazonaws.com/gjt-personal/test_model.csv
import pandas as pd
from sklearn.linear_model import LogisticRegression
df = pd.read_csv('test_model.csv',index_col=False)
columns1 =['col_1','col_2','col_3','col_4','col_5','col_6','col_7','col_8','col_9']
columns2 =['col_2','col_1','col_3','col_4','col_5','col_6','col_7','col_8','col_9']
logit_model = LogisticRegression(C=1e9, tol=1e-15)
logit_model.fit(df[columns1],df['target'])
logit_model.predict_proba(df[columns1])
logit_model.fit(df[columns2],df['target'])
logit_model.predict_proba(df[columns2])
Turns out it's something to do with tol=1e-15, because this gives a different result:
LogisticRegression(C=1e9, tol=1e-15)
But this gives the same result:
LogisticRegression(C=1e9)
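Here is a minimal check that reproduces this, reusing df, columns1 and columns2 from above (the np.abs(...).max() comparison is just my own way of quantifying the gap, not something from the original run):
import numpy as np
# Fit on both column orderings with the very tight tol and compare predicted probabilities
m_a = LogisticRegression(C=1e9, tol=1e-15)
m_b = LogisticRegression(C=1e9, tol=1e-15)
m_a.fit(df[columns1], df['target'])
m_b.fit(df[columns2], df['target'])
print(np.abs(m_a.predict_proba(df[columns1]) - m_b.predict_proba(df[columns2])).max())  # noticeably > 0
# Same comparison with the default tol: the maximum difference is essentially zero
m_c = LogisticRegression(C=1e9)
m_d = LogisticRegression(C=1e9)
m_c.fit(df[columns1], df['target'])
m_d.fit(df[columns2], df['target'])
print(np.abs(m_c.predict_proba(df[columns1]) - m_d.predict_proba(df[columns2])).max())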
Thanks for adding sample data.
Taking a deeper look at your data, it is clearly not standardized. If you apply a StandardScaler to the dataset and fit again, you will find that the prediction discrepancy disappears.
While this result is at least consistent, it is still troubling that the fit raises a LineSearchWarning and a ConvergenceWarning. To that I would say you have an exceedingly low tolerance here at 1e-15. Given that C=1e9 makes the regularization penalty negligible, raising tol back to the default 1e-4 has essentially no impact on the solution, and it allows the model to converge properly while still producing the same outcome (in a much faster run time).
My full process looks like this:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
df = pd.read_csv('test_model.csv', index_col=False)
# Standardize the covariates
ss = StandardScaler()
X = ss.fit_transform(df.drop('target', axis=1))
# Two column orderings: the original, and one with the first two columns swapped
cols1 = np.arange(9)
cols2 = np.array([1, 0, 2, 3, 4, 5, 6, 7, 8])
lr = LogisticRegression(solver='newton-cg', tol=1e-4, C=1e9)
lr.fit(X[:, cols1], df['target'])
preds_1 = lr.predict_proba(X[:, cols1])
lr.fit(X[:, cols2], df['target'])
preds_2 = lr.predict_proba(X[:, cols2])
preds_1
array([[ 0.00000000e+00, 1.00000000e+00],
[ 0.00000000e+00, 1.00000000e+00],
[ 0.00000000e+00, 1.00000000e+00],
...,
[ 1.00000000e+00, 9.09277801e-31],
[ 1.00000000e+00, 3.52079327e-35],
[ 1.00000000e+00, 5.99607407e-30]])
preds_2
array([[ 0.00000000e+00, 1.00000000e+00],
[ 0.00000000e+00, 1.00000000e+00],
[ 0.00000000e+00, 1.00000000e+00],
...,
[ 1.00000000e+00, 9.09277801e-31],
[ 1.00000000e+00, 3.52079327e-35],
[ 1.00000000e+00, 5.99607407e-30]])
The exact assertion preds_1 == preds_2 will still fail, but the differences are on the order of 1e-40 or smaller for each value, which I would say is far below any plausible level of significance.
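To confirm that numerically, a quick check along these lines works (using np.allclose with its default tolerances is my own suggestion, not part of the run above):
import numpy as np
# Exact equality fails, but the largest absolute difference between the two prediction arrays is tiny
print(np.abs(preds_1 - preds_2).max())   # on the order of 1e-40 for this data
print(np.allclose(preds_1, preds_2))     # True under the default tolerances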