
LogisticRegression scikit learn covariate (column) order matters on training

For some reason the order of the covariates seems to matter with a LogisticRegression classifier in scikit-learn, which seems odd to me. I have 9 covariates and a binary output, and when I change the order of the columns, call fit(), and then call predict_proba(), the output is different. A toy example is below.

logit_model = LogisticRegression(C=1e9, tol=1e-15)

The following

logit_model.fit(df[['column_2','column_1']], df['target'])
logit_model.predict_proba(df[['column_2','column_1']])

array([[ 0.27387109,  0.72612891], ...])

gives a different result to:

logit_model.fit(df[['column_1','column_2']], df['target'])
logit_model.predict_proba(df[['column_1','column_2']])

array([[ 0.26117794,  0.73882206], ...])

This seems surprising to me, but maybe that's just my lack of knowledge about the internals of the algorithm and the fit method.

What am I missing?

EDIT: Here is the full code and data.

data: https://s3-us-west-2.amazonaws.com/gjt-personal/test_model.csv

import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('test_model.csv', index_col=False)

# same nine features, with the first two columns swapped
columns1 = ['col_1','col_2','col_3','col_4','col_5','col_6','col_7','col_8','col_9']
columns2 = ['col_2','col_1','col_3','col_4','col_5','col_6','col_7','col_8','col_9']

logit_model = LogisticRegression(C=1e9, tol=1e-15)

logit_model.fit(df[columns1],df['target'])
logit_model.predict_proba(df[columns1])

logit_model.fit(df[columns2],df['target'])
logit_model.predict_proba(df[columns2])

It turns out it's something to do with tol=1e-15, because this gives a different result:

LogisticRegression(C=1e9, tol=1e-15)

But this gives the same result:

LogisticRegression(C=1e9)
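
Here is a minimal sketch of the check I ran, reusing df, columns1 and columns2 from the snippet above:

import numpy as np

# Compare predict_proba across the two column orders for each setting;
# on this data, tol=1e-15 prints False and the default tol prints True.
for params in ({'C': 1e9, 'tol': 1e-15}, {'C': 1e9}):
    m = LogisticRegression(**params)
    p1 = m.fit(df[columns1], df['target']).predict_proba(df[columns1])
    p2 = m.fit(df[columns2], df['target']).predict_proba(df[columns2])
    print(params, np.allclose(p1, p2))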
asked Nov 07 '22 by Glen Thompson


1 Answer

Thanks for adding sample data.

Taking a deeper look at your data, it is clearly not standardized. If you apply a StandardScaler to the dataset and fit again, you will find that the prediction discrepancy disappears.

While this result is at least consistent, it is still troubling that the fit raises a LineSearchWarning and a ConvergenceWarning. To that I would say: tol=1e-15 is an exceedingly low tolerance. Given that C=1e9 makes the regularization penalty negligible anyway, relaxing tol to the default 1e-4 has essentially no impact on the fitted model. It allows the solver to properly converge and still produces the same outcome (with a much faster run time).

My full process looks like this:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('test_model.csv', index_col=False)

ss = StandardScaler()
cols1 = np.arange(9)                             # original column order
cols2 = np.array([1, 0, 2, 3, 4, 5, 6, 7, 8])    # first two columns swapped
X = ss.fit_transform(df.drop('target', axis=1))  # standardize the features

lr = LogisticRegression(solver='newton-cg', tol=1e-4, C=1e9)
lr.fit(X[:, cols1], df['target'])
preds_1 = lr.predict_proba(X[:, cols1])

lr.fit(X[:, cols2], df['target'])
preds_2 = lr.predict_proba(X[:, cols2])

preds_1 
array([[  0.00000000e+00,   1.00000000e+00],
       [  0.00000000e+00,   1.00000000e+00],
       [  0.00000000e+00,   1.00000000e+00],
       ...,
       [  1.00000000e+00,   9.09277801e-31],
       [  1.00000000e+00,   3.52079327e-35],
       [  1.00000000e+00,   5.99607407e-30]])

preds_2
array([[  0.00000000e+00,   1.00000000e+00],
       [  0.00000000e+00,   1.00000000e+00],
       [  0.00000000e+00,   1.00000000e+00],
       ...,
       [  1.00000000e+00,   9.09277801e-31],
       [  1.00000000e+00,   3.52079327e-35],
       [  1.00000000e+00,   5.99607407e-30]])

An exact equality assertion on preds_1 and preds_2 will fail, but the difference is on the order of 1e-40 or smaller for each value, which I would say is well beyond any plausible level of significance.
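
For example, a quick way to quantify that (assuming preds_1 and preds_2 from the run above):

import numpy as np

# Exact equality fails, but the largest elementwise gap is vanishingly small.
print(np.abs(preds_1 - preds_2).max())  # ~1e-40 on this run
print(np.allclose(preds_1, preds_2))    # True under NumPy's default tolerances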

answered Nov 14 '22 by Grr