How to adjust scaled scikit-learn Logistic Regression coeffs to score a non-scaled dataset?

I am currently using Scikit-Learn's LogisticRegression to build a model. I have used

from sklearn import preprocessing
scaler=preprocessing.StandardScaler().fit(build)
build_scaled = scaler.transform(build)

to scale all of my input variables prior to training the model. Everything works fine and produces a decent model, but my understanding is that the coefficients in LogisticRegression.coef_ are based on the scaled variables. Is there a transformation of those coefficients that produces coefficients which can be applied directly to the non-scaled data?

I am thinking ahead to an implementation of the model in a productionized system, and attempting to determine whether all of the variables need to be pre-processed in some way in production before scoring with the model.

Note: the model will likely have to be re-coded within the production environment and the environment is not using python.

asked Jan 08 '23 by caposeidon

2 Answers

You have to divide by the scaling you applied to normalise the feature, but also multiply by the scaling that you applied to the target.

Suppose

  • each feature variable x_i was scaled (divided) by scale_x_i

  • the target variable was scaled (divided) by scale_y

then

orig_coef_i = coef_i_found_on_scaled_data / scale_x_i * scale_y

Here's an example using pandas and sklearn LinearRegression

from sklearn.datasets import load_boston  # NB: removed in scikit-learn 1.2
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

import numpy as np
import pandas as pd

boston = load_boston()
# Looking at the description of the data tells us the target variable name
# print(boston.DESCR)
data = pd.DataFrame(
    data = np.c_[boston.data, boston.target],
    columns = list(boston.feature_names) + ['MVAL'],
)
data.head()

X = boston.data
y = boston.target

lr = LinearRegression()
lr.fit(X,y)

orig_coefs = lr.coef_

coefs1 = pd.DataFrame(
    data={
        'feature': boston.feature_names, 
        'orig_coef' : orig_coefs, 
    }
)
coefs1

This shows us our coefficients for a linear regression with no scaling applied.

#  | feature| orig_coef
# 0| CRIM   | -0.107171
# 1| ZN     |  0.046395
# 2| INDUS  |  0.020860
# etc

We now normalise all our variables

# Now we normalise the data
scalerX = StandardScaler().fit(X)
scalery = StandardScaler().fit(y.reshape(-1,1)) # Have to reshape to avoid warnings

normed_X = scalerX.transform(X)
normed_y = scalery.transform(y.reshape(-1,1)) # Have to reshape to avoid warnings

normed_y = normed_y.ravel() # Turn y back into a vector again

# Check it's worked
# print(np.mean(normed_X, axis=0), np.mean(normed_y, axis=0)) # Should be ~0s
# print(np.std(normed_X, axis=0), np.std(normed_y, axis=0))   # Should be ~1s

We can do the regression again on this normalised data...

# Now we redo our regression
lr = LinearRegression()
lr.fit(normed_X, normed_y)

coefs2 = pd.DataFrame(
    data={
        'feature' : boston.feature_names,
        'orig_coef' : orig_coefs,
        'norm_coef' : lr.coef_,
        'scaleX' : scalerX.scale_,
        'scaley' : scalery.scale_[0],
    },
    columns=['feature', 'orig_coef', 'norm_coef', 'scaleX', 'scaley']
)
coefs2

...and apply the scaling to get back our original coefficients

# We can recreate our original coefficients by dividing by the
# scale of the feature (scaleX) and multiplying by the scale
# of the target (scaleY)
coefs2['rescaled_coef'] = coefs2.norm_coef / coefs2.scaleX * coefs2.scaley
coefs2

When we do this we see that we have recreated our original coefficients.

#  | feature| orig_coef| norm_coef|    scaleX|   scaley| rescaled_coef
# 0| CRIM   | -0.107171| -0.100175|  8.588284| 9.188012| -0.107171
# 1| ZN     |  0.046395|  0.117651| 23.299396| 9.188012|  0.046395
# 2| INDUS  |  0.020860|  0.015560|  6.853571| 9.188012|  0.020860
# 3| CHAS   |  2.688561|  0.074249|  0.253743| 9.188012|  2.688561

For some machine learning methods, the target variable y must be normalised as well as the feature variables x. If you've done that, you need to include this "multiply by the scale of y" step as well as the "divide by the scale of x_i" step to get back the original regression coefficients.

Hope that helps

answered Jan 10 '23 by Gareth Williams


Short answer, to get LogisticRegression coefficients and intercept for unscaled data (assuming binary classification, and lr is a trained LogisticRegression object):

  1. you must divide your coefficient array element-wise by the scaler.scale_ array (available since v0.17): coefficients = np.true_divide(lr.coef_, scaler.scale_)

  2. you must subtract from your intercept the inner product of those resulting coefficients with the scaler.mean_ array: intercept = lr.intercept_ - np.dot(coefficients, scaler.mean_)

You can see why the above needs to be done if you consider that every feature is normalized by subtracting its mean (stored in the scaler.mean_ array) and then dividing by its standard deviation (stored in the scaler.scale_ array).
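A minimal sketch of the two steps above, using synthetic data (make_classification is used here only for illustration; the original question's data is not available): train on scaled features, un-scale the coefficients and intercept, and confirm that scoring the raw data reproduces the scaled model's decision function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Illustrative binary-classification data
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

scaler = StandardScaler().fit(X)
lr = LogisticRegression().fit(scaler.transform(X), y)

# Step 1: divide the coefficients element-wise by scaler.scale_
coefficients = np.true_divide(lr.coef_, scaler.scale_)

# Step 2: subtract the inner product of the new coefficients
# with scaler.mean_ from the intercept
intercept = lr.intercept_ - np.dot(coefficients, scaler.mean_)

# The un-scaled coefficients now score the raw data identically
scores_scaled = lr.decision_function(scaler.transform(X))
scores_raw = X @ coefficients.ravel() + intercept
print(np.allclose(scores_scaled, scores_raw))  # True
```

In a non-Python production environment, scoring then reduces to the plain linear form `sum_i(coefficients[i] * x[i]) + intercept` on the raw inputs, followed by the sigmoid if probabilities are needed.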

answered Jan 10 '23 by Pascal Antoniou