How to Retrieve Original Variables After Scikit Model Run w/OneHotEncoding

I have successfully run a logistic regression model using scikit-learn's SGDClassifier, but I cannot easily interpret the model's coefficients (accessed via SGDClassifier.coef_) because the input data was transformed with scikit-learn's OneHotEncoder.

My original input data X is of shape (12000,11):

X = np.array([[1,4,3...9,4,1],
              [5,9,2...3,1,4],
              ...
              [7,8,1...6,7,8]
              ])

I then applied one hot encoding:

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
X_OHE = enc.fit_transform(X).toarray()

which produces an array of shape (12000, 696):

X_OHE = np.array([[1,0,1...0,0,1],
                 [0,0,0...0,1,0],
                  ...
                 [1,0,1...0,0,1]
                 ])
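
For context, here is a tiny made-up example (not my real data) of where the extra columns come from: each original column becomes one indicator column per distinct value it contains, so the encoded width is the sum of the per-column cardinalities.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy illustration only: 2 columns holding 3 and 2 distinct values
# expand into 3 + 2 = 5 indicator columns.
X_toy = np.array([[0, 1],
                  [1, 0],
                  [2, 1]])
print(OneHotEncoder().fit_transform(X_toy).toarray().shape)   # (3, 5)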

I then access the model's coefficients with SGDClassifier.coef_, which produces an array of shape (1, 696):

coefs = np.array([[-1.233e+00,0.9123e+00,-2.431e+00...-0.238e+01,-1.33e+00,0.001e-01]])
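
As a quick sanity check, there is one coefficient per encoded column, so the two shapes line up:

# one coefficient per one-hot column
assert coefs.shape[1] == X_OHE.shape[1]   # 696 == 696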

How do I map the coefficient values back to the original values in X, so I can say something like, "if variable foo has a value of bar, the target variable increases/decreases by bar_coeff"?

Let me know if you need more info on the data or the model parameters. Thank you.

I found one unanswered question about this on SO: How to retrieve coefficient names after label encoding and one hot encoding on scikit-learn?

asked Jul 11 '17 by NickBraunagel

1 Answer

After reviewing this user's detailed explanation of OneHotEncoder here, I was able to create a (somewhat hacky) approach for relating model coefficients back to the original data set.

Assuming you've correctly set up your OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder
from scipy import sparse

enc = OneHotEncoder()
X_OHE = enc.fit_transform(X)   # X and X_OHE as described in question

And you have successfully run a GLM model, say:

from sklearn import linear_model

clf = linear_model.SGDClassifier()
clf.fit(X_train, y_train)

Which has coefficients clf.coef_:

print(clf.coef_)
# np.array([[-1.233e+00,0.9123e+00,-2.431e+00...-0.238e+01,-1.33e+00,0.001e-01]])

You can use the approach below to trace the encoded 1's and 0's in X_OHE back to the original values in X. I'd recommend reading the detailed explanation of OneHotEncoding mentioned above (link at top), otherwise the below will seem like gibberish. In a nutshell, it iterates over each active feature in X_OHE and uses the feature_indices_ attribute internal to enc to make the translation.

import pandas as pd
import numpy as np

results = []

# enc.active_features_ lists the encoded columns that actually appear in X_OHE;
# enc.feature_indices_ holds the cumulative offsets at which each original
# column's block of indicator columns begins.
for i in range(enc.active_features_.shape[0]):
    f = enc.active_features_[i]

    # all block offsets at or before this encoded column
    index_range = np.extract(enc.feature_indices_ <= f, enc.feature_indices_)
    s = len(index_range) - 1       # index of the original column in X (not used below)
    f_index = index_range[-1]      # offset where that column's block starts
    f_label_decoded = f - f_index  # original (label-encoded) value

    results.append({
            'label_decoded_value': f_label_decoded,
            'coefficient': clf.coef_[0][i]
        })

R = pd.DataFrame.from_records(results)

Where R looks like this (I originally label-encoded the names of company departments):

coefficient label_decoded_value
3.929413    DepartmentFoo1
3.718078    DepartmentFoo2
3.101869    DepartmentFoo3
2.892845    DepartmentFoo4
...

So now you can say, "The target variable increases by 3.929413 when an employee is in department 'Foo1'."
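
Note that active_features_ and feature_indices_ only exist on older scikit-learn releases (they were later deprecated and removed). On a recent version, a rough equivalent of the mapping above, just a sketch and untested against the original data, is to line the coefficients up with the names produced by enc.get_feature_names_out():

import pandas as pd

# Sketch for newer scikit-learn (>= 1.0), where OneHotEncoder exposes
# get_feature_names_out() instead of active_features_/feature_indices_.
# Generated names look like 'x3_7', i.e. original column 3, value 7.
R = pd.DataFrame({
    'encoded_feature': enc.get_feature_names_out(),
    'coefficient': clf.coef_[0],
})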

answered Sep 28 '22 by NickBraunagel