I have successfully run a logistic regression model using scikit-learn's SGDClassifier, but I cannot easily interpret the model's coefficients (accessed via SGDClassifier.coef_) because the input data was transformed with scikit-learn's OneHotEncoder.
My original input data X is of shape (12000, 11):
X = np.array([[1,4,3...9,4,1],
[5,9,2...3,1,4],
...
[7,8,1...6,7,8]
])
I then applied one hot encoding:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
X_OHE = enc.fit_transform(X).toarray()
which produces an array of shape (12000, 696):
X_OHE = np.array([[1,0,1...0,0,1],
[0,0,0...0,1,0],
...
[1,0,1...0,0,1]
])
I then access the model's coefficients with SGDClassifier.coef_, which produces an array of shape (1, 696):
coefs = np.array([[-1.233e+00,0.9123e+00,-2.431e+00...-0.238e+01,-1.33e+00,0.001e-01]])
How do I map the coefficient values back to the original values in X, so I can say something like, "if variable foo has a value of bar, the target variable increases/decreases by bar_coeff"?
Let me know if you need more info on the data or the model parameters. Thank you.
I found one unanswered question about this on SO: How to retrieve coefficient names after label encoding and one hot encoding on scikit-learn?
After reviewing this user's detailed explanation of OneHotEncoder here, I was able to create a (somewhat hack-y) approach to relating model coefficients back to the original data set.
Assuming you've correctly set up your OneHotEncoder:
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
enc = OneHotEncoder()
X_OHE = enc.fit_transform(X) # X and X_OHE as described in question
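And you have successfully run a GLM, say:
from sklearn import linear_model
# Note: the default loss is 'hinge' (a linear SVM); set the loss parameter
# to the logistic loss if you want logistic regression.
clf = linear_model.SGDClassifier()
# X_train, y_train: a training split of X_OHE and the target
clf.fit(X_train, y_train)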
Which has coefficients clf.coef_:
print(clf.coef_)
# np.array([[-1.233e+00,0.9123e+00,-2.431e+00...-0.238e+01,-1.33e+00,0.001e-01]])
You can use the approach below to trace the encoded 1's and 0's in X_OHE back to the original values in X. I'd recommend reading the detailed explanation of OneHotEncoder mentioned above (link at top), or else the code below will seem like gibberish. In a nutshell, it iterates over each encoded feature (column) in X_OHE and uses the active_features_ and feature_indices_ attributes of the fitted enc to make the translation. Note that these two attributes belong to older scikit-learn releases: they were deprecated in version 0.20 and removed in 0.22.
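To make the offset arithmetic concrete, here is a minimal sketch on hand-written arrays. The names feature_indices and active_features, and the values in them, are hypothetical stand-ins for the encoder's attributes. They assume two original columns, the first with 3 possible values and the second with 4, where one value of the second column never appeared in training.

```python
import numpy as np

# Hypothetical stand-ins for the legacy encoder attributes (not read from a real encoder).
feature_indices = np.array([0, 3, 7])            # start offset of each original column's block
active_features = np.array([0, 1, 2, 3, 5, 6])   # offsets that actually occurred in training

decoded = []
for f in active_features:
    # Keep the block starts that are <= f; the last one is the block containing f.
    index_range = np.extract(feature_indices <= f, feature_indices)
    col = len(index_range) - 1      # index of the original column
    value = f - index_range[-1]     # position inside the block = original integer value
    decoded.append((int(col), int(value)))

print(decoded)
```

Each pair is (original column, original integer value), listed in the same order as the columns of the encoded matrix, which is also the order of the coefficients.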
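import pandas as pd
import numpy as np

results = []
for i in range(enc.active_features_.shape[0]):
    # Offset of the i-th encoded column within the full one-hot index space
    f = enc.active_features_[i]
    # Block start offsets that are <= f; the last one is the block containing f
    index_range = np.extract(enc.feature_indices_ <= f, enc.feature_indices_)
    s = len(index_range) - 1       # index of the original column in X (not stored below)
    f_index = index_range[-1]      # start offset of that column's block
    f_label_decoded = f - f_index  # original integer-coded value
    results.append({
        'label_decoded_value': f_label_decoded,
        'coefficient': clf.coef_[0][i]
    })
R = pd.DataFrame.from_records(results)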
Where R looks like this (I originally encoded the names of company departments):
coefficient label_decoded_value
3.929413 DepartmentFoo1
3.718078 DepartmentFoo2
3.101869 DepartmentFoo3
2.892845 DepartmentFoo4
...
So now you can say, "The target variable increases by 3.929413 when an employee is in department 'Foo1'" (for a logistic model, this is an increase in the log-odds of the positive class).
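On scikit-learn 0.20 and later, the fitted encoder exposes a categories_ attribute, which gives the same mapping without any offset arithmetic. The following is a minimal sketch, not a drop-in replacement. The toy X and y are made-up data, mapping is a name introduced here for illustration, and the classifier is left at its default loss because only the column order matters for the mapping.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import OneHotEncoder

# Made-up stand-in for X: two integer-coded categorical columns.
X = np.array([[0, 1], [1, 0], [2, 1], [0, 2], [1, 2], [2, 0]])
y = np.array([0, 1, 1, 0, 1, 0])

enc = OneHotEncoder()
X_OHE = enc.fit_transform(X)  # sparse matrix, one column per (column, value) pair

clf = SGDClassifier(random_state=0)
clf.fit(X_OHE, y)

# enc.categories_ holds one sorted array of values per original column;
# the encoded columns follow that order, column by column.
mapping = [
    (col, value)
    for col, cats in enumerate(enc.categories_)
    for value in cats
]

# Pair every encoded column with its coefficient.
for (col, value), coef in zip(mapping, clf.coef_[0]):
    print(f"column {col} == {value}: coefficient {coef:+.3f}")
```

If the columns were label-encoded first, as with the department names above, the integer value can then be passed through the label encoder's inverse transform to recover the original name.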