I'm trying to perform feature selection by evaluating my regressions coefficient outputs, and select the features with the highest magnitude coefficients. The problem is, I don't know how to get the respective features, as only coefficients are returned form the coef._ attribute. The documentation says:
Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features.
I am passing into my regression.fit(A,B), where A is a 2-D array, with tfidf value for each feature in a document. Example format:
"feature1" "feature2" "Doc1" .44 .22 "Doc2" .11 .6 "Doc3" .22 .2
B are my target values for the data, which are just numbers 1-100 associated with each document:
"Doc1" 50 "Doc2" 11 "Doc3" 99
Using regression.coef_, I get a list of coefficients, but not their corresponding features! How can I get the features? I'm guessing I need to modfy the structure of my B targets, but I don't know how.
The parameter α is called the constant or intercept, and represents the expected response when xi=0. (This quantity may not be of direct interest if zero is not in the range of the data.) The parameter β is called the slope, and represents the expected increment in the response per unit change in xi. Yi=α+βxi+ϵi.
What I found to work was:
X = your independent variables
coefficients = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(logistic.coef_))], axis = 1)
The assumption you stated: that the order of regression.coef_ is the same as in the TRAIN set holds true in my experiences. (works with the underlying data and also checks out with correlations between X and y)
You can do that by creating a data frame:
cdf = pd.DataFrame(regression.coef_, X.columns, columns=['Coefficients']) print(cdf)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With