I have constructed a pipeline that takes a pandas dataframe that has been split into categorical and numerical columns. I am trying to run GridSearchCV on my results and ultimately look at the ranked features of importance for the best performing model that GridSearchCV selects. The problem I am encountering is that sklearn pipelines output numpy array objects and lose any column information along the way. Thus when I go to examine the most important coefficients of the model I am left with an unlabeled numpy array.
I have read that building a custom transformer might be a possible solution to this, but I do not have any experience doing so myself. I have also looked into leveraging the sklearn-pandas package, but I am hesitant to try and implement something that might not be updated in parallel with sklearn. Can anyone suggest what they believe is the best path to go about getting around this issue? I am also open to any literature that has hands on application of pandas and sklearn pipelines.
My Pipeline:
# impute and standardize numeric data
numeric_transformer = Pipeline([
('impute', SimpleImputer(missing_values=np.nan, strategy="mean")),
('scale', StandardScaler())
])
# impute and encode dummy variables for categorical data
categorical_transformer = Pipeline([
('impute', SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
('one_hot', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
clf = Pipeline([
('transform', preprocessor),
('ridge', Ridge())
])
Cross Validation:
kf = KFold(n_splits=4, shuffle=True, random_state=44)
cross_val_score(clf, X_train, y_train, cv=kf).mean()
Grid Search:
param_grid = {
'ridge__alpha': [.001, .1, 1.0, 5, 10, 100]
}
gs = GridSearchCV(clf, param_grid, cv = kf)
gs.fit(X_train, y_train)
Examining Coefficients:
model = gs.best_estimator_
predictions = model.fit(X_train, y_train).predict(X_test)
model.named_steps['ridge'].coef_
Here is the output of the model coefficients as it currently stands when performed on the seaborn "mpg" dataset:
array([-4.64782052e-01, 1.47805207e+00, -3.28948689e-01, -5.37033173e+00,
2.80000700e-01, 2.71523808e+00, 6.29170887e-01, 9.51627968e-01,
...
-1.50574860e+00, 1.88477450e+00, 4.57285471e+00, -6.90459868e-01,
5.49416409e+00])
Ideally I would like to preserve the pandas dataframe information and retrieve the derived column names after OneHotEncoder and the other methods are called.
I would actually go for creating column names from the input. If your input is already divided into numerical an categorical you can use pd.get_dummies
to get the number of different category for each categorical feature.
Then you can just create proper names for the columns as shown in the last part of this working example based on the question with some artificial data.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
# create aritificial data
numeric_features_vals = pd.DataFrame({'x1': [1, 2, 3, 4], 'x2': [0.15, 0.25, 0.5, 0.45]})
numeric_features = ['x1', 'x2']
categorical_features_vals = pd.DataFrame({'cat1': [0, 1, 1, 2], 'cat2': [2, 1, 5, 0] })
categorical_features = ['cat1', 'cat2']
X_train = pd.concat([numeric_features_vals, categorical_features_vals], axis=1)
X_test = pd.DataFrame({'x1':[2,3], 'x2':[0.2, 0.3], 'cat1':[0, 1], 'cat2':[2, 1]})
y_train = pd.DataFrame({'labels': [10, 20, 30, 40]})
# impute and standardize numeric data
numeric_transformer = Pipeline([
('impute', SimpleImputer(missing_values=np.nan, strategy="mean")),
('scale', StandardScaler())
])
# impute and encode dummy variables for categorical data
categorical_transformer = Pipeline([
('impute', SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
('one_hot', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
clf = Pipeline([
('transform', preprocessor),
('ridge', Ridge())
])
kf = KFold(n_splits=2, shuffle=True, random_state=44)
cross_val_score(clf, X_train, y_train, cv=kf).mean()
param_grid = {
'ridge__alpha': [.001, .1, 1.0, 5, 10, 100]
}
gs = GridSearchCV(clf, param_grid, cv = kf)
gs.fit(X_train, y_train)
model = gs.best_estimator_
predictions = model.fit(X_train, y_train).predict(X_test)
print('coefficients : ', model.named_steps['ridge'].coef_, '\n')
# create column names for categorical hot encoded data
columns_names_to_map = list(np.copy(numeric_features))
columns_names_to_map.extend('cat1_' + str(col) for col in pd.get_dummies(X_train['cat1']).columns)
columns_names_to_map.extend('cat2_' + str(col) for col in pd.get_dummies(X_train['cat2']).columns)
print('columns after preprocessing :', columns_names_to_map, '\n')
print('#'*80)
print( '\n', 'dataframe of rescaled features with custom colum names: \n\n', pd.DataFrame({col:vals for vals, col in zip (preprocessor.fit_transform(X_train).T, columns_names_to_map)}))
print('#'*80)
print( '\n', 'dataframe of ridge coefficients with custom colum names: \n\n', pd.DataFrame({col:vals for vals, col in zip (model.named_steps['ridge'].coef_.T, columns_names_to_map)}))
the code above (in the end) prints out the following dataframe which is a map from parameter name to parameter value:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With