
Integrate Keras to SKLearn Pipeline?

I have a sklearn pipeline performing feature engineering on heterogeneous data types (boolean, categorical, numeric, text) and wanted to try a neural network as my learning algorithm to fit the model. I am running into some problems with the shape of the input data.

I am wondering whether what I am trying to do is even possible, or whether I should try a different approach.

I have tried a couple of different methods but am receiving these errors:

  1. Error when checking input: expected dense_22_input to have shape (11,) but got array with shape (30513,) => I have 11 input features, so I then tried converting my X and y to arrays, and now get the next error

  2. ValueError: Specifying the columns using strings is only supported for pandas DataFrames => which I think is caused by the ColumnTransformer(), where I specify the columns by name

print(X_train_OS.shape)
print(y_train_OS.shape)

(22354, 11)
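(22354,)

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import to_categorical  # one-hot encodes the labels
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler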

X_train_predictors = df_train_OS.drop("label", axis=1)
X_train_predictors = X_train_predictors.values
y_train_target = to_categorical(df_train_OS["label"])

y_test_predictors = test_set.drop("label", axis=1)
y_test_predictors = y_test_predictors.values
y_test_target = to_categorical(test_set["label"])

print(X_train_predictors.shape)
print(y_train_target.shape)

(22354, 11)
(22354, 2)

def keras_classifier_wrapper():
    clf = Sequential()
    clf.add(Dense(32, input_dim=11, activation='relu'))
    clf.add(Dense(2, activation='softmax'))
    clf.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
    return clf

# raw string: same pattern, without relying on the invalid "\-" escape in a normal string
TOKENS_ALPHANUMERIC_HYPHEN = r"[A-Za-z0-9\-]+(?=\s+)"

boolTransformer = Pipeline(steps=[
    ('bool', PandasDataFrameSelector(BOOL_FEATURES))])

catTransformer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('cat_ohe', OneHotEncoder(handle_unknown='ignore'))])

numTransformer = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('num_scaler', StandardScaler())])

textTransformer_0 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=TOKENS_ALPHANUMERIC_HYPHEN,
                                 stop_words=stopwords))])

textTransformer_1 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=TOKENS_ALPHANUMERIC_HYPHEN,
                                 stop_words=stopwords))])

FE = ColumnTransformer(
    transformers=[
        ('bool', boolTransformer, BOOL_FEATURES),
        ('cat', catTransformer, CAT_FEATURES),
        ('num', numTransformer, NUM_FEATURES),
        ('text0', textTransformer_0, TEXT_FEATURES[0]),
        ('text1', textTransformer_1, TEXT_FEATURES[1])])

clf = KerasClassifier(keras_classifier_wrapper, epochs=100, batch_size=500, verbose=0)

PL = Pipeline(steps=[('feature_engineer', FE),
                     ('keras_clf', clf)])

PL.fit(X_train_predictors, y_train_target)
#PL.fit(X_train_OS, y_train_OS)

I think I understand the problem here, but I am not sure how to solve it. If it is not possible to integrate an sklearn ColumnTransformer + Pipeline with a Keras model, does Keras have a good way of doing feature engineering on mixed data types? Thank you!

asked May 02 '19 15:05 by thePurplePython



1 Answer

It looks like you are passing your 11 columns of original data through your various column transformers, and the number of features expands to 30,513 (after count vectorizing your text, one-hot encoding, etc.). Your neural network architecture is set up to accept only 11 input features but is being passed your (now transformed) 30,513 features, which is what error 1 is reporting.

You therefore need to amend the input_dim of your neural network to match the number of features being created in the feature extraction pipeline.
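One way to find that number is to fit the feature-engineering step on its own and read the width of its output. The following is only a minimal sketch under assumptions: the toy DataFrame, its column names, and the `fe` transformer are hypothetical stand-ins for the question's data and `FE`, and no Keras model is involved.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the question's training DataFrame.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "red"],
    "size": [1.0, 2.5, 3.0, 0.5],
    "note": ["fast red car", "slow blue bike", "green bus", "red bike"],
})

fe = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ("num", StandardScaler(), ["size"]),
    # A single column name (not a list) hands CountVectorizer a 1-D column of strings.
    ("text", CountVectorizer(), "note"),
])

transformed = fe.fit_transform(df)

n_features_in = df.shape[1]            # columns before feature engineering
n_features_out = transformed.shape[1]  # columns the classifier would actually receive
print(n_features_in, "->", n_features_out)
```

Here the three input columns grow to twelve (three one-hot columns, one scaled number, eight vocabulary terms). The same `shape[1]` check on the real `FE` output gives the value that `input_dim` would have to match.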

One thing you could do is add an intermediate step between them, such as SelectKBest with k set to something like 20,000, so that you know exactly how many features will eventually be passed to the classifier.

There is a good guide and flowchart on the Google machine learning website - link - and in the flowchart you can see a 'select top k features' step in the pipeline before the model is trained.
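As a rough illustration of how a 'select top k features' step fixes the output width, here is a small sketch using SelectKBest on made-up data. The random matrix, the value k=5, and all variable names are assumptions for the demo only. It also shows one caveat worth checking in your own setup: with the default score function (`f_classif`), SelectKBest expects 1-D class labels, so a one-hot label matrix like the one produced by `to_categorical` is rejected.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 12))   # stand-in for the feature-engineered matrix
y = np.arange(40) % 2           # 1-D integer class labels (0/1)

selector = SelectKBest(k=5)     # default score function is f_classif
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)          # width is now k, whatever the input width was

# The same selector fed one-hot labels of shape (40, 2) raises a ValueError.
y_one_hot = np.eye(2)[y]
try:
    SelectKBest(k=5).fit(X, y_one_hot)
    one_hot_rejected = False
except ValueError as err:
    one_hot_rejected = True
    print("one-hot labels rejected:", err)
```

In the question's pipeline this would probably mean fitting with the integer label column rather than `y_train_target`, and then checking that the loss used in the Keras model still matches the label format.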

So, try updating these parts of your code to:

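def keras_classifier_wrapper():
    clf = Sequential()
    # input_dim must equal the number of columns reaching the network (k in SelectKBest)
    clf.add(Dense(32, input_dim=20000, activation='relu'))
    # two output units, one per class of the one-hot encoded label
    clf.add(Dense(2, activation='softmax'))
    clf.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
    return clf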

and

from sklearn.feature_selection import SelectKBest

# keep the 20,000 highest-scoring features so the network input width is fixed
select_best_features = SelectKBest(k=20000)

PL = Pipeline(steps=[('feature_engineer', FE),
                     ('select_k_best', select_best_features),
                     ('keras_clf', clf)])
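
The second error in the question has a separate cause: the ColumnTransformer selects columns by name, and calling `.values` on the DataFrame throws those names away. A minimal sketch of the difference, assuming a hypothetical two-column frame and a single-transformer `ct`:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical frame; the column names are illustrative only.
df = pd.DataFrame({"height": [1.0, 2.0, 3.0], "width": [4.0, 5.0, 6.0]})

# Columns are selected by string name, as in the question's FE.
ct = ColumnTransformer(transformers=[("num", StandardScaler(), ["height"])])

# Passing the DataFrame works: names can be resolved to columns.
out_from_frame = ct.fit_transform(df)
print(out_from_frame.shape)

# Passing the bare NumPy array fails: there are no column names to look up.
try:
    ct.fit_transform(df.values)
    array_rejected = False
except ValueError as err:
    array_rejected = True
    print("array rejected:", err)
```

So the pipeline should probably be fitted on the DataFrame itself (for example `df_train_OS.drop("label", axis=1)` without the `.values` step), leaving the conversion to arrays to the transformers.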
answered Oct 11 '22 02:10 by Matt