
How to speed up nested cross-validation in Python?

From what I've found there is one other question like this (Speed-up nested cross-validation), but installing MPI does not work for me after trying several fixes suggested on this site and by Microsoft, so I am hoping there is another package or an answer to this question.

I am looking to compare multiple algorithms and grid-search a wide range of parameters (maybe too many parameters?). What ways are there, besides mpi4py, to speed up my code? As I understand it, I cannot use n_jobs=-1 in the outer loop, as that would then not be nested?

Also to note: I have not been able to run this with the full parameter grids below (it runs longer than I have time for); I only get results after about 2 hours if I give each model just 2 parameter values to compare. I run this code on a dataset of 252 rows and 25 feature columns, predicting one of 4 categories ('certain', 'likely', 'possible', or 'unlikely') for whether a gene (252 genes in total) affects a disease. Using SMOTE increases the sample size to 420, which is what then goes into the models.

# imports needed by the code below
import pandas as pd
from sklearn import preprocessing, model_selection
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import KFold, GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.wrappers.scikit_learn import KerasClassifier

dataset = pd.read_csv('data.csv')
data = dataset.drop(columns=["gene"])  # drop the gene identifier column
df = data.iloc[:, 0:24]
df = df.fillna(0)
X = MinMaxScaler().fit_transform(df)

le = preprocessing.LabelEncoder()
# fit_transform on the column both fits the encoder and encodes the labels
# ('certain'/'likely'/'possible'/'unlikely' -> 0..3); a separate prior fit
# on the class names would be overwritten here anyway
Y = le.fit_transform(data["category"])

sm = SMOTE(random_state=100)
X_res, y_res = sm.fit_resample(X, Y)

seed = 7
logreg = LogisticRegression(penalty='l1', solver='liblinear', multi_class='auto')
LR_par= {'penalty':['l1'], 'C': [0.5, 1, 5, 10], 'max_iter':[500, 1000, 5000]}

rfc = RandomForestClassifier()
param_grid = {'bootstrap': [True, False],
              'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
              'max_features': ['auto', 'sqrt'],
              'min_samples_leaf': [1, 2, 4, 25],
              'min_samples_split': [2, 5, 10, 25],
              'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}

mlp = MLPClassifier(random_state=seed)
parameter_space = {'hidden_layer_sizes': [(10,20), (10,20,10), (50,)],
     'activation': ['tanh', 'relu'],
     'solver': ['adam', 'sgd'],
     'max_iter': [10000],
     'alpha': [0.1, 0.01, 0.001],
     'learning_rate': ['constant','adaptive']}

gbm = GradientBoostingClassifier(min_samples_split=25, min_samples_leaf=25)
param = {"loss":["deviance"],
    "learning_rate": [0.15,0.1,0.05,0.01,0.005,0.001],
    "min_samples_split": [2, 5, 10, 25],
    "min_samples_leaf": [1, 2, 4,25],
    "max_depth":[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    "max_features":['auto', 'sqrt'],
    "criterion": ["friedman_mse"],
    "n_estimators":[200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
    }

svm = SVC(gamma="scale", probability=True)
tuned_parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 0.25, 0.5, 0.75)}

def baseline_model(optimizer='adam', learn_rate=0.01):
    model = Sequential()
    model.add(Dense(100, input_dim=X_res.shape[1], activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(50, activation='relu'))  # second hidden layer with 50 units
    model.add(Dense(4, activation='softmax'))  # one output unit per class
    # Note: learn_rate is accepted so GridSearchCV can pass it, but as written
    # it is never applied; the optimizer's default learning rate is used.
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

keras = KerasClassifier(build_fn=baseline_model, batch_size=32, epochs=100, verbose=0)
learn_rate = [0.001, 0.01, 0.1, 0.2, 0.3]
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
kerasparams = dict(optimizer=optimizer, learn_rate=learn_rate)

inner_cv = KFold(n_splits=10, shuffle=True, random_state=seed)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=seed)

models = []
models.append(('GBM', GridSearchCV(gbm, param, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('RFC', GridSearchCV(rfc, param_grid, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('LR', GridSearchCV(logreg, LR_par, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('SVM', GridSearchCV(svm, tuned_parameters, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('MLP', GridSearchCV(mlp, parameter_space, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('Keras', GridSearchCV(estimator=keras, param_grid=kerasparams, cv=inner_cv, iid=False, n_jobs=1)))


results = []
names = []
scoring = 'accuracy'
X_train, X_test, Y_train, Y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=0)


for name, model in models:
    nested_cv_results = model_selection.cross_val_score(model, X_res, y_res, cv=outer_cv, scoring=scoring)
    results.append(nested_cv_results)
    names.append(name)
    msg = "Nested CV Accuracy %s: %f (+/- %f )" % (name, nested_cv_results.mean()*100, nested_cv_results.std()*100)
    print(msg)
    model.fit(X_train, Y_train)
    print('Test set accuracy: {:.2f}'.format(model.score(X_test, Y_test)*100),  '%')
    print("Best Parameters: \n{}\n".format(model.best_params_))
    print("Best CV Score: \n{}\n".format(model.best_score_))

As an example, most of the dataset is binary and looks like this:

gene   Tissue    Druggable Eigenvalue CADDvalue Catalogpresence   Category
ACE      1           1         1          0           1            Certain
ABO      1           0         0          0           0            Likely
TP53     1           1         0          0           0            Possible

Any guidance on how I could speed this up would be appreciated.

Edit: I have also tried using parallel processing with Dask, but I am not sure I am doing it right, and it doesn't seem to run any faster:

import joblib

for name, model in models:
    with joblib.parallel_backend('dask'):
        nested_cv_results = model_selection.cross_val_score(model, X_res, y_res, cv=outer_cv, scoring=scoring)
        results.append(nested_cv_results)
        names.append(name)
        msg = "Nested CV Accuracy %s: %f (+/- %f )" % (name, nested_cv_results.mean()*100, nested_cv_results.std()*100)
        print(msg)
        model.fit(X_train, Y_train)
        print('Test set accuracy: {:.2f}'.format(model.score(X_test, Y_test)*100), '%')
        #print("Best Estimator: \n{}\n".format(model.best_estimator_))
        print("Best Parameters: \n{}\n".format(model.best_params_))
        print("Best CV Score: \n{}\n".format(model.best_score_))  # mean CV score of the best parameter combination

Edit: also to note, on reducing the grid search: I have tried with, for example, 5 parameter values per model, but this still takes several hours to complete. So while trimming down the grids will help, if there is any advice for efficiency beyond that I would be grateful.

asked Apr 23 '19 by DN1



3 Answers

Two things:

  1. Instead of GridSearchCV, try HyperOpt - it's a Python library for serial and parallel hyperparameter optimization (a minimal sketch is at the end of this answer).

  2. I would reduce the dimensionality by using UMAP or PCA. UMAP is probably the better choice.

After you apply SMOTE:

import umap

# n_neighbors and min_dist are UMAP's main tuning knobs;
# fit on the SMOTE output (X_res in the question's code)
dim_reduced = umap.UMAP(
        min_dist=0.1,       # example value
        n_neighbors=15,     # example value
        random_state=1234,
    ).fit_transform(X_res)

And then you can use dim_reduced for the train test split.

Reducing the dimensionality will help to remove noise from the data: instead of dealing with 25 features, you'll bring them down to 2 (UMAP's default n_components) or to however many components you choose (with PCA), which should have a significant impact on the runtime.
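
As promised in point 1, here is a minimal HyperOpt sketch for the SVM from the question. The search space bounds and max_evals are illustrative assumptions, not tuned values:

from hyperopt import fmin, tpe, hp
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Search space: C on a log scale, kernel as a categorical choice
space = {'C': hp.loguniform('C', -2, 2),
         'kernel': hp.choice('kernel', ['linear', 'rbf'])}

def objective(params):
    clf = SVC(gamma='scale', **params)
    # hyperopt minimizes, so return the negated mean CV accuracy
    return -cross_val_score(clf, X_res, y_res, cv=inner_cv, scoring='accuracy').mean()

best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
print(best)

Unlike an exhaustive grid, the TPE algorithm spends its evaluation budget on promising regions of the space, so you control the total cost directly via max_evals.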

answered by VnC


Dask-ML has scalable implementations of GridSearchCV and RandomizedSearchCV that are, I believe, drop-in replacements for the Scikit-Learn versions. They were developed alongside the Scikit-Learn developers.

  • https://ml.dask.org/hyper-parameter-search.html

They can be faster for two reasons:

  • They avoid repeating shared work between different stages of a Pipeline
  • They can scale out to a cluster anywhere you can deploy Dask (which is easy on most cluster infrastructure)
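
A minimal sketch of the swap (assuming dask-ml is installed; the grids and inner_cv are the question's own objects):

from dask_ml.model_selection import GridSearchCV as DaskGridSearchCV

# Same call shape as the scikit-learn version; Dask schedules the fits
models = []
models.append(('GBM', DaskGridSearchCV(gbm, param, cv=inner_cv)))
models.append(('RFC', DaskGridSearchCV(rfc, param_grid, cv=inner_cv)))
# ... and so on for the other models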
answered by MRocklin


There is an easy win in your situation, and that is... start using parallel processing :). Dask will help you if you have a cluster (it will work on a single machine, but the improvement compared to the default scheduling in sklearn is not significant), but if you plan to run it on a single machine (with several cores/threads and "enough" memory) then you can run nested CV in parallel. The only trick is that sklearn will not allow you to run the outer CV loop in multiple processes. However, it will allow you to run the inner loop in multiple threads.

At the moment you have n_jobs=None in the outer CV loop (that's the default in cross_val_score), which means n_jobs=1, and that's the only option you can use with sklearn in nested CV.

However, you can achieve an easy gain by setting n_jobs=some_reasonable_number in all the GridSearchCV objects that you use. some_reasonable_number does not have to be -1 (but -1 is a good starting point). Some algorithms either plateau at n_jobs=n_cores instead of n_threads (for example, xgboost) or already have built-in multi-processing (RandomForestClassifier, for example), and there may be clashes if you spawn too many processes.
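Concretely, the change is just the n_jobs argument in the question's own GridSearchCV calls (n_jobs=4 below is an assumed core count - use whatever matches your machine):

# Parallelize the inner grid search; keep the outer cross_val_score
# loop sequential (its n_jobs stays at the default)
models = []
models.append(('GBM', GridSearchCV(gbm, param, cv=inner_cv, iid=False, n_jobs=4)))
models.append(('RFC', GridSearchCV(rfc, param_grid, cv=inner_cv, iid=False, n_jobs=4)))
# ... and so on for LR, SVM, MLP and Keras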

answered by Mischa Lisovyi