Perform GridSearchCV with MLFlow

I just started using MLFlow and I am happy with what it can do. However, I cannot find a way to log the different runs of a GridSearchCV from scikit-learn.

For example, I can do this manually:

import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

params = ['l1', 'l2']
for param in params:
    with mlflow.start_run(experiment_id=1):
        # liblinear supports both the l1 and l2 penalties
        clf = LogisticRegression(penalty=param, solver='liblinear').fit(X_train, y_train)
        y_predictions = clf.predict(X_test)

        precision = precision_score(y_test, y_predictions)
        recall = recall_score(y_test, y_predictions)
        f1 = f1_score(y_test, y_predictions)

        mlflow.log_param("penalty", param)
        mlflow.log_metric("Precision", precision)
        mlflow.log_metric("Recall", recall)
        mlflow.log_metric("F1", f1)

        mlflow.sklearn.log_model(clf, "model")

But when I want to use GridSearchCV like this:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([('classifier', RandomForestClassifier())])

param_grid = [
    {'classifier': [LogisticRegression()],
     'classifier__penalty': ['l1', 'l2'],
     'classifier__C': np.logspace(-4, 4, 20),
     'classifier__solver': ['liblinear']},
    {'classifier': [RandomForestClassifier()],
     'classifier__n_estimators': list(range(10, 101, 10)),
     'classifier__max_features': list(range(6, 32, 5))}
]

clf = GridSearchCV(pipe, param_grid=param_grid, cv=5, verbose=True, n_jobs=-1)

best_clf = clf.fit(X_train, y_train)

I cannot think of any way to log all the individual models that the grid search tests. Is there any way to do it, or do I have to keep using the manual process?

Asked Apr 02 '20 by Tasos


1 Answer

I'd recommend hyperopt instead of scikit-learn's GridSearchCV. Hyperopt can search the space with Bayesian optimization using hyperopt.tpe.suggest. It will arrive at good parameters faster than a grid search, and you can cap the number of iterations regardless of the size of the space, so it's much better suited to large search spaces. Since you're interested in the artifacts from the individual runs, you may prefer hyperopt's random search (hyperopt.rand.suggest), which still lets you choose how many runs to perform.
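For reference, here is a rough sketch of how the question's param_grid could be expressed as a hyperopt search space; the variable names and exact ranges below are assumptions rather than something specified in the answer:

# Rough sketch of the question's param_grid as a hyperopt search space.
# hp.choice picks one branch per trial; hp.quniform returns floats, so cast
# them to int inside the objective function.
import numpy as np
from hyperopt import hp

space = hp.choice('classifier_type', [
    {
        'model': 'logistic_regression',
        'penalty': hp.choice('penalty', ['l1', 'l2']),
        'C': hp.loguniform('C', np.log(1e-4), np.log(1e4)),  # roughly np.logspace(-4, 4)
        'solver': 'liblinear',
    },
    {
        'model': 'random_forest',
        'n_estimators': hp.quniform('n_estimators', 10, 100, 10),  # roughly range(10, 101, 10)
        'max_features': hp.quniform('max_features', 6, 31, 5),     # roughly range(6, 32, 5)
    },
])

Passing this space to fmin with algo=tpe.suggest gives the Bayesian search; swapping in hyperopt.rand.suggest gives the plain random search mentioned above.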

You can parallelize the search very easily with Spark using hyperopt.SparkTrials (here's a more complete example). Note that you can keep using scikit-learn's cross-validation; just put it inside the objective function (you can even keep track of the variance of the cross-validation scores using loss_variance, as in the sketch below).
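To illustrate that last point, here is a minimal sketch of an objective that reports both the mean and the variance of the cross-validation scores back to hyperopt; build_classifier is a hypothetical helper and the F1 metric is just an example choice:

# Sketch: report the CV mean and variance to hyperopt via the dict return form.
# 'loss', 'loss_variance' and 'status' are part of hyperopt's objective contract.
from hyperopt import STATUS_OK
from sklearn.model_selection import cross_val_score

def cv_objective(params):
    classifier = build_classifier(params)  # hypothetical helper mapping params to an estimator
    scores = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
    return {
        'loss': -scores.mean(),        # negate because fmin minimizes
        'loss_variance': scores.var(),
        'status': STATUS_OK,
    }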

Now, to actually answer the question, I believe you can log the model, parameters, metrics, or whatever inside the objective function that you pass to hyperopt.fmin. MLFlow will store each run as a child of the main run, and each run can have its own artifacts.

So you want something like this:

from hyperopt import SparkTrials, fmin, hp, tpe
from sklearn.model_selection import cross_validate
import mlflow
import mlflow.sklearn

def objective(params):
    metrics = ...
    classifier = SomeClassifier(**params)
    cv = cross_validate(classifier, X_train, y_train, scoring=metrics)
    scores = {metric: cv[f'test_{metric}'] for metric in metrics}
    # log all the stuff here
    mlflow.log_metric('...', scores[...])
    mlflow.sklearn.log_model(classifier.fit(X_train, y_train), "model")
    return scores['some_loss'].mean()

space = hp.choice(...)
trials = SparkTrials(parallelism=...)
with mlflow.start_run() as run:
    best_result = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100, trials=trials)
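A possible follow-up step, sketched under the assumption that the space evaluates to a dict of parameters: fmin returns indices for hp.choice entries, so space_eval can be used to recover the concrete best configuration and log it on the parent run (inside the same with mlflow.start_run() block as above):

# Sketch: convert fmin's index-based result back into concrete parameter values
# and record them on the parent run. Assumes `space` and `best_result` from the
# snippet above and that the space evaluates to a flat dict of parameters.
from hyperopt import space_eval

best_params = space_eval(space, best_result)
mlflow.log_params({f'best_{key}': value for key, value in best_params.items()})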
Answered Oct 11 '22 by l_l_l_l_l_l_l_l