Perform GridSearchCV with MLFlow

I just started using MLFlow and I am happy with what it can do. However, I cannot find a way to log the different runs of a GridSearchCV from scikit-learn.

For example, I can do this manually:

import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

params = ['l1', 'l2']
for param in params:
    with mlflow.start_run(experiment_id=1):
        # liblinear supports both the l1 and l2 penalties
        clf = LogisticRegression(penalty=param, solver='liblinear').fit(X_train, y_train)
        y_predictions = clf.predict(X_test)

        precision = precision_score(y_test, y_predictions)
        recall = recall_score(y_test, y_predictions)
        f1 = f1_score(y_test, y_predictions)

        mlflow.log_param("penalty", param)
        mlflow.log_metric("Precision", precision)
        mlflow.log_metric("Recall", recall)
        mlflow.log_metric("F1", f1)

        mlflow.sklearn.log_model(clf, "model")

But when I want to use GridSearchCV like this:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([('classifier', RandomForestClassifier())])

param_grid = [
    {'classifier': [LogisticRegression()],
     'classifier__penalty': ['l1', 'l2'],
     'classifier__C': np.logspace(-4, 4, 20),
     'classifier__solver': ['liblinear']},
    {'classifier': [RandomForestClassifier()],
     'classifier__n_estimators': list(range(10, 101, 10)),
     'classifier__max_features': list(range(6, 32, 5))}
]

clf = GridSearchCV(pipe, param_grid=param_grid, cv=5, verbose=True, n_jobs=-1)

best_clf = clf.fit(X_train, y_train)

I cannot think of any way to log all the individual models that the grid search tests. Is there any way to do it, or do I have to keep using the manual process?

Asked Apr 02 '20 by Tasos


1 Answer

I'd recommend hyperopt instead of scikit-learn's GridSearchCV. Hyperopt can search the space with Bayesian optimization using hyperopt.tpe.suggest. It will arrive at good parameters faster than a grid search, and you can cap the number of iterations regardless of the size of the space, so it's much better suited to large search spaces. Since you're interested in the artifacts from the individual runs, you may prefer hyperopt's random search (hyperopt.rand.suggest), which still lets you choose how many runs to perform.
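For reference, here is a rough sketch of how the question's param_grid could be expressed as a hyperopt search space; the variable names and exact ranges below are assumptions rather than something specified in the answer:

# Rough sketch of the question's param_grid as a hyperopt search space.
# hp.choice picks one branch per trial; hp.quniform returns floats, so cast
# them to int inside the objective function.
import numpy as np
from hyperopt import hp

space = hp.choice('classifier_type', [
    {
        'model': 'logistic_regression',
        'penalty': hp.choice('penalty', ['l1', 'l2']),
        'C': hp.loguniform('C', np.log(1e-4), np.log(1e4)),  # roughly np.logspace(-4, 4)
        'solver': 'liblinear',
    },
    {
        'model': 'random_forest',
        'n_estimators': hp.quniform('n_estimators', 10, 100, 10),  # roughly range(10, 101, 10)
        'max_features': hp.quniform('max_features', 6, 31, 5),     # roughly range(6, 32, 5)
    },
])

Passing this space to fmin with algo=tpe.suggest gives the Bayesian search; swapping in hyperopt.rand.suggest gives the plain random search mentioned above.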

You can parallelize the search very easily with Spark using hyperopt.SparkTrials (here's a more complete example). Note that you can keep using scikit-learn's cross-validation; just put it inside the objective function (you can even keep track of the variance of the cross-validation scores using loss_variance, as in the sketch below).
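To illustrate that last point, here is a minimal sketch of an objective that reports both the mean and the variance of the cross-validation scores back to hyperopt; build_classifier is a hypothetical helper and the F1 metric is just an example choice:

# Sketch: report the CV mean and variance to hyperopt via the dict return form.
# 'loss', 'loss_variance' and 'status' are part of hyperopt's objective contract.
from hyperopt import STATUS_OK
from sklearn.model_selection import cross_val_score

def cv_objective(params):
    classifier = build_classifier(params)  # hypothetical helper mapping params to an estimator
    scores = cross_val_score(classifier, X_train, y_train, cv=5, scoring='f1')
    return {
        'loss': -scores.mean(),        # negate because fmin minimizes
        'loss_variance': scores.var(),
        'status': STATUS_OK,
    }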

Now, to actually answer the question, I believe you can log the model, parameters, metrics, or whatever inside the objective function that you pass to hyperopt.fmin. MLFlow will store each run as a child of the main run, and each run can have its own artifacts.

So you want something like this:

from hyperopt import SparkTrials, fmin, hp, tpe
from sklearn.model_selection import cross_validate
import mlflow
import mlflow.sklearn

def objective(params):
    metrics = ...
    classifier = SomeClassifier(**params)
    cv = cross_validate(classifier, X_train, y_train, scoring=metrics)
    scores = {metric: cv[f'test_{metric}'] for metric in metrics}
    # log all the stuff here
    mlflow.log_metric('...', scores[...])
    mlflow.sklearn.log_model(classifier.fit(X_train, y_train), "model")
    return scores['some_loss'].mean()

space = hp.choice(...)
trials = SparkTrials(parallelism=...)
with mlflow.start_run() as run:
    best_result = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100, trials=trials)
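A possible follow-up step, sketched under the assumption that the space evaluates to a dict of parameters: fmin returns indices for hp.choice entries, so space_eval can be used to recover the concrete best configuration and log it on the parent run (inside the same with mlflow.start_run() block as above):

# Sketch: convert fmin's index-based result back into concrete parameter values
# and record them on the parent run. Assumes `space` and `best_result` from the
# snippet above and that the space evaluates to a flat dict of parameters.
from hyperopt import space_eval

best_params = space_eval(space, best_result)
mlflow.log_params({f'best_{key}': value for key, value in best_params.items()})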
Answered Oct 11 '22 by l_l_l_l_l_l_l_l