I just started using MLflow and I am happy with what it can do. However, I cannot find a way to log the individual runs of a GridSearchCV from scikit-learn.
For example, I can do this manually:
params = ['l1', 'l2']

for param in params:
    with mlflow.start_run(experiment_id=1):
        clf = LogisticRegression(penalty=param).fit(X_train, y_train)
        y_predictions = clf.predict(X_test)

        precision = precision_score(y_test, y_predictions)
        recall = recall_score(y_test, y_predictions)
        f1 = f1_score(y_test, y_predictions)

        mlflow.log_param("penalty", param)
        mlflow.log_metric("Precision", precision)
        mlflow.log_metric("Recall", recall)
        mlflow.log_metric("F1", f1)
        mlflow.sklearn.log_model(clf, "model")
But when I want to use GridSearchCV, like this:
pipe = Pipeline([('classifier', RandomForestClassifier())])

param_grid = [
    {'classifier': [LogisticRegression()],
     'classifier__penalty': ['l1', 'l2'],
     'classifier__C': np.logspace(-4, 4, 20),
     'classifier__solver': ['liblinear']},
    {'classifier': [RandomForestClassifier()],
     'classifier__n_estimators': list(range(10, 101, 10)),
     'classifier__max_features': list(range(6, 32, 5))}
]

clf = GridSearchCV(pipe, param_grid=param_grid, cv=5, verbose=True, n_jobs=-1)
best_clf = clf.fit(X_train, y_train)
I cannot think of any way to log all the individual models that the grid search tests. Is there any way to do it, or do I have to keep using the manual process?
I'd recommend hyperopt instead of scikit-learn's GridSearchCV. Hyperopt can search the space with Bayesian optimization using hyperopt.tpe.suggest. It will arrive at good parameters faster than a grid search, and you can limit the number of iterations regardless of the size of the space, so it's definitely better for large spaces. Since you're interested in the artifacts from the individual runs, you may prefer hyperopt's random search, which still lets you choose how many runs to perform.
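Switching between the two is just the algo argument to fmin. A minimal sketch (objective and space here stand for the objective function and search space shown further down):

from hyperopt import fmin, tpe, rand, Trials

trials = Trials()
best = fmin(
    fn=objective,
    space=space,
    algo=tpe.suggest,   # Bayesian optimization (TPE); swap in rand.suggest for random search
    max_evals=50,       # cap on the number of runs, no matter how large the space is
    trials=trials,
)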
You can parallelize the search very easily with Spark using hyperopt.SparkTrials (here's a more complete example). Note that you can keep using scikit-learn's cross-validation; just put it inside the objective function (you can even keep track of the variance of the cross-validation scores by returning loss_variance).
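That dictionary-returning form might look roughly like this (a sketch, assuming the scores come from sklearn's cross_validate; SomeClassifier is a placeholder for whatever estimator you build from the sampled params):

from hyperopt import STATUS_OK
from sklearn.model_selection import cross_validate

def cv_objective(params):
    classifier = SomeClassifier(**params)  # placeholder estimator
    cv = cross_validate(classifier, X_train, y_train, scoring='f1', cv=5)
    losses = -cv['test_score']              # hyperopt minimizes, so negate the score
    return {
        'loss': losses.mean(),
        'loss_variance': losses.var(),       # tells hyperopt how noisy this estimate is
        'status': STATUS_OK,
    }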
Now, to actually answer the question: I believe you can log the model, parameters, metrics, or whatever else inside the objective function that you pass to hyperopt.fmin. MLflow will store each run as a child of the main run, and each run can have its own artifacts.
So you want something like this:
import mlflow
import mlflow.sklearn
from hyperopt import SparkTrials, fmin, hp, tpe
from sklearn.model_selection import cross_validate

def objective(params):
    metrics = ...
    classifier = SomeClassifier(**params)
    cv = cross_validate(classifier, X_train, y_train, scoring=metrics)
    scores = {metric: cv[f'test_{metric}'] for metric in metrics}
    # log all the stuff here
    mlflow.log_metric('...', scores[...])
    mlflow.sklearn.log_model(classifier.fit(X_train, y_train), "model")
    return scores['some_loss'].mean()

space = hp.choice(...)
trials = SparkTrials(parallelism=...)

with mlflow.start_run() as run:
    best_result = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100, trials=trials)
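For the grid in your question, the search space could be sketched roughly like this (illustrative only; the labels and the 'type' key are my own, and the objective would have to build the matching estimator and cast the quniform values to int):

import numpy as np
from hyperopt import hp

space = hp.choice('classifier', [
    {'type': 'logistic_regression',
     'penalty': hp.choice('penalty', ['l1', 'l2']),
     'C': hp.loguniform('C', np.log(1e-4), np.log(1e4)),
     'solver': 'liblinear'},
    {'type': 'random_forest',
     'n_estimators': hp.quniform('n_estimators', 10, 100, 10),   # cast to int in the objective
     'max_features': hp.quniform('max_features', 6, 31, 5)},     # cast to int in the objective
])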