How to do GridSearchCV for F1-score in a classification problem with scikit-learn?

I'm working on a multiclass classification problem with a neural network in scikit-learn, and I'm trying to figure out how to optimize my hyperparameters (number of layers, number of neurons per layer, and eventually other settings).

I found out that GridSearchCV is the way to do this, but the code I'm using reports the mean accuracy, while I actually want to evaluate the F1-score. Does anyone have an idea how I can edit this code to make it work with the F1-score?

At first, when I had to evaluate precision/accuracy, I thought it was 'enough' to look at the confusion matrix and draw a conclusion from it, changing the number of layers and neurons in my neural network by trial and error again and again.

Today I figured out that there's a better tool for this: GridSearchCV. I just need to figure out how to evaluate the F1-score, because I need to research how the accuracy of the neural network depends on the number of layers, nodes, and eventually other settings.

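from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Note: (1) is just the integer 1, not a tuple; write (1,) for a one-element tuple
parameter_space = {
    'hidden_layer_sizes': [(1), (2), (3)],
}

mlp = MLPClassifier(max_iter=600)
clf = GridSearchCV(mlp, parameter_space, n_jobs=-1, cv=3)
clf.fit(X_train, y_train.values.ravel())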

print('Best parameters found:\n', clf.best_params_)

means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))


Best parameters found:
 {'hidden_layer_sizes': 3}
0.842 (+/-0.089) for {'hidden_layer_sizes': 1}
0.882 (+/-0.031) for {'hidden_layer_sizes': 2}
0.922 (+/-0.059) for {'hidden_layer_sizes': 3}

So my output gives me the mean accuracy (which I found is the default for GridSearchCV with a classifier). How can I change this to return the mean F1-score instead of accuracy?

You can create your own scorer with make_scorer. In this case you can wrap sklearn's f1_score, but you can use your own metric function if you prefer:

from sklearn.metrics import f1_score, make_scorer

f1 = make_scorer(f1_score, average='macro')

Once you have made your scorer, you can pass it directly to GridSearchCV as the scoring parameter:

clf = GridSearchCV(mlp, parameter_space, n_jobs=-1, cv=3, scoring=f1)
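
For reference, here is a minimal end-to-end sketch of the same idea. It is not the asker's setup: the Iris dataset stands in for X_train and y_train, the grid values and random_state are my own assumptions, and n_jobs is left at its default to keep the run deterministic. It also runs a second search with the built-in scoring string 'f1_macro', which should give the same scores as the make_scorer version.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Assumption: Iris is only a stand-in for the asker's X_train / y_train
X, y = load_iris(return_X_y=True)

# Assumption: illustrative grid; each tuple is one hidden layer of that size
parameter_space = {'hidden_layer_sizes': [(3,), (5,)]}

# Fixed random_state so both searches train identical networks
mlp = MLPClassifier(max_iter=600, random_state=0)

# Custom scorer: macro-averaged F1 (unweighted mean of the per-class F1 scores)
f1 = make_scorer(f1_score, average='macro')

clf = GridSearchCV(mlp, parameter_space, cv=3, scoring=f1)
clf.fit(X, y)

# Same search, using scikit-learn's predefined scoring string instead
clf_str = GridSearchCV(mlp, parameter_space, cv=3, scoring='f1_macro')
clf_str.fit(X, y)

print('Best parameters found:', clf.best_params_)
print('Best mean macro-F1: %0.3f' % clf.best_score_)

# mean_test_score now holds the mean F1 per parameter set, not the accuracy
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
```

With this change, mean_test_score and best_score_ report the cross-validated macro-F1, and best_params_ is the setting that maximizes it.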

Note that I've used average='macro' as the multiclass averaging parameter for f1_score. This calculates the metric for each label and then takes their unweighted mean. There are other options for computing F1 with multiple labels, such as 'micro' and 'weighted'; they are described in the scikit-learn documentation for f1_score.
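
To make the difference between the averaging options concrete, here is a small sketch on hand-made labels (the y_true and y_pred values are invented for illustration and are not from the question):

```python
from sklearn.metrics import f1_score

# Hypothetical labels: class 0 has 4 samples, classes 1 and 2 have 2 each
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

# average=None returns one F1 score per class: 0.75, 0.8 and about 0.667
per_class = f1_score(y_true, y_pred, average=None)

# 'macro': unweighted mean of the per-class scores, about 0.739
macro = f1_score(y_true, y_pred, average='macro')

# 'micro': computed from global TP/FP/FN counts; for single-label
# multiclass data this equals the accuracy, here 6/8 = 0.75
micro = f1_score(y_true, y_pred, average='micro')

# 'weighted': per-class scores weighted by class support, about 0.742
weighted = f1_score(y_true, y_pred, average='weighted')

print(per_class, macro, micro, weighted)
```

Macro averaging treats every class equally regardless of its size, so it is a common choice when minority classes matter; weighted averaging lets the larger classes dominate the score.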


