
CatBoost and GridSearch

eval_dataset = Pool(val_data, val_labels)

# depth was set to 8 in one run and 10 in the other
model = CatBoostClassifier(depth=8, iterations=10, task_type="GPU", devices='0-2', eval_metric='Accuracy', boosting_type="Ordered", bagging_temperature=0, use_best_model=True)

model.fit(train_data, y=label_data, eval_set=eval_dataset)

When I run the code above in two separate runs (depth set to 8 and then 10), I get the following results:

Depth 10: 0.6864865
Depth 8: 0.6756757

I would like to set up and run GridSearch so that it runs exactly the same combinations and produces exactly the same results as when I run the code manually.

GridSearch code:

model = CatBoostClassifier(iterations=10, task_type="GPU", devices='0-2', eval_metric='Accuracy', boosting_type="Ordered", depth=10, bagging_temperature=0, use_best_model=True)

grid = {'depth': [8,10]}
grid_search_result = GridSearchCV(model, grid, cv=2)
results = grid_search_result.fit(train_data, y=label_data, eval_set=eval_dataset) 

Issues:

  1. I would like GridSearch to use my "eval_set" to compare/validate all the different runs (like when I run them manually). Instead it uses something else that I don't understand, and it doesn't seem to look at "eval_set" at all.

  2. It produces not just 2 results: depending on the "cv" (cross-validation splitting strategy) parameter it runs 3, 5, 7, 9 or 11 fits. I don't want that.

  3. I went through the entire "results" object in the debugger, but I simply can't find the validation "Accuracy" scores for the best run or for any of the other runs. I can find a lot of other values, but none of them matches what I'm looking for, and the numbers don't match the numbers the "eval_set" dataset produces.
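(For reference, GridSearchCV does expose its scores, but they are mean scores over the CV folds of train_data rather than scores on eval_set, which is presumably why the numbers don't match. A minimal sketch, where "results" is the fitted object from the GridSearch snippet above:)

# Mean cross-validated score for every parameter combination tried:
for params, score in zip(results.cv_results_['params'],
                         results.cv_results_['mean_test_score']):
    print(params, score)

# Best combination and its (cross-validated, not eval_set) score:
print(results.best_params_, results.best_score_)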

I solved my issue by implementing my own simple GridSearch (in case it can help / inspire others :-) ). Please let me know if you have any comments on the code :-)

import pandas as pd
from catboost import CatBoostClassifier, Pool
import csv
from datetime import datetime

# Initialize data

train_data = pd.read_csv('./train_x.csv')
label_data = pd.read_csv('./labels_train_x.csv')
val_data = pd.read_csv('./val_x.csv')
val_labels = pd.read_csv('./labels_val_x.csv')

eval_dataset = Pool(val_data, val_labels)

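# Candidate values for each hyperparameter (None leaves CatBoost's default in place and is logged as 'Auto' in the CSV)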
ite = [1000,2000]
depth = [6,7,8,9,10]
max_bin = [None,32,46,100,254]
l2_leaf_reg = [None,2,10,20,30]
bagging_temperature = [None,0,0.5,1]
random_strength = [None,1,5,10]
total_runs = len(ite) * len(depth) * len(max_bin) * len(l2_leaf_reg) * len(bagging_temperature) * len(random_strength)

print('Total runs: ' + str(total_runs))

counter = 0

file_name = './Results/Catboost_' + str(datetime.now().strftime("%d_%m_%Y_%H_%M_%S")) + '.csv'

row = ['Validation Accuracy','Logloss','Iterations', 'Depth', 'Max_bin', 'L2_leaf_reg', 'Bagging_temperature', 'Random_strength']
with open(file_name, 'a') as csvFile:
    writer = csv.writer(csvFile)
    writer.writerow(row)

for a in ite:
    for b in depth:
        for c in max_bin:
            for d in l2_leaf_reg:
                for e in bagging_temperature:
                    for f in random_strength:
                        model = CatBoostClassifier(task_type="GPU", devices='0-2', eval_metric='Accuracy', boosting_type="Ordered", use_best_model=True,
                        iterations=a, depth=b, max_bin=c, l2_leaf_reg=d, bagging_temperature=e, random_strength=f)
                        counter += 1
                        print('Run # ' + str(counter) + '/' + str(total_runs))
                        result = model.fit(train_data, y=label_data, eval_set=eval_dataset, verbose=1)

                        accuracy = float(result.best_score_['validation']['Accuracy'])
                        logLoss = result.best_score_['validation']['Logloss']

                        row = [ accuracy, logLoss,
                                ('Auto' if a == None else a),
                                ('Auto' if b == None else b),
                                ('Auto' if c == None else c),
                                ('Auto' if d == None else d),
                                ('Auto' if e == None else e),
                                ('Auto' if f == None else f)]

                        with open(file_name, 'a') as csvFile:
                            writer = csv.writer(csvFile)
                            writer.writerow(row)


1 Answer

The eval set in CatBoost acts as a holdout set.

In GridSearchCV, the cross-validation is performed on your train_data.

One solution would be to merge your train_data and eval_dataset and pass the train and eval indices to GridSearchCV through its cv parameter, which accepts an iterable yielding (train_indices, test_indices) pairs. Then you will have just one split, and the Accuracy numbers will match your manual runs.
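A minimal sketch of that idea, reusing the variable names from the question (train_data, label_data, val_data and val_labels are taken from the code above); PredefinedSplit is one convenient way to produce the single train/eval split:

import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Stack the training rows and the former eval rows into one dataset.
X = pd.concat([train_data, val_data], ignore_index=True)
y = pd.concat([label_data, val_labels], ignore_index=True).iloc[:, 0]

# -1 keeps a row in the training side of every split; 0 puts it in the single
# test fold, so the eval rows act as one fixed holdout.
test_fold = np.concatenate([np.full(len(train_data), -1),
                            np.full(len(val_data), 0)])
cv = PredefinedSplit(test_fold)

# use_best_model is dropped here because GridSearchCV does not pass an
# eval_set to fit(), and CatBoost requires one when use_best_model=True.
model = CatBoostClassifier(iterations=10, task_type="GPU", devices='0-2',
                           eval_metric='Accuracy', boosting_type="Ordered",
                           bagging_temperature=0)

grid = {'depth': [8, 10]}
grid_search = GridSearchCV(model, grid, cv=cv, refit=False)
grid_search.fit(X, y)

# Exactly one accuracy score per depth value, computed on the eval rows.
print(grid_search.cv_results_['mean_test_score'])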
