 

Using GridSearchCV best_params_ gives poor results

I'm trying to tune hyperparameters for KNN on a fairly small dataset (the Kaggle Leaf dataset, which has around 990 rows):

def knnTuning(self, x_train, t_train):
    # Grid search over KNN hyperparameters (default 5-fold CV)
    params = {
        'n_neighbors': [1, 2, 3, 4, 5, 7, 9],
        'weights': ['uniform', 'distance'],
        'leaf_size': [5, 10, 15, 20]
    }
    grid = GridSearchCV(KNeighborsClassifier(), params)
    grid.fit(x_train, t_train)

    print(grid.best_params_)
    print(grid.best_score_)

    return knn.KNN(neighbors=grid.best_params_["n_neighbors"],
                   weight=grid.best_params_["weights"],
                   leafSize=grid.best_params_["leaf_size"])

Prints:
{'leaf_size': 5, 'n_neighbors': 1, 'weights': 'uniform'}
0.9119999999999999

And I return this classifier, wrapped in my own class:

class KNN:

    def __init__(self, neighbors=1, weight='uniform', leafSize=10):
        self.clf = KNeighborsClassifier(n_neighbors=neighbors,
                                        weights=weight, leaf_size=leafSize)

    def train(self, X, t):
        self.clf.fit(X, t)

    def predict(self, x):
        return self.clf.predict(x)

    def global_accuracy(self, X, t):
        predicted = self.predict(X)
        accuracy = (predicted == t).mean()
        return accuracy

I run this several times, using 700 rows for training and 200 for validation, chosen by random permutation.

I then get a global accuracy ranging from 0.01 (often) to 0.4 (rarely).

I know I'm not comparing exactly the same metric, but I still can't understand the huge gap between these results.

Timothee W asked Dec 05 '25 10:12


1 Answer

I'm not sure how you trained your model or how the preprocessing was done. The Leaf dataset has about 100 labels (species), so you have to take care when splitting into train and test to ensure an even spread of samples across classes. One reason for the weird accuracy could be that your samples are split unevenly.
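To check whether an uneven split is the culprit, you can count samples per class on each side of the split. A minimal sketch, using synthetic labels as a stand-in for the species column (990 samples, 99 species, 10 each, roughly like the Leaf training set):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Leaf labels: 99 classes, 10 samples each
y = np.repeat(np.arange(99), 10)

# The random-permutation split described in the question
perm = rng.permutation(len(y))
train_idx, val_idx = perm[:700], perm[700:900]

# With a purely random permutation, some classes can end up
# under-represented (or even absent) in the training set
train_counts = np.bincount(y[train_idx], minlength=99)
print("classes with < 5 training samples:", (train_counts < 5).sum())
print("classes missing from training set:", (train_counts == 0).sum())
```

A stratified split (as below) keeps these per-class counts balanced by construction.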

You also need to scale your features, since KNN is distance-based:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

df = pd.read_csv("https://raw.githubusercontent.com/WenjinTao/Leaf-Classification--Kaggle/master/train.csv")

le = LabelEncoder()
scaler = StandardScaler()
X = df.drop(['id', 'species'], axis=1)
X = scaler.fit_transform(X)          # X is now a numpy array
y = le.fit_transform(df['species'])

# Stratify the split so every species appears in both train and test
strat = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0).split(X, y)
train_idx, test_idx = next(strat)
x_train, y_train = X[train_idx], y[train_idx]
x_test, y_test = X[test_idx], y[test_idx]
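To see why scaling matters so much for KNN, here is a small synthetic sketch (not the Leaf data): one informative feature on a small scale plus one useless feature on a huge scale. Unscaled, the noisy feature dominates the distance computation; after standardizing, the informative feature is heard again:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600

# Informative feature on a small scale, pure-noise feature on a huge scale
informative = rng.normal(size=n)
noise = rng.normal(scale=1000.0, size=n)
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)   # label depends only on the small feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Without scaling, distances are dominated by the noise feature
raw = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).score(X_te, y_te)

# With standardization, both features contribute on the same scale
scaler = StandardScaler().fit(X_tr)
scaled = (KNeighborsClassifier(n_neighbors=5)
          .fit(scaler.transform(X_tr), y_tr)
          .score(scaler.transform(X_te), y_te))

print(f"unscaled accuracy: {raw:.2f}, scaled accuracy: {scaled:.2f}")
```

The unscaled model is close to random guessing, while the scaled one recovers most of the signal.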

Now we do the training. I would be careful about including n_neighbors = 1, since a single neighbor tends to overfit:

params = {
    'n_neighbors': [2, 3, 4],
    'weights': ['uniform', 'distance'],
    'leaf_size': [5, 10, 15, 20]
}

sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
grid = GridSearchCV(KNeighborsClassifier(), params, cv=sss)
grid.fit(x_train, y_train)

print(grid.best_params_)
print(grid.best_score_)

{'leaf_size': 5, 'n_neighbors': 2, 'weights': 'distance'}
0.9676258992805755
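Rather than looking only at the best point, you can also inspect grid.cv_results_ to see how the score varies across the whole grid. A self-contained sketch on synthetic data (not the Leaf dataset):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, random_state=0)

params = {'n_neighbors': [2, 3, 4], 'weights': ['uniform', 'distance']}
grid = GridSearchCV(KNeighborsClassifier(), params, cv=5).fit(X, y)

# cv_results_ holds mean/std test scores for every parameter combination
results = (pd.DataFrame(grid.cv_results_)
           [['param_n_neighbors', 'param_weights',
             'mean_test_score', 'std_test_score']]
           .sort_values('mean_test_score', ascending=False))
print(results.to_string(index=False))
```

If the best score barely beats its neighbors in the grid (relative to std_test_score), the exact choice of best_params_ matters less than it seems.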

Then you can check on your test set:

pred = grid.predict(x_test)
(y_test == pred).mean()

0.9831649831649831
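Note that calling grid.predict works because, with the default refit=True, GridSearchCV refits the best estimator on all of the training data after the search; there is no need to build a fresh classifier from best_params_ by hand, as the question's knnTuning does. A small sketch on synthetic data showing the two routes are equivalent:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5, 7]}, cv=5)
grid.fit(X_tr, y_tr)

# grid.predict delegates to the refit best estimator...
auto_pred = grid.predict(X_te)

# ...which is equivalent to manually refitting with best_params_
manual = KNeighborsClassifier(**grid.best_params_).fit(X_tr, y_tr)
manual_pred = manual.predict(X_te)

print("identical predictions:", np.array_equal(auto_pred, manual_pred))
```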
StupidWolf answered Dec 06 '25 22:12


