
Why xgboost.cv and sklearn.cross_val_score give different results?

I'm trying to make a classifier on a data set. I first used XGBoost:

import xgboost as xgb
import pandas as pd
import numpy as np

train = pd.read_csv("train_users_processed_onehot.csv")
labels = train["Buy"].map({"Y":1, "N":0})

features = train.drop("Buy", axis=1)
data_dmat = xgb.DMatrix(data=features, label=labels)

params={"max_depth":5, "min_child_weight":2, "eta": 0.1, "subsamples":0.9, "colsample_bytree":0.8, "objective" : "binary:logistic", "eval_metric": "logloss"}
rounds = 180

result = xgb.cv(params=params, dtrain=data_dmat, num_boost_round=rounds, early_stopping_rounds=50, as_pandas=True, seed=23333)
print(result)

And the result is:

        test-logloss-mean  test-logloss-std  train-logloss-mean
0                0.683539          0.000141            0.683407
...
179              0.622302          0.001504            0.606452

We can see that the test log loss is around 0.622.

But when I switch to sklearn using exactly the same parameters (I think), the result is quite different. Below is my code:

from sklearn.model_selection import cross_val_score
from xgboost.sklearn import XGBClassifier
import pandas as pd

train_dataframe = pd.read_csv("train_users_processed_onehot.csv")
train_labels = train_dataframe["Buy"].map({"Y":1, "N":0})
train_features = train_dataframe.drop("Buy", axis=1)

estimator = XGBClassifier(learning_rate=0.1, n_estimators=190, max_depth=5, min_child_weight=2, objective="binary:logistic", subsample=0.9, colsample_bytree=0.8, seed=23333)
print(cross_val_score(estimator, X=train_features, y=train_labels, scoring="neg_log_loss"))

and the result is: [-4.11429976 -2.08675843 -3.27346662]; even after negating the sign, it is still far from 0.622.

I set a breakpoint inside cross_val_score and saw that the classifier was making wild predictions, predicting every tuple in the test set to be negative with about 0.99 probability.
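A less intrusive way to see the same behaviour is to ask for the out-of-fold predicted probabilities directly with cross_val_predict; the snippet below is only a diagnostic sketch, reusing the estimator, train_features and train_labels defined above:

from sklearn.model_selection import cross_val_predict

# Out-of-fold predicted probabilities: one row per sample, one column per class
proba = cross_val_predict(estimator, train_features, train_labels, method="predict_proba")
print(proba[:10])  # near-constant ~0.99 in the negative column signals degenerate fits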

I'm wondering where I have gone wrong. Could someone help me?

asked Dec 14 '16 by DarkZero

People also ask

What does XGBoost CV return?

XGBoost has a very useful function called "cv" which performs cross-validation at each boosting iteration and thus returns the optimum number of trees required. You can then tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree) for the chosen learning rate and number of trees.
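As a minimal sketch of this (synthetic data and arbitrary parameters, purely illustrative, not the asker's dataset):

import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X = rng.rand(500, 10)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "eval_metric": "logloss", "max_depth": 3}

# One row per retained boosting round; with early stopping, the number of rows
# is the optimum number of trees found by cross-validation.
cv_results = xgb.cv(params, dtrain, num_boost_round=200, nfold=5,
                    early_stopping_rounds=20, seed=0, as_pandas=True)
print(len(cv_results))
print(cv_results["test-logloss-mean"].min())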

What does sklearn's cross_val_score do?

The cross_val_score() function will be used to perform the evaluation, taking the dataset and cross-validation configuration and returning a list of scores calculated for each fold.

What does cross_val_score return?

cross_val_score splits the data into, say, 5 folds. For each fold it fits the model on the other 4 folds and scores the held-out fold. It then gives you the 5 scores, from which you can calculate a mean and variance.

Does XGBoost require cross-validation?

The primary reasons to use this algorithm are its accuracy, efficiency and feasibility. It supports both linear models and tree learning algorithms, and does parallel computation on a single machine. It also has built-in support for cross-validation and for computing feature importance.


1 Answer

This question is a bit old, but I ran into the problem today and figured out why the results given by xgboost.cv and sklearn.model_selection.cross_val_score are quite different.

By default, cross_val_score uses KFold or StratifiedKFold, whose shuffle argument is False, so the folds are not drawn randomly from the data.

So if you do this, then you should get the same results:

from sklearn.model_selection import StratifiedKFold

cross_val_score(estimator, X=train_features, y=train_labels, scoring="neg_log_loss",
                cv=StratifiedKFold(shuffle=True, random_state=23333))

Keep the random_state in StratifiedKFold and the seed in xgboost.cv the same to get exactly reproducible results.
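If you want both APIs to score literally identical splits, xgb.cv also accepts a folds argument that takes a scikit-learn splitter (at least in the versions I have used). A sketch reusing the variables from the question (params, data_dmat, estimator, train_features, train_labels):

from sklearn.model_selection import StratifiedKFold

# One splitter object shared by both evaluations, so the folds are the same
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=23333)

# xgboost side: hand the splitter to xgb.cv via `folds`
result = xgb.cv(params=params, dtrain=data_dmat, num_boost_round=180,
                folds=skf, early_stopping_rounds=50, as_pandas=True, seed=23333)

# sklearn side: hand the same splitter to cross_val_score via `cv`
scores = cross_val_score(estimator, X=train_features, y=train_labels,
                         scoring="neg_log_loss", cv=skf)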

answered Sep 24 '22 by user8101320