I'm trying to build a classifier on a data set. I first used XGBoost:
import xgboost as xgb
import pandas as pd
import numpy as np
train = pd.read_csv("train_users_processed_onehot.csv")
labels = train["Buy"].map({"Y":1, "N":0})
features = train.drop("Buy", axis=1)
data_dmat = xgb.DMatrix(data=features, label=labels)
params = {"max_depth": 5, "min_child_weight": 2, "eta": 0.1, "subsample": 0.9, "colsample_bytree": 0.8, "objective": "binary:logistic", "eval_metric": "logloss"}
rounds = 180
result = xgb.cv(params=params, dtrain=data_dmat, num_boost_round=rounds, early_stopping_rounds=50, as_pandas=True, seed=23333)
print(result)
And the result is:
test-logloss-mean test-logloss-std train-logloss-mean
0 0.683539 0.000141 0.683407
179 0.622302 0.001504 0.606452
We can see the test logloss is around 0.622.
But when I switch to sklearn, using (what I think are) exactly the same parameters, the result is quite different. Below is my code:
from sklearn.model_selection import cross_val_score
from xgboost.sklearn import XGBClassifier
import pandas as pd
train_dataframe = pd.read_csv("train_users_processed_onehot.csv")
train_labels = train_dataframe["Buy"].map({"Y":1, "N":0})
train_features = train_dataframe.drop("Buy", axis=1)
estimator = XGBClassifier(learning_rate=0.1, n_estimators=190, max_depth=5, min_child_weight=2, objective="binary:logistic", subsample=0.9, colsample_bytree=0.8, seed=23333)
print(cross_val_score(estimator, X=train_features, y=train_labels, scoring="neg_log_loss"))
and the result is: [-4.11429976 -2.08675843 -3.27346662], which even after flipping the sign is still far from 0.622.
I set a breakpoint inside cross_val_score and saw that the classifier is making crazy predictions: it predicts every tuple in the test set to be negative with about 0.99 probability. I'm wondering where I have gone wrong. Could someone help me?
XGBoost has a very useful function called cv, which performs cross-validation at each boosting iteration and thus returns the optimum number of trees required. With the learning rate and number of trees decided, you can then tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree).
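For example, since xgb.cv truncates its output at the best iteration when early_stopping_rounds is set, the optimum number of rounds can be read off the length of the returned DataFrame. This is only a rough sketch, reusing data_dmat and params from the question above:

# Sketch: assumes data_dmat and params are defined as in the question above
cv_results = xgb.cv(params=params, dtrain=data_dmat, num_boost_round=1000,
                    early_stopping_rounds=50, as_pandas=True, seed=23333)
best_num_rounds = len(cv_results)                             # rows kept = best iteration + 1
best_test_logloss = cv_results["test-logloss-mean"].iloc[-1]  # logloss at the best iteration
print(best_num_rounds, best_test_logloss)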
The cross_val_score() function performs the evaluation: it takes the dataset and a cross-validation configuration and returns a list of scores, one per fold. Concretely, cross_val_score splits the data into, say, 5 folds; for each fold it fits the estimator on the other 4 folds and scores the held-out fold, giving you 5 scores from which you can compute a mean and variance.
The main reasons to use XGBoost are its accuracy, efficiency and feasibility. It supports both linear models and tree learning algorithms, does parallel computation on a single machine, and has extra features for doing cross-validation and computing feature importance.
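As a quick illustration of that last point, per-feature importance can be read from a trained booster. This is a sketch reusing params, data_dmat and rounds from the question above:

# Sketch: train a booster and inspect gain-based feature importance
bst = xgb.train(params, data_dmat, num_boost_round=rounds)
importance = bst.get_score(importance_type="gain")   # dict: feature name -> average gain
print(sorted(importance.items(), key=lambda kv: kv[1], reverse=True)[:10])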
This question is a bit old, but I ran into the problem today and figured out why the results given by xgboost.cv and sklearn.model_selection.cross_val_score are quite different.
By default, cross_val_score uses KFold or StratifiedKFold, whose shuffle argument is False, so the folds are not pulled randomly from the data.
So if you do this, then you should get the same results:
from sklearn.model_selection import StratifiedKFold

cross_val_score(estimator, X=train_features, y=train_labels, scoring="neg_log_loss",
                cv=StratifiedKFold(shuffle=True, random_state=23333))
Keep the random_state in StratifiedKFold and the seed in xgboost.cv the same to get exactly reproducible results.
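If you want to go a step further and force both evaluations onto identical splits, recent versions of xgboost.cv accept a folds argument that can take a KFold/StratifiedKFold instance. A sketch, assuming the variables defined in the question:

from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=23333)
# same splitter for the native API ...
native = xgb.cv(params=params, dtrain=data_dmat, num_boost_round=rounds,
                folds=skf, as_pandas=True, seed=23333)
# ... and for the sklearn wrapper
wrapped = cross_val_score(estimator, X=train_features, y=train_labels,
                          scoring="neg_log_loss", cv=skf)
print(native["test-logloss-mean"].iloc[-1], -wrapped.mean())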