Recently, I am doing multiple experiments to compare Python XgBoost and LightGBM. It seems that this LightGBM is a new algorithm that people say it works better than XGBoost in both speed and accuracy.
This is LightGBM GitHub. This is LightGBM python API documents, here you will find python functions you can call. It can be directly called from LightGBM model and also can be called by LightGBM scikit-learn.
This is the XGBoost Python API I use. As you can see, it has very similar data structure as LightGBM python API above.
Here are what I tried:
train()
method in both XGBoost and LightGBM, yes lightGBM works faster and has higher accuracy. But this method, doesn't have cross validation.cv()
method in both algorithms, it is for cross validation. However, I didn't find a way to use it return a set of optimum parameters.GridSearchCV()
with LGBMClassifier and XGBClassifer. It works for XGBClassifer, but for LGBClassifier, it is running forever.Here are my code examples when using GridSearchCV()
with both classifiers:
XGBClassifier with GridSearchCV
param_set = {
'n_estimators':[50, 100, 500, 1000]
}
gsearch = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1,
n_estimators=100, max_depth=5,
min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
nthread=7,
objective= 'binary:logistic', scale_pos_weight=1, seed=410),
param_grid = param_set, scoring='roc_auc',n_jobs=7,iid=False, cv=10)
xgb_model2 = gsearch.fit(features_train, label_train)
xgb_model2.grid_scores_, xgb_model2.best_params_, xgb_model2.best_score_
This works very well for XGBoost, and only tool a few seconds.
LightGBM with GridSearchCV
param_set = {
'n_estimators':[20, 50]
}
gsearch = GridSearchCV(estimator = LGBMClassifier( boosting_type='gbdt', num_leaves=30, max_depth=5, learning_rate=0.1, n_estimators=50, max_bin=225,
subsample_for_bin=0.8, objective=None, min_split_gain=0,
min_child_weight=5,
min_child_samples=10, subsample=1, subsample_freq=1,
colsample_bytree=1,
reg_alpha=1, reg_lambda=0, seed=410, nthread=7, silent=True),
param_grid = param_set, scoring='roc_auc',n_jobs=7,iid=False, cv=10)
lgb_model2 = gsearch.fit(features_train, label_train)
lgb_model2.grid_scores_, lgb_model2.best_params_, lgb_model2.best_score_
However, by using this method for LightGBM, it has been running the whole morning today still nothing generated.
I am using the same dataset, a dataset contains 30000 records.
I have 2 questions:
cv()
method, is there anyway to tune optimum set of parameters?GridSearchCV()
does not work well with LightGBM? I'm wondering whether this only happens on me all it happened on others to?Observing the above time numbers, for parameter grid having 3125 combinations, the Grid Search CV took 10856 seconds (~3 hrs) whereas Halving Grid Search CV took 465 seconds (~8 mins), which is approximate 23x times faster.
GridSearchCV is a library function that is a member of sklearn's model_selection package. It helps to loop through predefined hyperparameters and fit your estimator (model) on your training set. So, in the end, you can select the best parameters from the listed hyperparameters.
Try to use n_jobs = 1
and see if it works.
In general, if you use n_jobs = -1
or n_jobs > 1
then you should protect your script by using if __name__=='__main__':
:
Simple Example:
import ...
if __name__=='__main__':
data= pd.read_csv('Prior Decompo2.csv', header=None)
X, y = data.iloc[0:, 0:26].values, data.iloc[0:,26].values
param_grid = {'C' : [0.01, 0.1, 1, 10], 'kernel': ('rbf', 'linear')}
classifier = SVC()
grid_search = GridSearchCV(estimator=classifier, param_grid=param_grid, scoring='accuracy', n_jobs=-1, verbose=42)
grid_search.fit(X,y)
Finally, can you try to run your code using n_jobs = -1
and including if __name__=='__main__':
as I explained and see if it works?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With