I am trying to run my LightGBM for feature selection as below.

Initialization:
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Initialize an empty array to hold feature importances
feature_importances = np.zeros(features_sample.shape[1])
# Create the model with several hyperparameters
model = lgb.LGBMClassifier(objective='binary',
                           boosting_type='goss',
                           n_estimators=10000, class_weight='balanced')
Then I fit the model as below:
# Fit the model twice to avoid overfitting
for i in range(2):
    # Split into training and validation set
    train_features, valid_features, train_y, valid_y = train_test_split(
        train_X, train_Y, test_size=0.25, random_state=i)
    # Train using early stopping
    model.fit(train_features, train_y, early_stopping_rounds=100,
              eval_set=[(valid_features, valid_y)],
              eval_metric='auc', verbose=200)
    # Record the feature importances
    feature_importances += model.feature_importances_
but I get the error below:
Training until validation scores don't improve for 100 rounds.
Early stopping, best iteration is: [6] valid_0's auc: 0.88648
ValueError: operands could not be broadcast together with shapes (87,) (83,) (87,)
We use StratifiedKFold to split the dataset into 5 folds, select one fold as the validation set, and train the model with early stopping on the remaining 4 folds. We then use this model to predict outcomes for the test set and record the predictions. This is repeated 5 times, so that every fold serves as the validation set exactly once.
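A minimal sketch of that scheme, reusing the names from the question (train_X, train_Y) plus an assumed test set test_X, all taken to be pandas objects, with the same fit arguments as above:

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold

n_folds = 5
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
test_preds = np.zeros(len(test_X))                 # averaged test-set predictions
feature_importances = np.zeros(train_X.shape[1])   # sized from the matrix we actually train on

for train_idx, valid_idx in skf.split(train_X, train_Y):
    model = lgb.LGBMClassifier(objective='binary', boosting_type='goss',
                               n_estimators=10000, class_weight='balanced')
    # Train on 4 folds, stopping early on the held-out fold
    model.fit(train_X.iloc[train_idx], train_Y.iloc[train_idx],
              eval_set=[(train_X.iloc[valid_idx], train_Y.iloc[valid_idx])],
              eval_metric='auc', early_stopping_rounds=100, verbose=200)
    # Record test predictions and feature importances, averaged over folds
    test_preds += model.predict_proba(test_X)[:, 1] / n_folds
    feature_importances += model.feature_importances_ / n_folds

Note that feature_importances is sized from train_X itself; the ValueError in the question most likely comes from sizing it with features_sample.shape[1] (87 columns) while the fitted model only saw 83 columns.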
Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated as the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature.
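In LightGBM the closest built-in notion is the gain-based importance, which sums the gain (impurity decrease) of every split in which a feature is used. A short sketch with the sklearn wrapper, reusing the assumed train_X/train_Y from above:

import lightgbm as lgb

# importance_type='gain' reports total split gain per feature
# instead of the default split count
model = lgb.LGBMClassifier(objective='binary', importance_type='gain')
model.fit(train_X, train_Y)
gain_importance = model.feature_importances_   # one value per column of train_X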
The motivation behind LightGBM's splits is information gain, which is basically the difference between the entropy before and after the split. Entropy is a measure of uncertainty or randomness: the more randomness a variable has, the higher its entropy.
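A small hand-rolled illustration of these two quantities, purely for intuition (not part of the LightGBM API):

import numpy as np

def entropy(labels):
    # Shannon entropy of a 1-D array of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Entropy before the split minus the weighted entropy of the two children
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 1, 1, 1, 1, 1])
left, right = parent[:3], parent[3:]            # a perfect split into pure children
print(information_gain(parent, left, right))    # ~0.954, i.e. all of the parent's entropy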
An example of getting feature importance in LightGBM when using the train model (i.e. the Booster returned by lgb.train):
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

def plotImp(model, X, num=20, fig_size=(40, 20)):
    # Pair each feature name with the Booster's importance value
    feature_imp = pd.DataFrame({'Value': model.feature_importance(), 'Feature': X.columns})
    plt.figure(figsize=fig_size)
    sns.set(font_scale=5)
    sns.barplot(x="Value", y="Feature",
                data=feature_imp.sort_values(by="Value", ascending=False)[0:num])
    plt.title('LightGBM Features (avg over folds)')
    plt.tight_layout()
    plt.savefig('lgbm_importances-01.png')
    plt.show()
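A usage sketch, assuming a Booster trained with lgb.train and a pandas DataFrame X_train whose columns hold the feature names (X_train and y_train are placeholder names, not from the question):

import lightgbm as lgb

train_set = lgb.Dataset(X_train, label=y_train)
booster = lgb.train({'objective': 'binary'}, train_set, num_boost_round=100)
plotImp(booster, X_train, num=20)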