
Feature importance using lightgbm

I am trying to run LightGBM for feature selection as below.

Initialization:

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Initialize an empty array to hold feature importances
feature_importances = np.zeros(features_sample.shape[1])

# Create the model with several hyperparameters
model = lgb.LGBMClassifier(objective='binary',
                           boosting_type='goss',
                           n_estimators=10000,
                           class_weight='balanced')

Then I fit the model as below:

# Fit the model twice to avoid overfitting
for i in range(2):

    # Split into training and validation sets
    train_features, valid_features, train_y, valid_y = train_test_split(
        train_X, train_Y, test_size=0.25, random_state=i)

    # Train using early stopping
    model.fit(train_features, train_y, early_stopping_rounds=100,
              eval_set=[(valid_features, valid_y)],
              eval_metric='auc', verbose=200)

    # Record the feature importances
    feature_importances += model.feature_importances_

but I get the error below:

Training until validation scores don't improve for 100 rounds. 
Early stopping, best iteration is: [6]  valid_0's auc: 0.88648
ValueError: operands could not be broadcast together with shapes (87,) (83,) (87,) 
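
The shapes in the traceback suggest that the importance accumulator was sized from features_sample (87 columns) while the model was fitted on train_X (83 columns), so the two arrays cannot be added. A minimal sketch of the likely fix, assuming train_X is the matrix actually passed to fit():

import numpy as np

# Size the accumulator from the same matrix that is passed to fit(),
# so its length matches model.feature_importances_
feature_importances = np.zeros(train_X.shape[1])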
asked Nov 21 '18 by Ian Okeyo


People also ask

How does feature importance work in LightGBM?

We use StratifiedKFold to split our dataset into 5 folds, select one fold as the validation set, and train the model with early stopping using the remaining 4 folds as the training set. We then use this model to predict outcomes for the test set and record the predictions. This is repeated 5 times, so every fold serves as the validation set once.
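
A minimal sketch of that scheme, restricted to accumulating feature importances (X is assumed to be a pandas DataFrame of features and y a Series of binary labels; both names are placeholders). Recent LightGBM versions take early stopping as a callback; older versions accept early_stopping_rounds=100 in fit() instead.

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold

def cv_feature_importances(X, y, n_splits=5):
    # Sum split-based importances over the folds, then average
    importances = np.zeros(X.shape[1])
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, valid_idx in skf.split(X, y):
        model = lgb.LGBMClassifier(objective='binary', n_estimators=10000)
        model.fit(X.iloc[train_idx], y.iloc[train_idx],
                  eval_set=[(X.iloc[valid_idx], y.iloc[valid_idx])],
                  eval_metric='auc',
                  callbacks=[lgb.early_stopping(100)])
        importances += model.feature_importances_
    return importances / n_splits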

How do you determine the feature important in a decision tree?

Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated by the number of samples that reach the node, divided by the total number of samples. The higher the value the more important the feature.
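
As a small, self-contained illustration (dataset and tree depth chosen arbitrarily), scikit-learn exposes exactly this impurity-weighted measure as feature_importances_ on a fitted tree:

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# feature_importances_ sums, per feature, the impurity decrease at each split
# weighted by the fraction of samples reaching that node, then normalizes
for name, imp in sorted(zip(data.feature_names, tree.feature_importances_),
                        key=lambda t: -t[1])[:5]:
    print(f"{name}: {imp:.3f}")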

What is gain in LightGBM?

In LightGBM, the information gain is basically the difference between the entropy before and after a split. Entropy is a measure of uncertainty or randomness: the more randomness a variable has, the higher its entropy.
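
As a toy illustration of that definition (the counts are invented for the example), the gain of a split can be computed directly from the binary entropy:

import numpy as np

def entropy(p_pos):
    # Binary entropy in bits; 0 when the node is pure
    if p_pos in (0.0, 1.0):
        return 0.0
    return -(p_pos * np.log2(p_pos) + (1 - p_pos) * np.log2(1 - p_pos))

# Parent node: 10 positives, 10 negatives -> entropy 1.0
parent = entropy(0.5)
# Split into a pure left child (8 pos, 0 neg) and a right child (2 pos, 10 neg)
left, right = entropy(1.0), entropy(2 / 12)
# Information gain = parent entropy - weighted average of child entropies
gain = parent - (8 / 20 * left + 12 / 20 * right)
print(round(gain, 3))  # roughly 0.61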


1 Answer

An example of getting feature importance in LightGBM when using lgb.train (which returns a Booster):

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

def plotImp(model, X, num=20, fig_size=(40, 20)):
    # Booster.feature_importance() returns one value per column of X
    feature_imp = pd.DataFrame({'Value': model.feature_importance(), 'Feature': X.columns})
    plt.figure(figsize=fig_size)
    sns.set(font_scale=5)
    sns.barplot(x="Value", y="Feature",
                data=feature_imp.sort_values(by="Value", ascending=False)[0:num])
    plt.title('LightGBM Features (avg over folds)')
    plt.tight_layout()
    plt.savefig('lgbm_importances-01.png')
    plt.show()
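
A hedged usage sketch (X_train and y_train are placeholders, not defined in the answer): the function expects the Booster returned by lgb.train together with the feature DataFrame.

import lightgbm as lgb

# X_train: pandas DataFrame of features, y_train: matching labels (placeholders)
dtrain = lgb.Dataset(X_train, label=y_train)
booster = lgb.train({'objective': 'binary'}, dtrain, num_boost_round=100)

plotImp(booster, X_train, num=20)

Note that model.feature_importance() is the Booster method; the sklearn-style LGBMClassifier used in the question exposes the same numbers as the feature_importances_ attribute instead.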
answered Oct 24 '22 by rosefun