When I plot the feature importance, I get this messy plot. I have more than 7,000 variables. I understand the built-in function only selects the most important ones, but the final graph is unreadable. This is the complete code:
import numpy as np
import pandas as pd
df = pd.read_csv('ricerice.csv')
array=df.values
X = array[:,0:7803]
Y = array[:,7804]
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
seed=0
test_size=0.30
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=test_size, random_state=seed)
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X, Y)
import matplotlib.pyplot as plt
from matplotlib import pyplot
from xgboost import plot_importance
fig1=plt.gcf()
plot_importance(model)
plt.draw()
fig1.savefig('xgboost.png', figsize=(50, 40), dpi=1000)
Despite the size of the figure, the graph is illegible.
The XGBoost library provides a built-in function to plot features ordered by their importance. Features are automatically named according to their index in the feature importance graph.
Importance is calculated for a single decision tree by the amount that each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for.
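If you want the numbers behind that plot, the booster exposes them directly. A minimal sketch, assuming model is an already-fitted XGBClassifier like the one in the code below; the importance types shown are the ones xgboost itself supports:

booster = model.get_booster()
# 'weight' counts how often a feature is used to split, 'gain' is the average
# improvement brought by that feature's splits, 'cover' reflects the number of
# observations those splits are responsible for.
for importance_type in ('weight', 'gain', 'cover'):
    scores = booster.get_score(importance_type=importance_type)
    # scores is a dict such as {'f0': 12.0, 'f5': 3.0, ...}; features never used
    # in any tree are simply absent. If the model was fitted on a pandas
    # DataFrame, recent xgboost versions keep the column names instead of f0, f1, ...
    top5 = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]
    print(importance_type, top5)

plot_importance() also accepts the same importance_type argument if you prefer gain over the default weight.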
The importance matrix is actually a table: the first column contains the names of all the features actually used in the boosted trees, and the remaining columns hold the resulting 'importance' values calculated with different importance metrics [3]. Beyond the built-in feature importance, you can also use permutation-based importance or SHAP-based importance (see the sketch below).
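As an illustration of the permutation-based alternative, here is a minimal sketch assuming scikit-learn is installed and that model, X_test and y_test come from the code below; SHAP-based importance would need the separate shap package (e.g. shap.TreeExplainer):

from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure how much the test score drops.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
# importances_mean[i] is the mean score drop when feature i is permuted.
top = result.importances_mean.argsort()[::-1][:10]
for i in top:
    print(f"feature {i}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")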
There are a couple of points:
1. Fit the model on the training dataset (X_train, y_train), not on the entire dataset (X, Y).
2. Use the max_num_features parameter of the plot_importance() function to display only the top max_num_features features (e.g. the top 10).
With the above modifications to your code, and with some randomly generated data for demonstration, the code is as below:
import numpy as np
# generate some random data for demonstration purpose, use your original dataset here
X = np.random.rand(1000,100) # 1000 x 100 data
y = np.random.rand(1000).round() # 0, 1 labels
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
seed=0
test_size=0.30
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=test_size, random_state=seed)
model = XGBClassifier()
model.fit(X_train, y_train)
import matplotlib.pyplot as plt
from xgboost import plot_importance
plot_importance(model, max_num_features=10) # top 10 most important features
plt.show()
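If you also want to save the figure to a file at a readable size, note that the size belongs to the figure, not to savefig(). A minimal sketch reusing the fitted model above (file name and dimensions are just illustrative):

fig, ax = plt.subplots(figsize=(10, 8))            # set the size on the figure, not in savefig()
plot_importance(model, ax=ax, max_num_features=10) # plot into the prepared axes
fig.savefig('xgboost.png', dpi=300, bbox_inches='tight')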